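In a recent study, we needed to clean up our Twitter dumps: strip the leading hash sign (#) from hashtags, and drop URLs and @mentions along the way.
The Python script below proved handy, so we are sharing it with you all. It reads a "~"-delimited file and cleans the tweet text in the third column.
Please reuse it and let us know if you have any comments.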
Sample script:
# Python 3: urlparse lives in urllib.parse
from urllib.parse import urlparse
import csv

in_path = "C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\twitter.csv"
out_path = "C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\Tweets_Corrected.csv"

# Keep the output file open while looping so every cleaned row gets written
with open(in_path, newline="") as src, open(out_path, "w", newline="") as f1:
    csv_object = csv.reader(src, delimiter="~")
    count = 0
    for row in csv_object:
        count += 1
        new_string = ''
        print(row[2])
        var = row[2]
        for i in var.split():
            s, n, p, pa, q, f = urlparse(i)
            if s and n:
                # token has a scheme and a host, so it is a URL: drop it
                pass
            elif i[:1] == '@':
                # @mention: drop it
                pass
            elif i[:2] == '"@' or i[:2] == '.@':
                # quoted or dot-prefixed @mention: drop it
                pass
            elif i[:1] == '#':
                # removing the leading hashtags from twitter data
                new_string = new_string.strip() + ' ' + i[1:]
            else:
                new_string = new_string.strip() + ' ' + i
        new_string = new_string.strip()
        wr_str = row[0] + '~' + row[1] + '~' + new_string + '\n'
        f1.write(wr_str)
        print(new_string)
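The token filter in the script can also be pulled out into a small function, which makes it easy to try on a single tweet without the CSV files. This is only a sketch of the same rules, assuming Python 3; the function name `clean_tweet` and the sample tweet are made up for illustration.

```python
from urllib.parse import urlparse


def clean_tweet(text):
    """Drop URLs and @mentions, and strip the leading '#' from hashtags."""
    kept = []
    for token in text.split():
        parsed = urlparse(token)
        if parsed.scheme and parsed.netloc:
            continue  # token looks like a URL (scheme and host present)
        if token.startswith(('@', '"@', '.@')):
            continue  # plain, quoted, or dot-prefixed mention
        if token.startswith('#'):
            token = token[1:]  # keep the hashtag word, drop the '#'
        if token:
            kept.append(token)
    return ' '.join(kept)


# Hypothetical sample tweet, not taken from the original data
print(clean_tweet('.@bob loving #python today http://t.co/abc #data'))
# prints: loving python today data
```

Collecting the kept tokens in a list and joining them once avoids the repeated strip-and-concatenate step, and leaves no stray leading space in the result.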
Like us on Google+ or Facebook if you like our posts.