Tuesday 13 January 2015

Removing leading hashtags from Twitter data - Sample Python script

In a recent study, we needed to strip the leading "#" from the hashtags in our Twitter dumps.

The Python script below proved handy, so we are sharing it. Besides stripping the "#", it drops URLs and @mentions from each tweet.
Please reuse it and let us know if you have any comments.

Sample script:

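import csv
from urllib.parse import urlparse

INPUT_PATH = "C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\twitter.csv"
OUTPUT_PATH = "C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\Tweets_Corrected.csv"

# Assumption: both files are UTF-8 encoded; adjust the encoding if yours differ.
with open(INPUT_PATH, encoding="utf-8", newline="") as infile, \
        open(OUTPUT_PATH, "w", encoding="utf-8", newline="") as outfile:
    # The input is "~"-delimited; the tweet text is in the third column.
    csv_object = csv.reader(infile, delimiter="~")

    for row in csv_object:
        new_string = ''
        print(row[2])
        var = row[2]
        for i in var.split():
            parts = urlparse(i)
            if parts.scheme and parts.netloc:
                # skip URLs
                pass
            elif i[:1] == '@':
                # skip @mentions
                pass
            elif i[:2] == '"@' or i[:2] == '.@':
                # skip mentions preceded by a quote or a dot
                pass
            elif i[:1] == '#':
                # remove the leading '#' but keep the hashtag word
                new_string = new_string.strip() + ' ' + i[1:]
            else:
                new_string = new_string.strip() + ' ' + i

        # write the first two columns unchanged, followed by the cleaned tweet
        wr_str = row[0] + '~' + row[1] + '~' + new_string.strip() + '\n'
        outfile.write(wr_str)

        print(new_string)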
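
To try the per-word filtering without the CSV files, the same rules can be pulled into a small function and run on an in-memory string. This is only a minimal sketch, not part of the script we used: the function name clean_tweet and the sample tweet are made up for illustration, and it assumes Python 3, where urlparse lives in urllib.parse.

```python
from urllib.parse import urlparse


def clean_tweet(text):
    """Drop URLs and @mentions, and strip the leading '#' from hashtags.

    Hypothetical helper that mirrors the rules of the sample script.
    """
    kept = []
    for token in text.split():
        parts = urlparse(token)
        if parts.scheme and parts.netloc:
            continue  # token looks like a URL, e.g. http://...
        if token[:1] == '@' or token[:2] in ('"@', '.@'):
            continue  # mention, optionally preceded by a quote or a dot
        if token[:1] == '#':
            token = token[1:]  # keep the hashtag word, drop the '#'
        kept.append(token)
    return ' '.join(kept)


# Made-up sample tweet, for illustration only.
sample = 'RT @user: Loving #Python for #data work http://example.com/post'
print(clean_tweet(sample))  # RT Loving Python for data work
```

Only the first character of a word is checked, so a "#" in the middle of a word is left alone, and trailing punctuation on a hashtag (for example "#data,") stays attached to the word.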

