Friday 16 August 2013

First Step towards Text Mining - A Python Solution to Find Combination Variables in a Community Detection Problem

With so much text being generated every day, it is becoming essential to read text logs.
In most chats, lines are separated by a newline ('\n'), but we may want to choose a different separator as needed.

So I wrote the two classes below. The first reads a file and extracts the text between separators; the second searches for a given text in other files.
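As a minimal sketch of the idea of reading text between separators (this is not the class listed further down; the function name `fields_between` and the sample string are made up for illustration), plain `str.split` is enough:

```python
def fields_between(text, sep="\n"):
    """Return the non-empty pieces of text found between separators."""
    pieces = (piece.strip() for piece in text.split(sep))
    return [piece for piece in pieces if piece]


# Hypothetical chat record that uses "|" instead of a newline as the separator
chat = "Saturday Meetup|Downtown|see you at 6|"
print(fields_between(chat, sep="|"))
# ['Saturday Meetup', 'Downtown', 'see you at 6']
```

Passing the separator as an argument keeps the same function usable for newline-separated chats and for any other delimiter.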

For example: 

Suppose we are searching all the chat conversations for "Saturday Meetup", combined with the location "Downtown".
We then search for these two variables occurring together in other chats. The users of those chats can be placed in the same community in a community model, so they form similar groups for targeted interaction (sending gift passes, coupons, etc.).
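The grouping step described above could look roughly like the sketch below. It assumes the chats are already loaded into a dictionary keyed by user name; the function name `users_mentioning_all` and the sample data are hypothetical and only illustrate the co-occurrence check.

```python
def users_mentioning_all(chats, keywords):
    """Return the sorted users whose chat text contains every keyword."""
    return sorted(
        user
        for user, text in chats.items()
        if all(keyword in text for keyword in keywords)
    )


# Hypothetical chat logs keyed by user name (illustrative data only)
chats = {
    "user_a": "Saturday Meetup at Downtown, who is in?",
    "user_b": "Saturday Meetup sounds good",
    "user_c": "Downtown works, see you at the Saturday Meetup",
}

community = users_mentioning_all(chats, ["Saturday Meetup", "Downtown"])
print(community)  # ['user_a', 'user_c']
```

Only users whose chats contain both variables end up in the candidate community; a user who mentions just one of them is left out.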

Sample Methods for the task:

#!/usr/bin/env python

#----------------------------------------------------#
# Project : Python text-modification library         #
# Team: Data Sciences                                #
# Author :                             #
# Date: 23 June, 2013                                #
# Description: Generic classes based on Python       #
#              Each class has the usage and function #
#              defined in its header section         #
#----------------------------------------------------#


#------------------------------------------------------#
# Generic class to read all fields separated           #
# by a separator, for example: comma (,)               #
# Initiate the class using the syntax below:           #
# x = DataBetweenSeparator(Script1, Separator)         #
# Call the method with:                                #
# x.split(Script1, Separator)                          #
#------------------------------------------------------#


class DataBetweenSeparator(object):
    """Write each separator-terminated field of a file on its own line."""

    def __init__(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        print("DataBetweenSeparator class initiated")

    def split(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        output_name = scriptname + "_tmp"

        print("The source file for parsing is", scriptname)
        print("The new file is", output_name)

        # Open both files in text mode so that strings can be written directly
        with open(scriptname, "r") as sourcefile, open(output_name, "w") as runfile:
            for block in sourcefile:
                # Keep only the fields that are followed by a separator;
                # the text after the last separator in a line is not written.
                for field in block.split(sep)[:-1]:
                    # Remove all spaces and put each field on its own line
                    cleaned = field.replace(' ', '') + "\n"
                    runfile.write(cleaned)
                    print(cleaned)
        return 0

#----------------------------------------#
# Find the variables in the script       #
#----------------------------------------#

class FindKeyword(object):
    """Report whether a keyword occurs anywhere in a file."""

    def __init__(self, variable, script):
        self.variable = variable
        self.script = script
        print("FindKeyword class initiated")

    def search(self, keyword, script):
        self.keyword = keyword
        self.script = script
        # Drop newlines so the keyword can match inside a single line
        pattern = keyword.replace('\n', '')
        print("Pattern being searched", pattern)
        print("Filename is", script)

        # Append the outcome to Report.txt as "<pattern>|True" or "<pattern>|False"
        with open(script, "r") as sourcefile, open("Report.txt", "a") as reportfile:
            for line in sourcefile:
                if pattern in line:
                    reportfile.write(pattern + "|True" + "\n")
                    return True
            reportfile.write(pattern + "|False" + "\n")
        return False
