Friday, 16 August 2013

1st Step towards Text-mining - A python solution to find combination variables in Community detection problem

With so much texts being generated everyday, it is now becoming absolutely important to read the text logs.
Mostly, chats will have lines separated by Newline ('\n'). But, we may want to decide the separator as necessary.

So, I wrote the below 2 scripts that read files to read text between the Separator. It can also search for the text in other files.

For example: 

We are searching for "Saturday Meetup" in all the chat conversations. Combined with the location "Downtown". 
Now, search these 2 variables occurring together in other chats. These users can be combined into the same Community in a Community model. So, they form similar groups for Targeted Interaction (sending gift-pass, coupons etc.)

Sample Methods for the task:

#! /bin/ksh/python

#----------------------------------------------------#
# Project : Python text-modification library         #
# Team: Data Sciences                                #
# Author :                             #
# Date: 23 June, 2013                                #
# Description: Generic classes based on Python       #
#              Each class has the usage and function #
#        defined in header section             #
#----------------------------------------------------#


#------------------------------------------------------#
# Generic class to read read all fields separated      #
# by a separator , for example: comma(,)               #
# Initiate the class using below syntax:               #
# x = DataBetweenSeparator(Script1, Separator)   #
# Use the method using:        #
# x.split(Script1, Separator)         #
#------------------------------------------------------#


# Call necessary libraries

class DataBetweenSeparator(object):
def __init__(self,scriptname,sep):
print "DataBetweenSeparator class initiated"
def split(self,scriptname,sep):
import re
import sys
import os.path
self.scriptname = scriptname
self.sep = sep
Sep = self.sep
ScriptName = self.scriptname
#SysArray = sys.argv
#ScriptName = SysArray[1]
ScriptName_tmp = ScriptName + "_tmp"

print "The source file for parsing is", ScriptName
print "The new file is", ScriptName_tmp

sourcefile = open(ScriptName, "r")
runfile = open(ScriptName_tmp, "wb")

for block in sourcefile:
iter = block.count(Sep) 
#print "Separators in the current block", iter
count = iter

while count > 0:
# count = iter
    runfile.write((block.split(Sep)[iter-count]+ "\n").replace(' ', ''))
    print (block.split(Sep)[iter-count]+ "\n").replace(' ', '')
count = count - 1
   
sourcefile.close()
runfile.close()
return 0

#----------------------------------------#
# Find the variables in the script       #
#----------------------------------------#

class FindKeyword(object):
def __init__(self,variable,script):
import re
         import sys
         import os.path
#SysArray1 = sys.argv
self.variable = variable
self.script = script
Pattern1 = self.variable
script = self.script
print "FindKeyword class initiated"
#ScriptName = SysArray[2]
def search(self,keyword,script):
self.keyword = keyword
self.script = script
Pattern1_tmp = self.keyword
Pattern1 = Pattern1_tmp.replace('\n','')
script = self.script
print "Pattern being searched", Pattern1
print "Filename is", script
sourcefile = file(script)
reportfile = open("Report.txt", "a")
for line in sourcefile:
#print "Search Result", line.find(Pattern1)
#if Pattern1 in line:
if Pattern1 in line:
reportfile.write(Pattern1 + "|True" + "\n")
reportfile.close()
return True 
reportfile.write(Pattern1 + "|False" + "\n")
return False
#return 0
#return Pattern1

Friday, 2 August 2013

Effective design of BI jobs - reducing manual activity while doing retrospective loading in teradata. Using Python to pass date as a parameter to Control-M

Many a times, I come across designers/developers who use "Current_Date" keyword in their queries.
So, their typical queries look like 

Where TransactionDate > Current_Date - 1

Now even though this is a valid Teradata syntax, but that's all.

When you move this code to production and then it fails for a few days, we will have to manually calculate the date-range and run the queries. A moving window will need multiple iterations of manual tasks, especially if we have UPSERT or MERGE statements involved.

Alternative Solution:

We can use Python and Teradata SQL to overcome this problem.

Let us parameterize the date-range in Teradata SQL as :

Where TransactionDate > Current_Date - 1

replaced by

Where TransactionDate > ?ODATE - 1

The below Python script is generic. It can be used to generate date-range files, that be run from a Scheduling tool like Control-M or cronjob.

Copy the below script in a file and name it as ReplaceDate.py

Calling the script:

$>  python ReplaceDate.py ODATE /analytics/advanced/sql/Test.Btq

where ODATE = Date passed from Control-M in format 'yyyy-mm-dd'

Generic Python Script:


#! /usr/bin/python
#--------------------------------------------------------------------------------------#
# Please ensure that you pass 2 parameters                       #
# Sample call : python ImportDate.py 2013-07-25 Test.Btq   #
# Instruction : Ensure that the Test.Btq is in the same         #
#              path as this script                                             #
# Author : Senior Bigdata Engineer                                    #
#------------------------------------------------------------------------------------#
# Call necessary libraries
import re
import sys
import os.path

SysArray = sys.argv
#print (SysArray[1])
print "No. of parameters", (len(SysArray))

Value = "'" + SysArray[1] + "'"
ODATE1 =(Value, '?ODATE')
ParamCount = len(SysArray)
ScriptName = SysArray[2]
ScriptName_tmp = ScriptName + "_tmp"
#print "The temp script is",ScriptName_tmp
if ( ParamCount != 3 ):
print ("Enter a valid Date and BTEQ scriptname")
exit

#----------------------------------------------------#
# This block of code keeps track of the current date #
#----------------------------------------------------#
from time import gmtime,strftime
ODATE2=strftime("%Y-%m-%d",gmtime())
print "ODATE for the run is", ODATE1[0]
print "Current Date is", ODATE2

#------------------------------------------------#
# Define function to replace ODATE in the BTEQ   #
#------------------------------------------------#
if os.path.exists(ScriptName):
print "File " +ScriptName +" exists"
print "Continue with replacement"
else:
print "File " +ScriptName +" does not exist"
sys.exit(1)
#----------------------------------------------------#
# This block creates a list of values to be replaced #
#----------------------------------------------------#

#import re
print "ODATE variable to be replaced", ODATE1[1]
print "Value to be replaced with", ODATE1[0]
sourcefile = open(ScriptName, "r")
runfile = open(ScriptName_tmp, "wb")
#------------------------------------------------------#
# Read each line of sourcefile and replace the pattern #
#------------------------------------------------------#

for Pattern1 in sourcefile:
#print Pattern1.replace(ODATE1[1],ODATE1[0])
runfile.write(Pattern1.replace(ODATE1[1],ODATE1[0]))
#print Pattern1
sourcefile.close()
runfile.close()
#ReplaceOdate(ScriptName)
#import os.path
if os.path.exists(ScriptName_tmp):
print "The temp script is",ScriptName_tmp
sys.exit(0)
else:
print "Please check the process. Temp file not created"
sys.exit(1)

Output of the Script:

Import multiple jobs with different ODATE in the AJF of Control-M. So, each time the ReplaceDate.py script generates files for particular ODATE.
A Test.Btq_tmp will be created where the ?ODATE variable will be replaced by '2013-08-02'

Now run the Test.Btq_tmp file using Control-M in the desired sequence.

Like us if you find the blog helpful. Also, please leave a comment about what else can be improved.




Tuesday, 25 June 2013

Handling Large data volumes - PIG basics and advantages

Hadoop PIG is very efficient in handling large volumes of Data. We could effectively join files, saving a lot of CPU cycles from Teradata server.

It operates in  2 modes - Local and MapReduce.

To invoke the local mode, we type the command as:

bash$> pig -x local

The default mode is Map-Reduce mode.

bash$> pig 

will invoke PIG in the Map Reduce mode.

Running any pig script can be accomplished as below:

bash$> pig TestScript.pig

or 

After logging into Pig:

grunt> exec TestScript.pig


We will follow up with PIG commands and how PIG can be combined with Teradata to give great performance improvements.

Tuesday, 18 June 2013

Loading XML files using Fastload - Using URL to access front-end data and moving data to Teradata

1. A standard xml format

Let us assume that below data is stored in Test.xml:

 <Response>
                <ResponseID>R_3JHyAiOCTXKq</ResponseID>
                <ResponseSet>Default Response Set</ResponseSet>
                <Name>Anonymous</Name>
                <ExternalDataReference></ExternalDataReference>
                <EmailAddress></EmailAddress>
                <IPAddress>96.224.174.233</IPAddress>
                <Status>0</Status>
                <StartDate>2013-01-11 13:55:53</StartDate>
                <EndDate>2013-01-11 13:56:18</EndDate>
                <Finished>1</Finished>
                <Flag></Flag>
 </Response>

So, anything that comes between <Response> and </Response> is our data set.
Each field data is represented within its corresponding tags.

For example : <ResponseID>R_3JHyAiOCTXKq</ResponseID>

So, now to parse this xml, we will have to write the corresponding xslt file

Let us now write the corresponding xslt file called Test.xslt

2. The sample xslt file for the above xml data will be as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">

<xsl:for-each select="xml/Response">
<xsl:value-of select="ResponseID"/>|<xsl:value-of select="ResponseSet"/>|<xsl:value-of select="Name"/>|<xsl:value-of select="ExternalDataReference"/>|<xsl:value-of select="EmailAddress"/>|<xsl:value-of select="IPAddress"/>|<xsl:value-of select="Status"/>|<xsl:value-of select="StartDate"/>|<xsl:value-of select="EndDate"/>|<xsl:value-of select="Finished"/>|<xsl:value-of select="Flag"/>|
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

3. Extract the xml data into a flat file

xsltproc is the command line processor for xml

-o = redirect to output file. Otherwise, data will be redirected to standard output.
Place the Test.xml , Test.xslt in the same directory

Let us now run the below command:

xsltproc Test.xslt Test.xml -o Test.txt

The file Test.txt will contain "|" (pipe delimited) data parsed from the Test.xml file based on the Test.xslt layout file.

Testing row counts:

Data2=`wc -l Test.txt|  cut -d ' ' -f1`

Data1=`grep -i "</Response>" Test.xml | wc -l`

If $Data1 == $Data2, then all the rows were successfully parsed and record count in Test.txt and Test.xml match.




Let us know if this information helped. You could donate for a noble cause.

Also, help us understand what topics would you like covered in the future blogs.

Friday, 14 June 2013

Components of Integration Layer Design - Key considerations to ensure Data Sanity from Source

The Integration Layer marks the transition from raw data to integrated data. In this layer, the raw data is consolidated, duplicate records and values are removed, and disparate sources combined into a single version. 
After the data is passed through the Integration layer, it is consistent and consumable for the DW and BI applications.

Key considerations for Integration Layer processing:


  1. For data with multiple sources, a Master Data Management process should be in place. See the below link for details:                                                                                                                                                     Master_data_management - Benefits and Issues                                                                                               
  2. To have source trace-ability, data should be stored with the corresponding source information. One of the major when an Organization grows is the duplication of Data, due to multiple sources. Data de-duplication is a tedious job. Having Source information at a row-level eases the job to an extent.
  3. To know the user who modified or inserted the data, we should have AlterUserID and InsertTimeStamp columns. We can use the below SQL in Teradata to populate these 2 columns.

 INSERT INTO DIMENSION1
(
SourceCode
,AlterUserID
,InsertTimeStamp
)
Select
'POS Systems' 
,USER
,CURRENT_TIMESTAMP(0);

Making SourceCode, AlterUserID and InsertTimeStamp as NOT NULL columns will ensure that we can always track back changes to a Soure, User and Date.

    4. Updating the Logical and Physical Data Models on a regular basis is a must to keep track of the changes. Data Models are important to understand the flow of data.

This will be the first step in creating Stellar Data, which is clean and traceable.

Master Data Management, Data Models and Storing Source and User information at row-level should help us start building low-maintenance warehouses.


Source: Wikipedia, Teradata Certification

Let us know your comments and suggestions. Please Donate for a cause if you are helped !!