Tuesday 25 June 2013

Handling Large data volumes - PIG basics and advantages

Hadoop PIG is very efficient at handling large volumes of data. For example, we could join large files in PIG itself, saving a lot of CPU cycles on the Teradata server (a small join sketch is shown towards the end of this post).

It operates in 2 modes - Local and MapReduce.

To invoke the local mode, we type the command as:

bash$> pig -x local

The default mode is MapReduce mode.

bash$> pig

will invoke PIG in MapReduce mode.

Any PIG script can be run as shown below:

bash$> pig TestScript.pig

or 

After logging into Pig:

grunt> exec TestScript.pig
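
For illustration, a hypothetical TestScript.pig that performs the kind of file join mentioned earlier could look like the sketch below (the file names, delimiters and column layouts are only placeholders):

-- Load two pipe-delimited extracts (hypothetical files and schemas)
orders    = LOAD 'orders.txt'    USING PigStorage('|') AS (order_id:chararray, cust_id:chararray, amount:double);
customers = LOAD 'customers.txt' USING PigStorage('|') AS (cust_id:chararray, cust_name:chararray);

-- Join on the customer key; the heavy lifting happens in MapReduce, not on Teradata
joined = JOIN orders BY cust_id, customers BY cust_id;

-- Store the joined result as a pipe-delimited file, ready to be loaded into Teradata
STORE joined INTO 'joined_output' USING PigStorage('|');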


We will follow up with more PIG commands and how PIG can be combined with Teradata to deliver significant performance improvements.

Tuesday 18 June 2013

Loading XML files using Fastload - Using URL to access front-end data and moving data to Teradata

1. A standard xml format

Let us assume that the below record is stored in Test.xml. In the full file, multiple such <Response> records are assumed to be wrapped inside a root <xml> element, which is what the xslt below refers to:

 <Response>
                <ResponseID>R_3JHyAiOCTXKq</ResponseID>
                <ResponseSet>Default Response Set</ResponseSet>
                <Name>Anonymous</Name>
                <ExternalDataReference></ExternalDataReference>
                <EmailAddress></EmailAddress>
                <IPAddress>96.224.174.233</IPAddress>
                <Status>0</Status>
                <StartDate>2013-01-11 13:55:53</StartDate>
                <EndDate>2013-01-11 13:56:18</EndDate>
                <Finished>1</Finished>
                <Flag></Flag>
 </Response>

So, anything that comes between <Response> and </Response> is one record of our data set.
Each field's value is enclosed within its corresponding tags.

For example : <ResponseID>R_3JHyAiOCTXKq</ResponseID>

So, to parse this xml, we will have to write a corresponding xslt file.

Let us now write that xslt file and call it Test.xslt

2. The sample xslt file for the above xml data will be as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">

<xsl:for-each select="xml/Response">
<xsl:value-of select="ResponseID"/>|<xsl:value-of select="ResponseSet"/>|<xsl:value-of select="Name"/>|<xsl:value-of select="ExternalDataReference"/>|<xsl:value-of select="EmailAddress"/>|<xsl:value-of select="IPAddress"/>|<xsl:value-of select="Status"/>|<xsl:value-of select="StartDate"/>|<xsl:value-of select="EndDate"/>|<xsl:value-of select="Finished"/>|<xsl:value-of select="Flag"/>|
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

3. Extract the xml data into a flat file

xsltproc is a command-line processor that applies xslt stylesheets to xml files.

-o : write the output to the given file. Otherwise, the output goes to standard output.
Place Test.xml and Test.xslt in the same directory.

Let us now run the below command:

xsltproc Test.xslt Test.xml -o Test.txt

The file Test.txt will contain pipe-delimited ("|") data parsed from the Test.xml file based on the Test.xslt layout file.
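
For the single sample record shown above, the corresponding line in Test.txt would look like:

R_3JHyAiOCTXKq|Default Response Set|Anonymous|||96.224.174.233|0|2013-01-11 13:55:53|2013-01-11 13:56:18|1||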

Testing row counts:

Data2=`wc -l Test.txt | cut -d ' ' -f1`

Data1=`grep -i "</Response>" Test.xml | wc -l`

If $Data1 equals $Data2, then all the rows were parsed successfully and the record counts in Test.xml and Test.txt match.
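
Putting this check into a small shell sketch:

if [ "$Data1" -eq "$Data2" ]
then
    echo "Parsing successful - $Data1 records in Test.xml and Test.txt"
else
    echo "Record count mismatch - Test.xml: $Data1, Test.txt: $Data2"
fi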


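Finally, to move the parsed data into Teradata as the title suggests, a minimal Fastload sketch could look like the one below. The logon string, database, table name and column sizes are placeholders, and the staging table keeps everything as VARCHAR for simplicity:

LOGON tdpid/username,password;

/* Assumes StageDB.Response_Stg does not already exist;
   Fastload needs an empty target table and fresh error tables */
CREATE TABLE StageDB.Response_Stg
(
 ResponseID            VARCHAR(50)
,ResponseSet           VARCHAR(100)
,Name                  VARCHAR(100)
,ExternalDataReference VARCHAR(100)
,EmailAddress          VARCHAR(100)
,IPAddress             VARCHAR(20)
,Status                VARCHAR(5)
,StartDate             VARCHAR(30)
,EndDate               VARCHAR(30)
,Finished              VARCHAR(5)
,Flag                  VARCHAR(10)
)
PRIMARY INDEX (ResponseID);

/* Input is the pipe-delimited Test.txt produced by xsltproc */
SET RECORD VARTEXT "|";

DEFINE
 ResponseID            (VARCHAR(50)),
 ResponseSet           (VARCHAR(100)),
 Name                  (VARCHAR(100)),
 ExternalDataReference (VARCHAR(100)),
 EmailAddress          (VARCHAR(100)),
 IPAddress             (VARCHAR(20)),
 Status                (VARCHAR(5)),
 StartDate             (VARCHAR(30)),
 EndDate               (VARCHAR(30)),
 Finished              (VARCHAR(5)),
 Flag                  (VARCHAR(10))
FILE = Test.txt;

BEGIN LOADING StageDB.Response_Stg
ERRORFILES StageDB.Response_Stg_ET, StageDB.Response_Stg_UV;

INSERT INTO StageDB.Response_Stg VALUES
(
 :ResponseID
,:ResponseSet
,:Name
,:ExternalDataReference
,:EmailAddress
,:IPAddress
,:Status
,:StartDate
,:EndDate
,:Finished
,:Flag
);

END LOADING;
LOGOFF;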


Let us know if this information helped. You could donate for a noble cause.

Also, help us understand what topics you would like covered in future blogs.

Friday 14 June 2013

Components of Integration Layer Design - Key considerations to ensure Data Sanity from Source

The Integration Layer marks the transition from raw data to integrated data. In this layer, the raw data is consolidated, duplicate records and values are removed, and disparate sources are combined into a single version.
After the data is passed through the Integration layer, it is consistent and consumable for the DW and BI applications.

Key considerations for Integration Layer processing:


  1. For data with multiple sources, a Master Data Management process should be in place. See the following post for details: Master_data_management - Benefits and Issues
  2. To have source traceability, data should be stored with the corresponding source information. One of the major issues as an Organization grows is the duplication of Data due to multiple sources. Data de-duplication is a tedious job. Having Source information at a row-level eases the job to an extent.
  3. To know the user who modified or inserted the data, we should have AlterUserID and InsertTimeStamp columns. We can use the below SQL in Teradata to populate these columns along with the SourceCode.

INSERT INTO DIMENSION1
(
 SourceCode
,AlterUserID
,InsertTimeStamp
)
SELECT
 'POS Systems'
,USER
,CURRENT_TIMESTAMP(0);

Making SourceCode, AlterUserID and InsertTimeStamp NOT NULL columns will ensure that we can always trace changes back to a Source, User and Date (see the table sketch after this list).

  4. Updating the Logical and Physical Data Models on a regular basis is a must to keep track of the changes. Data Models are important to understand the flow of data.
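
As mentioned in point 3, a minimal sketch of such a dimension table in Teradata could look like the one below (the business columns are only placeholders; the audit columns are the ones discussed above):

CREATE TABLE DIMENSION1
(
 Dimension1_Key   INTEGER NOT NULL            /* placeholder surrogate key */
,Dimension1_Desc  VARCHAR(100)                /* placeholder business attribute */
,SourceCode       VARCHAR(50)  NOT NULL       /* which source system the row came from */
,AlterUserID      VARCHAR(30)  NOT NULL       /* user who inserted/modified the row */
,InsertTimeStamp  TIMESTAMP(0) NOT NULL       /* when the row was inserted */
)
PRIMARY INDEX (Dimension1_Key);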

This will be the first step in creating Stellar Data, which is clean and traceable.

Master Data Management, Data Models and Storing Source and User information at row-level should help us start building low-maintenance warehouses.


Source: Wikipedia, Teradata Certification

Let us know your comments and suggestions. Please Donate for a cause if you are helped !!