Tuesday, 25 June 2013
Handling Large data volumes - PIG basics and advantages
Hadoop PIG is very efficient at handling large volumes of data. We could effectively join files in PIG, saving a lot of CPU cycles on the Teradata server.
PIG operates in two modes: local and MapReduce.
To invoke local mode, we type:
bash$> pig -x local
The default mode is MapReduce mode:
bash$> pig
will invoke PIG in MapReduce mode.
Any PIG script can be run as shown below:
bash$> pig TestScript.pig
or
After logging into Pig:
grunt> exec TestScript.pig
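For illustration, here is a minimal sketch of what a TestScript.pig might contain: it joins two pipe-delimited files in PIG instead of on the Teradata server. The file paths, schemas and field names are assumptions for the example, not part of any real system.
-- Load two pipe-delimited files (paths and schemas are illustrative)
orders = LOAD '/data/orders.txt' USING PigStorage('|')
    AS (order_id:chararray, cust_id:chararray, amount:double);
customers = LOAD '/data/customers.txt' USING PigStorage('|')
    AS (cust_id:chararray, cust_name:chararray);
-- Join on the customer key; the heavy lifting happens in Hadoop, not Teradata
joined = JOIN orders BY cust_id, customers BY cust_id;
-- Keep only the columns we need and store the result, pipe delimited
result = FOREACH joined GENERATE orders::order_id, customers::cust_name, orders::amount;
STORE result INTO '/data/order_customer_join' USING PigStorage('|');
The joined output can then be loaded into Teradata, so the join cost stays in Hadoop.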
We will follow up with more PIG commands and show how PIG can be combined with Teradata to deliver significant performance improvements.
Labels:
big data handling,
example,
hadoop,
mapreduce and local mode in pig,
PIG,
teradata
Tuesday, 18 June 2013
Loading XML files using Fastload - Using URL to access front-end data and moving data to Teradata
1. A standard XML format
Let us assume that the data below is stored in Test.xml:
<Response>
<ResponseID>R_3JHyAiOCTXKq</ResponseID>
<ResponseSet>Default Response Set</ResponseSet>
<Name>Anonymous</Name>
<ExternalDataReference></ExternalDataReference>
<EmailAddress></EmailAddress>
<IPAddress>96.224.174.233</IPAddress>
<Status>0</Status>
<StartDate>2013-01-11 13:55:53</StartDate>
<EndDate>2013-01-11 13:56:18</EndDate>
<Finished>1</Finished>
<Flag></Flag>
</Response>
So, anything that comes between <Response> and </Response> is our data set.
Each field's data is represented within its corresponding tags.
For example: <ResponseID>R_3JHyAiOCTXKq</ResponseID>
To parse this XML, we have to write a corresponding XSLT file. Let us call it Test.xslt.
2. The sample XSLT file for the above XML data is as follows:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="//Response">
<xsl:value-of select="ResponseID"/>|<xsl:value-of select="ResponseSet"/>|<xsl:value-of select="Name"/>|<xsl:value-of select="ExternalDataReference"/>|<xsl:value-of select="EmailAddress"/>|<xsl:value-of select="IPAddress"/>|<xsl:value-of select="Status"/>|<xsl:value-of select="StartDate"/>|<xsl:value-of select="EndDate"/>|<xsl:value-of select="Finished"/>|<xsl:value-of select="Flag"/>|
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
3. Extract the XML data into a flat file
xsltproc is a command-line XSLT processor.
-o <file> writes the output to the given file; otherwise, the output goes to standard output.
Place Test.xml and Test.xslt in the same directory.
Let us now run the command below:
xsltproc -o Test.txt Test.xslt Test.xml
The file Test.txt will now contain pipe-delimited ("|") data parsed from Test.xml based on the Test.xslt layout file.
Testing row counts:
Data2=`wc -l < Test.txt`
Data1=`grep -ic "</Response>" Test.xml`
if [ "$Data1" -eq "$Data2" ]; then
  echo "Counts match: all rows were parsed"
fi
If the two counts are equal, all rows were successfully parsed and the record counts in Test.txt and Test.xml match.
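To finish the flow named in the title, the pipe-delimited Test.txt can then be loaded into Teradata with FastLoad. Below is a minimal FastLoad script sketch, not a production script: the TDPID, logon credentials and the staging table MYDB.RESPONSE_STG (assumed to already exist, empty, with matching VARCHAR columns) are placeholders for illustration.
SESSIONS 4;
LOGON tdpid/username,password;

SET RECORD VARTEXT "|";

DEFINE ResponseID            (VARCHAR(30))
      ,ResponseSet           (VARCHAR(100))
      ,Name                  (VARCHAR(100))
      ,ExternalDataReference (VARCHAR(100))
      ,EmailAddress          (VARCHAR(100))
      ,IPAddress             (VARCHAR(20))
      ,Status                (VARCHAR(10))
      ,StartDate             (VARCHAR(30))
      ,EndDate               (VARCHAR(30))
      ,Finished              (VARCHAR(10))
      ,Flag                  (VARCHAR(20))
FILE = Test.txt;

BEGIN LOADING MYDB.RESPONSE_STG ERRORFILES MYDB.RESPONSE_ERR1, MYDB.RESPONSE_ERR2;

INSERT INTO MYDB.RESPONSE_STG
VALUES (:ResponseID, :ResponseSet, :Name, :ExternalDataReference, :EmailAddress,
        :IPAddress, :Status, :StartDate, :EndDate, :Finished, :Flag);

END LOADING;
LOGOFF;
Note that the XSLT above writes a trailing "|" after the last field; depending on your FastLoad settings you may need to strip it first (for example, sed 's/|$//' Test.txt > Test_clean.txt) or define an extra filler field.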
Let us know if this information helped. You could donate for a noble cause.
Also, help us understand what topics you would like covered in future blogs.
Labels:
API download,
fastload,
loading xml into teradata,
sample script for xml conversion,
URL loading,
xml to flat file conversion using xslt
Friday, 14 June 2013
Components of Integration Layer Design - Key considerations to ensure Data Sanity from Source
The Integration Layer marks the transition from raw data to integrated data. In this layer, the raw data is consolidated, duplicate records and values are removed, and disparate sources are combined into a single version of the data.
After the data is passed through the Integration layer, it is consistent and consumable for the DW and BI applications.
Key considerations for Integration Layer processing:
- For data with multiple sources, a Master Data Management process should be in place. See the below link for details: Master_data_management - Benefits and Issues
- To have source traceability, data should be stored with the corresponding source information. One of the major problems as an organization grows is the duplication of data due to multiple sources. Data de-duplication is a tedious job; having source information at row level eases the job to an extent.
- To know which user modified or inserted the data, we should have AlterUserID and InsertTimeStamp columns. We can use the SQL below in Teradata to populate these audit columns along with SourceCode.
INSERT INTO DIMENSION1
(
SourceCode
,AlterUserID
,InsertTimeStamp
)
SELECT
'POS Systems'
,USER
,CURRENT_TIMESTAMP(0);
Making SourceCode, AlterUserID and InsertTimeStamp NOT NULL columns ensures that we can always trace changes back to a source, user and date (see the table sketch after this list).
- Updating the Logical and Physical Data Models on a regular basis is a must to keep track of the changes. Data Models are important to understand the flow of data.
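As an illustration of the NOT NULL audit columns above, a dimension table could be sketched as follows; the table name DIMENSION1 matches the insert above, while the Dimension1_Code column and the data types are assumptions for the example:
CREATE MULTISET TABLE DIMENSION1
(
 Dimension1_Code  VARCHAR(50)
,SourceCode       VARCHAR(30)  NOT NULL
,AlterUserID      VARCHAR(30)  NOT NULL
,InsertTimeStamp  TIMESTAMP(0) NOT NULL
)
PRIMARY INDEX (Dimension1_Code);
With the three audit columns declared NOT NULL, any load that omits them is rejected, which enforces the traceability described above.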
This will be the first step in creating Stellar Data, which is clean and traceable.
Master Data Management, Data Models and Storing Source and User information at row-level should help us start building low-maintenance warehouses.
Source: Wikipedia, Teradata Certification
Let us know your comments and suggestions. Please donate for a cause if this helped!!
Labels:
data model,
example,
Integration layer design of data warehouse,
master data management,
teradata certification