Monday 2 September 2013

Access Teradata from Python - Generic code to run any Teradata SQL using Python

We will use the BTEQ utility to connect to Teradata. BTEQ is supported on the following platforms:
  • NCR UNIX SVR4 MP-RAS
  • IBM z/OS (MVS and USS)
  • z/OS VM
  • Microsoft Windows 2000, XP, and Server 2003, 2008
  • Sun Solaris SPARC
  • Sun Solaris Opteron
  • IBM AIX
  • HP-UX PA-RISC and IA64
  • SUSE and Red Hat Linux

Since we cannot call the BTEQ utility directly from Python, we use a shell script together with Python's subprocess library.

The example below shows the approach to access Teradata from Python:
 

1. Write a shell script to do the BTEQ login
Contents of the script (TDLogon.sh) will be:

#! /usr/bin/sh

echo " Logging in"
echo $1
Script=$1
echo $Script
bteq < $Script
echo "Logging off"

2. We pass the name of the SQL script to be run to this shell script:
Contents of the SQL script (logon.sql):

--- Contents of the SQL script

.logon edwprod1/datasc1,science123;

.quit

3. Contents of the Python script will be:

#! /usr/bin/python

import subprocess

SqlScript = "logon.sql"
# This can be modified to take different SQL files as parameters

# Build the command that calls the shell wrapper with the SQL file
command = "sh -v TDLogon.sh " + SqlScript
print "Command to run is", command

process = subprocess.Popen(command, shell=True)
#process = subprocess.Popen('sh -v TDLogon.sh logon.sql', shell=True)

process.wait()
print "Bteq login testing complete"

4. Run the Python script (saved here as Unixrun.py) as:
>> python Unixrun.py



This approach can be further refined to accept the SQL script name as a command-line parameter, so the same wrapper can run all your Teradata SQL from Python. A hedged sketch of that refinement is shown below.
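As a minimal sketch of that refinement (assuming the same TDLogon.sh wrapper; the file name RunSql.py is only illustrative):

#! /usr/bin/python
# RunSql.py - hypothetical name; pass any BTEQ/SQL file as the first argument
import subprocess
import sys

if len(sys.argv) != 2:
    print "Usage: python RunSql.py <sql_script>"
    sys.exit(1)

SqlScript = sys.argv[1]
command = "sh TDLogon.sh " + SqlScript
print "Running:", command

process = subprocess.Popen(command, shell=True)
process.wait()

# Propagate BTEQ's return code so a scheduler can detect failures
sys.exit(process.returncode)

Call it as, for example: python RunSql.py logon.sql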
 

If you like our website, please donate as it will help us in our philanthropic endeavours!! Thanks.

Friday 16 August 2013

1st Step towards Text-mining - A Python solution to find combination variables in a Community detection problem

With so much text being generated every day, it is becoming increasingly important to mine text logs.
Mostly, chats will have lines separated by a newline ('\n'), but we may want to choose the separator as necessary.

So, I wrote the two classes below: one reads the text between separators in a file, and the other searches for that text in other files.

For example: 

Suppose we are searching for "Saturday Meetup" in all the chat conversations, combined with the location "Downtown".
We then search for these 2 variables occurring together in other chats. Users whose chats contain both can be combined into the same Community in a Community-detection model, so they form similar groups for Targeted Interaction (sending gift passes, coupons, etc.).

Sample methods for the task (a usage sketch follows the class definitions):

#! /usr/bin/python

#----------------------------------------------------#
# Project : Python text-modification library         #
# Team: Data Sciences                                #
# Author :                                           #
# Date: 23 June, 2013                                #
# Description: Generic classes based on Python       #
#              Each class has the usage and function #
#              defined in header section             #
#----------------------------------------------------#


#----------------------------------------------------#
# Generic class to read all fields separated         #
# by a separator, for example: comma (,)             #
# Initiate the class using the below syntax:         #
#   x = DataBetweenSeparator(Script1, Separator)     #
# Use the method using:                              #
#   x.split(Script1, Separator)                      #
#----------------------------------------------------#


# Call necessary libraries
import re
import sys
import os.path


class DataBetweenSeparator(object):
    def __init__(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        print "DataBetweenSeparator class initiated"

    def split(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        Sep = self.sep
        ScriptName = self.scriptname
        ScriptName_tmp = ScriptName + "_tmp"

        print "The source file for parsing is", ScriptName
        print "The new file is", ScriptName_tmp

        sourcefile = open(ScriptName, "r")
        runfile = open(ScriptName_tmp, "wb")

        for block in sourcefile:
            # Number of separators in the current block
            iter = block.count(Sep)
            count = iter

            # Write each field of the block on its own line, with spaces stripped
            while count > 0:
                field = block.split(Sep)[iter - count]
                runfile.write((field + "\n").replace(' ', ''))
                print (field + "\n").replace(' ', '')
                count = count - 1

        sourcefile.close()
        runfile.close()
        return 0

#----------------------------------------#
# Find the variables in the script       #
#----------------------------------------#

class FindKeyword(object):
    def __init__(self, variable, script):
        self.variable = variable
        self.script = script
        print "FindKeyword class initiated"

    def search(self, keyword, script):
        self.keyword = keyword
        self.script = script
        # Strip any trailing newline from the keyword before searching
        Pattern1 = self.keyword.replace('\n', '')
        print "Pattern being searched", Pattern1
        print "Filename is", script

        sourcefile = open(script)
        reportfile = open("Report.txt", "a")
        for line in sourcefile:
            if Pattern1 in line:
                reportfile.write(Pattern1 + "|True" + "\n")
                sourcefile.close()
                reportfile.close()
                return True

        # Keyword not found anywhere in the file
        reportfile.write(Pattern1 + "|False" + "\n")
        sourcefile.close()
        reportfile.close()
        return False
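A minimal usage sketch for the "Saturday Meetup" / "Downtown" example above (the chat file names and the community decision are only illustrative):

# Hypothetical usage: chat1.txt and chat2.txt are chat logs on disk
parser = DataBetweenSeparator("chat1.txt", "\n")
parser.split("chat1.txt", "\n")

finder = FindKeyword("Saturday Meetup", "chat2.txt")
meetup_found = finder.search("Saturday Meetup", "chat2.txt")
location_found = finder.search("Downtown", "chat2.txt")

# Both variables occurring together suggest the users belong to the same community
print "Same community candidate:", (meetup_found and location_found)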

Friday 2 August 2013

Effective design of BI jobs - reducing manual activity while doing retrospective loading in Teradata. Using Python to pass the date as a parameter from Control-M

Many a time, I come across designers/developers who use the "Current_Date" keyword in their queries.
So, their typical queries look like 

Where TransactionDate > Current_Date - 1

Now, even though this is valid Teradata syntax, that is about all it has going for it.

When this code moves to production and then fails for a few days, we have to manually calculate the date ranges and rerun the queries. A moving window will need multiple iterations of manual tasks, especially if UPSERT or MERGE statements are involved.

Alternative Solution:

We can use Python and Teradata SQL to overcome this problem.

Let us parameterize the date-range in Teradata SQL as :

Where TransactionDate > Current_Date - 1

replaced by

Where TransactionDate > ?ODATE - 1

The below Python script is generic. It can be used to generate date-range files that can then be run from a scheduling tool such as Control-M or cron.

Copy the below script in a file and name it as ReplaceDate.py

Calling the script:

$>  python ReplaceDate.py ODATE /analytics/advanced/sql/Test.Btq

where ODATE = Date passed from Control-M in format 'yyyy-mm-dd'

Generic Python Script:


#! /usr/bin/python
#---------------------------------------------------------#
# Please ensure that you pass 2 parameters                #
# Sample call : python ReplaceDate.py 2013-07-25 Test.Btq #
# Instruction : Ensure that the Test.Btq is in the same   #
#               path as this script                       #
# Author : Senior Bigdata Engineer                        #
#---------------------------------------------------------#
# Call necessary libraries
import re
import sys
import os.path

SysArray = sys.argv
print "No. of parameters", (len(SysArray))

ParamCount = len(SysArray)
if ParamCount != 3:
    print "Enter a valid Date and BTEQ scriptname"
    sys.exit(1)

Value = "'" + SysArray[1] + "'"
ODATE1 = (Value, '?ODATE')
ScriptName = SysArray[2]
ScriptName_tmp = ScriptName + "_tmp"
#print "The temp script is", ScriptName_tmp

#----------------------------------------------------#
# This block of code keeps track of the current date #
#----------------------------------------------------#
from time import gmtime, strftime
ODATE2 = strftime("%Y-%m-%d", gmtime())
print "ODATE for the run is", ODATE1[0]
print "Current Date is", ODATE2

#----------------------------------------------------#
# Check that the BTEQ script exists before replacing #
#----------------------------------------------------#
if os.path.exists(ScriptName):
    print "File " + ScriptName + " exists"
    print "Continue with replacement"
else:
    print "File " + ScriptName + " does not exist"
    sys.exit(1)

#----------------------------------------------------#
# This block replaces ?ODATE with the passed value   #
#----------------------------------------------------#
print "ODATE variable to be replaced", ODATE1[1]
print "Value to be replaced with", ODATE1[0]
sourcefile = open(ScriptName, "r")
runfile = open(ScriptName_tmp, "wb")

#------------------------------------------------------#
# Read each line of sourcefile and replace the pattern #
#------------------------------------------------------#
for Pattern1 in sourcefile:
    runfile.write(Pattern1.replace(ODATE1[1], ODATE1[0]))

sourcefile.close()
runfile.close()

if os.path.exists(ScriptName_tmp):
    print "The temp script is", ScriptName_tmp
    sys.exit(0)
else:
    print "Please check the process. Temp file not created"
    sys.exit(1)

Output of the Script:

Order multiple jobs with different ODATEs into the Control-M AJF; each run of the ReplaceDate.py script then generates a file for that particular ODATE.
A Test.Btq_tmp will be created in which the ?ODATE variable is replaced by '2013-08-02'.

Now run the Test.Btq_tmp file using Control-M in the desired sequence.
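Putting the pieces together, here is a minimal sketch of a wrapper that Control-M (or cron) could call; it reuses the subprocess approach from the BTEQ-login post above, and the wrapper name RunOdateJob.py is only illustrative:

#! /usr/bin/python
# RunOdateJob.py - hypothetical wrapper: replace ?ODATE, then run the result via bteq
import subprocess
import sys

odate = sys.argv[1]    # e.g. 2013-08-02, passed in by Control-M
script = sys.argv[2]   # e.g. /analytics/advanced/sql/Test.Btq

# Step 1: generate <script>_tmp with ?ODATE replaced
rc = subprocess.call("python ReplaceDate.py " + odate + " " + script, shell=True)
if rc != 0:
    sys.exit(rc)

# Step 2: run the generated script through BTEQ and propagate the return code
sys.exit(subprocess.call("bteq < " + script + "_tmp", shell=True))

A sample call would then be: python RunOdateJob.py 2013-08-02 /analytics/advanced/sql/Test.Btq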

Like us if you find the blog helpful. Also, please leave a comment about what else can be improved.




Tuesday 25 June 2013

Handling Large data volumes - PIG basics and advantages

Hadoop PIG is very efficient in handling large volumes of data. We can effectively join files in PIG, saving a lot of CPU cycles on the Teradata server.

It operates in 2 modes - Local and MapReduce.

To invoke the local mode, we type the command as:

bash$> pig -x local

The default mode is Map-Reduce mode.

bash$> pig 

will invoke PIG in the Map Reduce mode.

Running any pig script can be accomplished as below:

bash$> pig TestScript.pig

or 

After logging into Pig:

grunt> exec TestScript.pig
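Keeping with this blog's theme of driving utilities from Python, here is a minimal, hedged sketch of launching a Pig script with the subprocess module (TestScript.pig is the script name from the commands above; the local/MapReduce switch is an assumption of this sketch):

#! /usr/bin/python
# Sketch: run a Pig script from Python in local or MapReduce mode
import subprocess
import sys

# Pass "local" as the first argument to use local mode; the default is MapReduce
mode = "-x local " if len(sys.argv) > 1 and sys.argv[1] == "local" else ""
rc = subprocess.call("pig " + mode + "TestScript.pig", shell=True)
sys.exit(rc)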


We will follow up with PIG commands and how PIG can be combined with Teradata to give great performance improvements.

Tuesday 18 June 2013

Loading XML files using Fastload - Using URL to access front-end data and moving data to Teradata

1. A standard xml format

Let us assume that below data is stored in Test.xml:

 <Response>
                <ResponseID>R_3JHyAiOCTXKq</ResponseID>
                <ResponseSet>Default Response Set</ResponseSet>
                <Name>Anonymous</Name>
                <ExternalDataReference></ExternalDataReference>
                <EmailAddress></EmailAddress>
                <IPAddress>96.224.174.233</IPAddress>
                <Status>0</Status>
                <StartDate>2013-01-11 13:55:53</StartDate>
                <EndDate>2013-01-11 13:56:18</EndDate>
                <Finished>1</Finished>
                <Flag></Flag>
 </Response>

So, anything that comes between <Response> and </Response> is our data set.
Each field data is represented within its corresponding tags.

For example : <ResponseID>R_3JHyAiOCTXKq</ResponseID>

So, to parse this xml, we will have to write a corresponding xslt file.

Let us now write the corresponding xslt file, called Test.xslt.

2. The sample xslt file for the above xml data will be as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">

<xsl:for-each select="xml/Response">
<xsl:value-of select="ResponseID"/>|<xsl:value-of select="ResponseSet"/>|<xsl:value-of select="Name"/>|<xsl:value-of select="ExternalDataReference"/>|<xsl:value-of select="EmailAddress"/>|<xsl:value-of select="IPAddress"/>|<xsl:value-of select="Status"/>|<xsl:value-of select="StartDate"/>|<xsl:value-of select="EndDate"/>|<xsl:value-of select="Finished"/>|<xsl:value-of select="Flag"/>|
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

3. Extract the xml data into a flat file

xsltproc is a command-line XSLT processor for xml.

-o = redirect to output file. Otherwise, data will be redirected to standard output.
Place Test.xml and Test.xslt in the same directory.

Let us now run the below command:

xsltproc Test.xslt Test.xml -o Test.txt

The file Test.txt will contain "|" (pipe delimited) data parsed from the Test.xml file based on the Test.xslt layout file.
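If xsltproc is not available, a similar pipe-delimited extract can be sketched in Python with the standard xml.etree.ElementTree module (this assumes the <Response> elements sit under a single root element, as the xml/Response path in the XSLT implies):

#! /usr/bin/python
# Sketch: parse Test.xml and write pipe-delimited rows to Test.txt
import xml.etree.ElementTree as ET

FIELDS = ["ResponseID", "ResponseSet", "Name", "ExternalDataReference",
          "EmailAddress", "IPAddress", "Status", "StartDate", "EndDate",
          "Finished", "Flag"]

tree = ET.parse("Test.xml")
root = tree.getroot()

out = open("Test.txt", "w")
for resp in root.iter("Response"):
    # Empty or missing tags become empty strings, matching the XSLT output
    row = [(resp.findtext(f) or "") for f in FIELDS]
    out.write("|".join(row) + "|\n")
out.close()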

Testing row counts:

Data2=`wc -l Test.txt|  cut -d ' ' -f1`

Data1=`grep -i "</Response>" Test.xml | wc -l`

If $Data1 equals $Data2, then all the rows were successfully parsed and the record counts in Test.txt and Test.xml match.




Let us know if this information helped. You could donate for a noble cause.

Also, help us understand what topics would you like covered in the future blogs.

Friday 14 June 2013

Components of Integration Layer Design - Key considerations to ensure Data Sanity from Source

The Integration Layer marks the transition from raw data to integrated data. In this layer, the raw data is consolidated, duplicate records and values are removed, and disparate sources are combined into a single version.
After the data is passed through the Integration layer, it is consistent and consumable for the DW and BI applications.

Key considerations for Integration Layer processing:


  1. For data with multiple sources, a Master Data Management process should be in place. See the earlier post "Master_data_management - Benefits and Issues" for details.
  2. To have source traceability, data should be stored with the corresponding source information. One of the major issues as an Organization grows is the duplication of Data due to multiple sources. Data de-duplication is a tedious job; having Source information at a row level eases the job to an extent.
  3. To know the user who modified or inserted the data, we should have AlterUserID and InsertTimeStamp columns. We can use the below SQL in Teradata to populate these 2 columns.

 INSERT INTO DIMENSION1
(
SourceCode
,AlterUserID
,InsertTimeStamp
)
Select
'POS Systems' 
,USER
,CURRENT_TIMESTAMP(0);

Making SourceCode, AlterUserID and InsertTimeStamp NOT NULL columns will ensure that we can always trace changes back to a Source, User and Date.

    4. Updating the Logical and Physical Data Models on a regular basis is a must to keep track of the changes. Data Models are important to understand the flow of data.

This will be the first step in creating Stellar Data, which is clean and traceable.

Master Data Management, Data Models and Storing Source and User information at row-level should help us start building low-maintenance warehouses.


Source: Wikipedia, Teradata Certification

Let us know your comments and suggestions. Please Donate for a cause if you are helped !!


Saturday 25 May 2013

Secondary Indexes and PPI in a Create table - Create column as Varchar(0) - Scenario for Teradata SQL exam

There is a basic difference between the below 2 statements:

1. CREATE TABLE Test1.TableA as Test1.TableB with data;

The above statement creates TableA as a replica of TableB, preserving its Secondary Indexes and Partitioned Primary Indexes.

2. CREATE TABLE Test1.TableA as
     (Select * from Test1.TableB)
     with data PRIMARY INDEX(Cus_ID);

The above statement creates TableA with only the data present in TableB. The Secondary Indexes and Partitioned Primary Indexes are not preserved. The table chooses the PI defined in the Create Table statement.
Also, if there is a column, say Col2 in TableB that has no data, it may get created as Varchar(0) in TableA.

So, when you do a
Show Table Test1.TableA;

you get

Cus_ID Integer Not Null,
Col2 Varchar(0) Character Set Latin

Watch out for these differences when you are tuning your queries!!


Like us if you are helped. 


Saturday 27 April 2013

Drive high value Sales in Retail - A Data Science Case study


In a Retail store, we often have bad user experiences due to long queues at the billing register. But even with efficient billing agents, we can only achieve a certain throughput. So, if "N" is the total number of transactions and "t" represents time, the formula below signifies that we can accommodate a maximum of 2000 transactions per billing register.

                                  R (Resultant transactions) = ∫ (dN/dt) dt, integrated from 0 to 2000

We can only accommodate 2000 transactions per day per register. So, it’s a bottleneck.
Now, we can add more registers, but that comes at a cost.
The alternative way to increase sales is to convert each transaction into a higher-dollar purchase. So, we implement business intelligence algorithms to drive high-value sales.
Even price-sensitive users, when exposed to better quality or service, opt for higher-value products.

For example:

A customer buys 2 items.

Average time for billing 2 items = 2 x 2 mins = 4 mins (taking roughly 2 minutes per item at the register)

Case 1 :

Each item is $5 each. Total value of transaction = $10
Time taken = 4 mins. 
Hence, value generated in unit time = $10/4 = $2.5

Case 2 :

Each item is $9 each. Total value of the transaction = $18
Time taken = 4 mins. 
Hence, value generated in unit time = $18/4 = $4.5
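As a quick worked check of the arithmetic above (a toy calculation, not part of any production code; the 2-minutes-per-item figure is the assumption used in the cases above):

# Toy calculation of value generated per unit time at the register
billing_time_per_item = 2.0            # minutes, as assumed above

def value_per_minute(item_price, items=2):
    total_value = item_price * items
    total_time = billing_time_per_item * items
    return total_value / total_time

print "Case 1:", value_per_minute(5)   # 2.5 dollars per minute
print "Case 2:", value_per_minute(9)   # 4.5 dollars per minute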

Probability and Statistics can help answer numerous Business problems.

Think about it !!

Business Intelligence helps us answer many such questions with ease. We will cover more such case studies later.

Sunday 21 April 2013

Sample UNIX script to compress and uncompress files - Use of the gzip, tar, zip command

There are different ways of creating a file archive. Once we create the archive, we also need to know how to extract the files back and use them.

The below commands work great to achieve our goal; a Python equivalent is sketched after the command list:

Archive creation and extraction:

Using the tar command:
  1. tar -cvf TEST.tar ./*   - Running this command archives all the files located in the current directory into the TEST.tar file (tar itself does not compress; pair it with gzip for compression)
  2. tar -xvf TEST.tar - This command will extract all the files from the archive TEST.tar
  3. tar -tf TEST.tar - This command will list all the files in the archive TEST.tar
Using the gzip command:
  1. gzip Test.txt  - This command will create a gzip file Test.txt.gz
  2. To unzip the file - gunzip Test.txt.gz
Using the zip command:

1. zip Test1 Test.txt - This will create a file Test1.zip
2. unzip Test1.zip -d /tmp/Test - This command will unzip the files in Test1.zip to the /tmp/Test path
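The same archives can also be produced from Python; here is a small sketch using the standard tarfile, gzip, zipfile, glob and shutil modules, with the file names taken from the examples above:

#! /usr/bin/python
# Sketch: create and extract archives with the Python standard library
import glob
import gzip
import shutil
import tarfile
import zipfile

# tar -cvf TEST.tar ./*   /   tar -xvf TEST.tar
names = [n for n in glob.glob("./*") if n != "./TEST.tar"]
tar = tarfile.open("TEST.tar", "w")
for name in names:
    tar.add(name)
tar.close()

tarfile.open("TEST.tar", "r").extractall(".")

# gzip Test.txt   /   gunzip Test.txt.gz
src = open("Test.txt", "rb")
dst = gzip.open("Test.txt.gz", "wb")
shutil.copyfileobj(src, dst)
src.close()
dst.close()

# zip Test1 Test.txt   /   unzip Test1.zip -d /tmp/Test
zf = zipfile.ZipFile("Test1.zip", "w")
zf.write("Test.txt")
zf.close()
zipfile.ZipFile("Test1.zip").extractall("/tmp/Test")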

Let us know if this was helpful !!

Monday 15 April 2013

Teradata Fastexport returning blank lines - Common problems of having NULL values in table - Example and Syntax

NULL handling is one of the most important aspects of reporting queries.
So, if you face a situation where your Fastexport is returning blank lines, check the columns for NULL again.

For example, in the below script, CustEmailAddress is a nullable column. A customer may or may not provide his/her email address.
For all rows where CustEmailAddress is NULL, the entire row becomes NULL on concatenation, since any operation on NULL results in NULL.

Hence, these rows appear as blank rows in the export file.

So, we have to be careful while exporting such values in Fastexport.

Steps to be performed to avoid blank lines:


  1. Use Coalesce function to send a default value if NULL exists
  2. Also, if the columns are of different datatypes, the default value in COALESCE may vary.
  3. For email columns (CustEmailAddress in the example below), a default of '0' may lead to misinterpretation of data. Choose the default value carefully; we would suggest confirming the DEFAULT values with downstream applications
  4. Best practice is to store all columns as Varchar and then send "BLANK" instead of NULL
You can also use the below UNIX command to handle Blank rows in the fastexport output file:

sed '/^$/d' /tmp/outfile1.txt > /out/outfile_20130405.txt


Example Script:


/* Weird behavior of NULL */
/* If any column is null and coalesce is missing, appending the null column leads to an entire null row */

SELECT TRIM(DATA1)
     FROM
     (
     SELECT
     '¿'
     || trim(coalesce(SourceId,''))
     || '|' ||trim(coalesce(Dates,''))
     || '|' ||trim(coalesce(IPAddress,''))
     || '|' ||trim(coalesce(SalesNum,''))
     || '|' ||trim(coalesce(StoreNum,''))
     || '|' ||trim(coalesce(CustFirstName,''))
     || '|' ||trim(coalesce(CustEmailAddress,''))
     || '|' ||trim(coalesce(OrderValue,''))
     || '|' ||trim(coalesce(Sellerid,''))
     AS DATA1
     FROM
     db_work_tbls.CustOrder1
     )as A
     ;

Let us know if this information has helped. Like us on Facebook or Google+

Please donate as your donation will help someone in need.

Saturday 13 April 2013

NULL handling in NOT IN clause - Scenario to handle NULL in Teradata


Very often, we try to find out the values from a table (employee) where the employee_ID is not in the table Resigned. 

One way of achieving that is to use a NOT IN clause. But beware: the presence of a NULL in the subquery may give you wrong results.

A NULL in the NOT IN subquery causes no rows to be returned, because a comparison with NULL is never true in Teradata (it evaluates to UNKNOWN). Moreover, any operation on NULL returns NULL.

So, either exclude NULLs in the subquery (for example, WHERE col2 IS NOT NULL) or populate the column with a default value, such as COALESCE(EMP_Salary, 0), at insert time.

The below example illustrates this unusual behavior of NULL:

Sample script:



/* Unusual behavior of NULL in NOT IN */

create table lci_work_tbls.tera11
(
col1 integer
) ;

insert into lci_work_tbls.tera11 values (1);
insert into lci_work_tbls.tera11 values (2);
insert into lci_work_tbls.tera11 values (3);

create table lci_work_tbls.tera12
(
col2 integer
) ;

insert into lci_work_tbls.tera12 values (1);
insert into lci_work_tbls.tera12 values (null);
insert into lci_work_tbls.tera12 values (3);

/* Test to find the rows returned using NOT IN */

select col1 from lci_work_tbls.tera11
where col1 not in (select col2 from lci_work_tbls.tera12)

;


Result - NO ROWS RETURNED


Like us on Facebook or Google+ if you like our posts. Share what topics you want us to cover.

Please donate if you are helped. Your donation will help those who need help !!

Saturday 6 April 2013

BTEQ Import - The key to load delta data at near Real-time frequency - Sample script included


TPump is hands-down the best utility to maintain an active data-warehouse. 
But many architects feel that if the data only needs to be refreshed every 3 hours and the Teradata system peaks are not at regular intervals, we can use BTEQ Import instead.

It is easy to use, consumes minimal system resources and is ideal for small delta loads.

Example script:

.logon edw/developer1,test;


.IMPORT REPORT FILE = /home/dev1/TRAN.MAR262013.csv

.REPEAT * PACK 250
USING
(
         IN_REC_TYP            CHAR(2)
        ,IN_CUS_NO         CHAR(18)
        ,IN_FTY_NO          CHAR(7)
        ,IN_REG_NO             CHAR(3)
        ,IN_TRS_NO              CHAR(4)
        ,IN_TRS_DT               CHAR(10)
        ,IN_TRS_TM               CHAR(8)
        ,IN_APP_ID         CHAR(9)
        ,IN_AMT               CHAR(7)
        ,IN_ADD_DWN_PMT_AM CHAR(7)
        ,IN_FST_PMT_AM     CHAR(7)
)
INSERT INTO Database1.Pos_register
   (
         CUS_ID_NO
        ,IAN_TYP_CD
        ,FTY_NO
        ,REG_NO
        ,TRS_NO
        ,TRS_DT
        ,TRS_TM
        ,APP_ID
        ,AMT
        ,ADD_DWN_PMT_AM
        ,FST_PMT_AM
   )
VALUES
   (
        :IN_CUS_NO
        ,'RS'
        ,TRIM(:IN_FTY_NO)
        ,TRIM(:IN_REG_NO)
        ,TRIM(:IN_TRS_NO)
        ,:IN_trs_dt (DATE,FORMAT'YYYY-MM-DD')
        ,:IN_trs_tm (TIME(0))
        ,NULLIF(:IN_APP_ID, '         ')
        ,CASE WHEN TRIM(:IN_AMT) = ''
              THEN 0
         ELSE (TRIM(:IN_AMT))
         END
        ,CASE WHEN TRIM(:IN_ADD_DWN_PMT_AM) = ''
              THEN 0
         ELSE (TRIM(:IN_ADD_DWN_PMT_AM/100))
         END
        ,CASE WHEN TRIM(:IN_FST_PMT_AM) = ''
              THEN 0
         ELSE (TRIM(:IN_FST_PMT_AM/100))
         END
   );

.IF ERRORCODE <> 0 THEN .QUIT ERRORCODE;

Advantages of BTEQ Import:

1. Bteq import is light and uses less system resources as compared to its heavy-duty cousins.

2. Bteq import allows manipulation of the incoming data during the load. See the example above, where we divide :IN_ADD_DWN_PMT_AM by 100 during the insert.

3. The PACK option lets us send multiple rows per request, based on the value set. In the above example, up to 250 rows are packed into each request.


Let us know if this topic helped you. Share our blog on Facebook on Google+.

Please donate using the button on the right. Your donation will help someone.

Friday 15 March 2013

Important DBC tables in Teradata - The data-dictionary of Teradata

The following tables provide a great deal of information about your Teradata system. They store the demographics of tables, views, macros, column definitions, error codes and much more.

All we have to understand is how to use these tables to help ourselves and enable our Organization to "Work Smarter"

Examples:

1. SELECT * FROM DBC.TABLES WHERE TABLENAME LIKE '%Customer%'
----this is typically used in impact analysis phase, to see how many tables belong to a particular subject area. (Customer in the above example)


2. SELECT * FROM DBC.INDICES
----this table stores information about the indices: which table they are defined on, etc.


3. SELECT * FROM DBC.COLUMNS
---this table stores the column information, not null/nullable, datatype, unique/non-unique. Greatly helps in Physical Data Modelling activities


4. SELECT * FROM DBC.DBQLOGTBL
---- The DBC.DBQLogTbl stands for database query log. As the name explains, it contains all the core performance data for the queries that run on the system. Before you start querying this table ensure that Query logging is enabled for the User.

This can be accomplished for training_1 user as follows:

---- enable query logging so that all the queries run by a particular user is logged

begin query logging with sql on training_1 ;

select * from dbc.users;
select * from dbc.dbase;
select db1.databasename, dblog1.* from dbc.dbqlogtbl as dblog1
inner join 
dbc.dbase as db1 
on dblog1.userid = db1.databaseid

where db1.databasename = 'training_1';


5. SELECT * FROM DBC.ERRORMSGS WHERE ERRORCODE =3523
---this table gives an understandable description of the Teradata errors. Great for any Developer.

6. select  * from dbc.dbqlogtbl where querytext like '%corp_ksn%' order by CollectTimeStamp desc
---- this query will return the logon information about the user along with the IP address and username. Any modification to the table can be easily tracked using this query.


7. Select count(distinct nodeid) from dbc.resusagescpu;

--- this query will let you find the total no. of Teradata nodes in the system

8. Select nodeid,count(distinct Vproc) from dbc.ResCpuUsageByAmpView group by 1;


---- this query will allow to find the no. of Vproc in the system


Please Donate using the Donate button. Your support is much appreciated !!

Like us on Facebook or Google+

Wednesday 20 February 2013

Fastload pre-processing script in UNIX

We may get files where the total no. of columns in the file is not as per the Defined layout in Fastload.

As part of Optimization of the Batch process, we can abort the Fastload job if the file is not as per the Defined format.

Below is an example script to achieve that:

#! /bin/ksh

# Remove any error file left over from a previous run
rm -f Error_Customer_Connect.txt

while read line
do
    # Assuming that "|" is the separator in your file
    Len1=`echo "$line" | awk -F '|' '{print NF}'`
    echo $Len1 "$line"

    # If your file is supposed to have only 32 columns
    if [ $Len1 -ne 32 ]
    then
        echo "$line" >> Error_Customer_Connect.txt
    fi
done < Extract_Customer_Connect.txt

if [ -s Error_Customer_Connect.txt ]
then
    echo "File is not per the defined layout"
    exit 1
fi

exit 0
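A minimal Python equivalent of the same pre-check (kept as a sketch; the file names and the 32-column layout are carried over from the shell example above):

#! /usr/bin/python
# Sketch: flag rows whose column count does not match the Fastload layout
import sys

EXPECTED_COLUMNS = 32
bad_rows = 0

infile = open("Extract_Customer_Connect.txt", "r")
errfile = open("Error_Customer_Connect.txt", "w")
for line in infile:
    # Assuming "|" is the separator, a valid row has exactly 32 fields
    if len(line.rstrip("\n").split("|")) != EXPECTED_COLUMNS:
        errfile.write(line)
        bad_rows += 1
infile.close()
errfile.close()

if bad_rows > 0:
    print "File is not per the defined layout"
    sys.exit(1)
sys.exit(0)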


Let us know if this helped. Like us or share this blog on Google+

Donate a small amount for renovation of historical structures and for feeding the needy.


Tuesday 12 February 2013

Advantages and Disadvantages of MVC - Compression in Teradata with sample script

 Multi-Value Compression (MVC), also called field compression, allows compression to be applied at a column level. MVC replaces values specified by the user with a compact bit pattern. When the data is accessed, the bit pattern is used to look up the original value in the list (or dictionary).


For the character data type, before Teradata 13.10, MVC was restricted to fixed-length character columns (CHAR data type) and columns up to 255 characters wide. Teradata 13.10 onwards, MVC can be used with variable-length character columns (VARCHAR data type) and columns up to 510 characters wide. It can support any numeric type, all character data, GRAPHIC, VARGRAPHIC, BYTE and VARBYTE.


Compression is effectively used to save space for repeating data.

For example, in the SALE_SEQ_NBR column below, the value always ranges between 1 and 18.
So, it is better to compress these values and store them in the table header.


Sample Script for compression:


CREATE SET TABLE DB_PROD1.sales_detail ,NO FALLBACK ,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     DATABLOCKSIZE = 130048 BYTES, FREESPACE = 15 PERCENT, CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      CUS_ID_NO DECIMAL(18,0) NOT NULL,
      IAN_TYP_CD CHAR(2) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS 'KS',
      SALE_NBR DECIMAL(15,0) NOT NULL,
      SALE_SEQ_NBR SMALLINT NOT NULL DEFAULT 0  COMPRESS (1 ,2 ,3 ,4 ,5 ,6 ,7 ,8 ,9 ,10 ,11 ,12 ,13 ,14 ,15 ,16 ,17 ,18 ),
      LOCN_NBR SMALLINT NOT NULL DEFAULT 0 ,
      KSN_ID DECIMAL(9,0) NOT NULL DEFAULT 0. ,
      DAY_DT DATE FORMAT 'YYYY/MM/DD' NOT NULL,
      UPC_NBR DECIMAL(13,0) NOT NULL,
      SKU_PRC_TYPE_CD BYTEINT NOT NULL COMPRESS (1 ,3 ,6 ),
      SELL_TYPE_CD BYTEINT NOT NULL COMPRESS 0 ,
      SELL_QTY DECIMAL(9,4) NOT NULL COMPRESS (0.0000 ,1.0000 ,-1.0000 ),
      SELL_AMT DECIMAL(9,2) NOT NULL COMPRESS (12.99 ,2.79 ,7.99 ,2.99 ,3.00 ,0.50  )
      )
PRIMARY INDEX ( CUS_ID_NO )
PARTITION BY RANGE_N(DAY_DT  BETWEEN DATE '2006-01-01' AND DATE '2010-12-31' EACH INTERVAL '1' MONTH ,
DATE '2011-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' DAY ,
 NO RANGE, UNKNOWN)
INDEX ( KSN_ID )
INDEX ( LOCN_NBR ,DAY_DT )
INDEX ( UPC_NBR );



Real-life Scenario faced:  

Disadvantages-

  1. MVC can lead to long-running queries and extra CPU cycles.
  2. Compressed columns cannot be used in the Primary Index or the partitioning expression (PPI).


Explanation: If you have compression on a secondary-index column, the underlying data can still be skewed (data is distributed according to the PI). There is a small overhead in fetching data based on the MVC value, and this overhead is magnified when the data is skewed. Hence, for long-running queries, the hot AMP may become very CPU-intensive.

For that reason, we suggest using MVC for cold data - the descriptive columns in an SCD that are not used in joins.

We will cover Block-level compression and Algorithmic Compression in a later post.

Please donate if you find this post helpful by using the button on right !! Let us know your issues, and we will solve them for you.