Data Analysis Example: teradata

Showing posts with label teradata. Show all posts

Tuesday, 5 April 2016

How do you run a scala script in scala command line

You might be using interactive mode of Scala to look at your data. But, if you have written the steps in a dot(.)scala file, then how do you execute. You must know by now that .scala is not mandatory, but its a good practice to follow the convention. We can use the :load option to execute this script from a scala REPL

This is what you do:

1. Write the scala commands in a script CountExample_shell.scala

sc;
val pagecounts = sc.textFile("/home/training/pagecounts/");

// take the first 10 lines and print them
pagecounts.take(10).foreach(println);

pagecounts.count;

// filter only lines that have 'en' (english) for 2nd value of the array
val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache;

enPages.count;

//Create key value pairs in scala
val enTuples = enPages.map(line => line.split(" "));

val enKeyValuePairs = enTuples.map(line => (line(0).substring(0,8) , line(3).toInt));

enKeyValuePairs.reduceByKey(_+_, 1).collect;

enPages.map(l => l.split(" ")).map(l => (l(2), l(3).toInt)).reduceByKey(_+_ , 40).filter( x => x._2 > 200000).map (x => (x._2 , x._1)).collect.foreach(println);

2. Now, to execute the script, use

scala> :load /home/training/CountExample_shell.scala

3. The script will execute and display the below lines.

Loading /home/training/CountExample_shell.scala...
res24: String = /home/training/spark-1.6.0-bin-hadoop2.6/bin/spark-shell
res25: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1fa98a22
pagecounts: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:32
20090505-000000 aa Main_Page 2 9980
20090505-000000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 465
20090505-000000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 1 16086
20090505-000000 af.b Tuisblad 1 36236
20090505-000000 af.d Tuisblad 4 189738
20090505-000000 af.q Tuisblad 2 56143
20090505-000000 af Afrika 1 46833
20090505-000000 af Afrikaans 2 53577
20090505-000000 af Australi%C3%AB 1 132432
20090505-000000 af Barack_Obama 1 23368
res27: Long = 1398882
enPages: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at filter at <console>:34
res28: Long = 970545
enTuples: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:36
enKeyValuePairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:38
res29: Array[(String, Int)] = Array((20090507,6175726), (20090505,7076855))
(468159,Special:Search)
(451126,Main_Page)
(1066734,404_error/)

Please post your queries on Scala and let us know your thoughts.

Friday, 2 August 2013

Effective design of BI jobs - reducing manual activity while doing retrospective loading in teradata. Using Python to pass date as a parameter to Control-M

Many a times, I come across designers/developers who use "Current_Date" keyword in their queries.
So, their typical queries look like

Where TransactionDate > Current_Date - 1

Now even though this is a valid Teradata syntax, but that's all.

When you move this code to production and then it fails for a few days, we will have to manually calculate the date-range and run the queries. A moving window will need multiple iterations of manual tasks, especially if we have UPSERT or MERGE statements involved.

Alternative Solution:

We can use Python and Teradata SQL to overcome this problem.

Let us parameterize the date-range in Teradata SQL as :

Where TransactionDate > Current_Date - 1

replaced by

Where TransactionDate > ?ODATE - 1

The below Python script is generic. It can be used to generate date-range files, that be run from a Scheduling tool like Control-M or cronjob.

Copy the below script in a file and name it as ReplaceDate.py

Calling the script:

$> python ReplaceDate.py ODATE /analytics/advanced/sql/Test.Btq

where ODATE = Date passed from Control-M in format 'yyyy-mm-dd'

Generic Python Script:

#! /usr/bin/python
#--------------------------------------------------------------------------------------#
# Please ensure that you pass 2 parameters #
# Sample call : python ImportDate.py 2013-07-25 Test.Btq #
# Instruction : Ensure that the Test.Btq is in the same #
# path as this script #
# Author : Senior Bigdata Engineer #
#------------------------------------------------------------------------------------#
# Call necessary libraries
import re
import sys
import os.path

SysArray = sys.argv
#print (SysArray[1])
print "No. of parameters", (len(SysArray))

Value = "'" + SysArray[1] + "'"
ODATE1 =(Value, '?ODATE')
ParamCount = len(SysArray)
ScriptName = SysArray[2]
ScriptName_tmp = ScriptName + "_tmp"
#print "The temp script is",ScriptName_tmp
if ( ParamCount != 3 ):
print ("Enter a valid Date and BTEQ scriptname")
exit

#----------------------------------------------------#
# This block of code keeps track of the current date #
#----------------------------------------------------#
from time import gmtime,strftime
ODATE2=strftime("%Y-%m-%d",gmtime())
print "ODATE for the run is", ODATE1[0]
print "Current Date is", ODATE2

#------------------------------------------------#
# Define function to replace ODATE in the BTEQ #
#------------------------------------------------#
if os.path.exists(ScriptName):
print "File " +ScriptName +" exists"
print "Continue with replacement"
else:
print "File " +ScriptName +" does not exist"
sys.exit(1)
#----------------------------------------------------#
# This block creates a list of values to be replaced #
#----------------------------------------------------#

#import re
print "ODATE variable to be replaced", ODATE1[1]
print "Value to be replaced with", ODATE1[0]
sourcefile = open(ScriptName, "r")
runfile = open(ScriptName_tmp, "wb")
#------------------------------------------------------#
# Read each line of sourcefile and replace the pattern #
#------------------------------------------------------#

for Pattern1 in sourcefile:
#print Pattern1.replace(ODATE1[1],ODATE1[0])
runfile.write(Pattern1.replace(ODATE1[1],ODATE1[0]))
#print Pattern1
sourcefile.close()
runfile.close()
#ReplaceOdate(ScriptName)
#import os.path
if os.path.exists(ScriptName_tmp):
print "The temp script is",ScriptName_tmp
sys.exit(0)
else:
print "Please check the process. Temp file not created"
sys.exit(1)

Output of the Script:

Import multiple jobs with different ODATE in the AJF of Control-M. So, each time the ReplaceDate.py script generates files for particular ODATE.

A Test.Btq_tmp will be created where the ?ODATE variable will be replaced by '2013-08-02'

Now run the Test.Btq_tmp file using Control-M in the desired sequence.

Like us if you find the blog helpful. Also, please leave a comment about what else can be improved.

Tuesday, 25 June 2013

Handling Large data volumes - PIG basics and advantages

Hadoop PIG is very efficient in handling large volumes of Data. We could effectively join files, saving a lot of CPU cycles from Teradata server.

It operates in 2 modes - Local and MapReduce.

To invoke the local mode, we type the command as:

bash$> pig -x local

The default mode is Map-Reduce mode.

bash$> pig

will invoke PIG in the Map Reduce mode.

Running any pig script can be accomplished as below:

bash$> pig TestScript.pig

or

After logging into Pig:

grunt> exec TestScript.pig

We will follow up with PIG commands and how PIG can be combined with Teradata to give great performance improvements.

Monday, 17 December 2012

Teradata Inline Stats - How to over-ride the optimiser

Inline Stats are statistics collected in between a script, usually before a SELECT or a JOIN.

It is mostly used to over-ride the optimiser to pick a different PLAN.

For example:

We have a table FACT_SALES with PI on CUS_ID.

But, the Sales_Dimension has to be joined on PROD_ID, a primary key in the Dimension. There is no point joining based on CUS_ID, as it will lead to a spool-join rather than amp-local join.

So, we collect an Inline Stats by using:

Collects stats on Sales_Dim column PROD_ID;

This will help the optimizer to join the 2 tables based on PROD_ID.

Friday, 30 November 2012

ABS - Absolute function in Teradata.
In mathematics, the absolute value (or modulus)

| a |

of a real number

a

is the non-negative value of

a

without regard to its sign.
Mod (not to be confused with Modulus) function in SQL acts differently.

Example:
So, let us put the ABS function to use.
Suppose I have to find the balance from an account table and decide if its a Credit Balance or a Debit Balance. So, we are not concerned about the sign, rather, in this query request we are more interested to compare the values. Hence, use ABS function, so that the signs (+ or -) do not affect your query results.

Query:

SELECT
CUSTOMER_ID
,SUM(ABS(CREDIT_TRNS) AS CREDIT
,SUM(DEBIT_TRNS) AS DEBIT
,CASE WHEN CREDIT > DEBIT THEN 'CREDIT_BALANCE'
WHEN CREDIT < DEBIT THEN 'DEBIT_BALANCE'
END
AS BALANCE_TYPE
FROM
DW_PROD1.DIM_ACCOUNT_TRNS;

The reason for using ABS(CREDIT_TRNS) is because Banks treats deposits as liabilities and are entered as negative values, representing amount to be paid back to corporate houses and individuals.

Friday, 17 August 2012

Fast-export Utility in Teradata with example: How to effecitvely send data from Teradata to Client-applications

Fast-export example:

As the definition says, FastExport is a command-driven utility that uses multiple sessions to quickly transfer large amounts of data from tables and views of the Teradata Database to a client-based application.

Example of Fast-export:

.logtable $LOADDB.LOGTABLE_ARRANGEMENT;

.run File $LOGON_FILE;

.begin export sessions 2;

.export outfile $VAR6 FORMAT TEXT MODE RECORD;

SELECT TRIM(CAST(HEAD1.HEADER AS CHAR(100))) FROM

(

SELECT

'¿'||

TRIM(A.ID_HEADER)

|| '|' ||TRIM(A.CLUSTER_NAME)

|| '|' || TRIM(A.BUS_DATE)

|| '|' ||TRIM(A.PRODUCT_NAME)

|| '|' ||TRIM(A.PROCESS_NAME)

|| '|' ||TRIM(A.FREQUENCY_NAME)

|| '|' ||TRIM(A.VERSION_NO)

|| '¿'

AS HEADER

FROM

(

SELECT

'01' AS ID_HEADER

,TRIM(CLUSTER_NAME) AS CLUSTER_NAME

, CAST(CAST(CAST('$BUS_DATE' AS DATE FORMAT 'YYYY-MM-DD') AS DATE FORMAT 'DDMMMYYYY') AS VARCHAR(12)) AS BUS_DATE

,TRIM(PRODUCT_NAME) AS PRODUCT_NAME

,TRIM(PROCESS_NAME) AS PROCESS_NAME

,TRIM(FREQUENCY_NAME) AS FREQUENCY_NAME

,MAX(VERSION_NO)+1 AS VERSION_NO

FROM

?LOADDB.VERSION_HEAD

WHERE CAST('$BUS_DATE' AS DATE FORMAT 'YYYY-MM-DD')= RUNDATE

AND PRODUCT_NAME='PT_007_LEA'

group by 1,2,3,4,5,6

) AS A

) AS HEAD1

;

.end export;

.logoff;

TIPS:

Convert date fields to varchar(12) before export

Generally Fast-export will append the first 2 bytes with header information, so to eliminate that we have appended '¿' symbol

Use a post-treatment script to remove the '¿' character

Include the following command in UNIX script for post-treatment:

cat $VAR6 | cut -f 2 -d ¿ >$VAR6.txt

Please let me know if you need more information.

Tuesday, 10 July 2012

Why should we Collect Statistics on Partition

Why should we Collect Statistics on Partition?

These are a few details on the importance of collecting stats on PARTITION keyword. Please make sure to add this for all your target tables.

This is more quick method for getting information like select count(*) statements.
This does not take time to collect stats and goes into cylinder headers and collects data.

For accurate information, collect statistics every time any data modification happens to that table.

COLLECT STATISTICS TA_EDW_SCDB.TXN_MASTER_RETAIL COLUMN PARTITION;

*** Update completed. One row changed.

*** Total elapsed time was 44 seconds.

HELP STATS TA_EDW_SCDB.TXN_MASTER_RETAIL COLUMN PARTITION;

Date	Time	Number of Rows	Number of Nulls	Number of All Nulls	Number of AMPs	Average AMP RPV	OneAMPSampleEst	AllAMPSampleEst
11/7/2019	5:10:19	270,155,352	0	0	25	0	270320650	270127450
11/7/2019	5:10:19	270,155,352	0	0	25	0	270320650	270127450