Tuesday, 8 September 2015

Big Data Lake Implementation - Moving Data from OLTP (MySQL) to HDFS using Apache Sqoop - Example scripts

To persist the entire history data in the Big Data Lake, we started with the ingestion and storage of all records in the OLTP system (based on MySQL) to HDFS cluster.

Below is a sample sqoop import call that allows us to do this with ease.

sqoop import --connect jdbc:mysql://localhost/test_nifi --username root --table apache_nifi_test -m 1


We can also persist the data directly onto a Hive table :

./sqoop import –connect jdbc:mysql://w.x.y.z:3306/db_name –username user_name –password **** –hive-import –hive-database -table oltp_table1 -m 1

The m 1 creates only one file for the table. This is used if you don't have a primary key defined. Sqoop uses the primary key for partitioning the data.

The below diagram will explain the steps we ran to achieve the data copy.

Step 1: Creation of a table on MySQL with data


Copy MySQL data onto HDFS using Sqoop for Big Data Lake implementation

Step 2: Running sqoop to extract data from the MySQL table and dump on HDFS

MySQL to HDFS data transer via Sqoop

As you can see in the diagram, by default the data is copied into :
/user/<login-user>/<tablename>
The path from where you execute the sqoop import command also stores the .java file that was used to import the data.

Let us know if you like our Blog. Thanks!!

Friday, 4 September 2015

CSUM and ROW_NUMBER comparison in Teradata - Example script with statistics and syntax - Surrogate key generation in Teradata

We keep getting questions about which one to use for sequence number generation in Teradata.

Many people use CSUM(1,1) as its easier and other databases support that. But, in Teradata , it is not recommended.


To test this out, we ran the below queries today on a table that contains 60K rows and another with just 1000 rows.


Example script:


--- Using CSUM

SELECT CSUM(1, SEED1.ID_MAX),  L_ORDERKEY
  ,ID_MAX
 FROM ITEMPPI AS E
 CROSS JOIN
 (SELECT 1 AS ID_MAX)AS SEED1;

--- Using Row Number 

 SELECT ROW_NUMBER() OVER(ORDER BY L_ORDERKEY) + SEED1.ID_MAX, L_ORDERKEY
  ,ID_MAX
 FROM ITEMPPI AS E
 CROSS JOIN
 (SELECT 0 AS ID_MAX)AS SEED1;


 SELECT * FROM DBC.DBQLogTbl 

 WHERE CAST(COLLECTTIMESTAMP AS DATE FORMAT 'YYYY-MM-DD')= CURRENT_DATE
 AND SESSIONID = 1624
 ORDER BY COLLECTTIMESTAMP DESC;

When you check the QueryLog after running a CSUM(1,1) you'll notice that a single AMP (usually vproc 0) processed all the data ,resulting in high cpu/io and spool.


For OLAP functions there is distribution of the data based on PARTTITION and ORDER and for CSUM(1,1) there's only 1 partition and no order. This can be seen in the diagram below:



Statistics and query log comparison between CSUM and ROW_NUMBER in Teradata
For a smaller table, the difference was not noteworthy, but for larger datasets, we see a big difference.

Let us know if you need more information. Like us on Twitter or Facebook.


Thursday, 3 September 2015

Multiload script to load multiple tables - Teradata multiload script with Indexes on Table


Teradata Multiload allows us to UPDATE or INSERT data into non-empty tables. MLOAD internally does the operation using 5 phases.
Phases of Mload are:
1.       Multiload Initial Phase
2.       Multiload DML transaction Phase
3.       Multiload Acquisistion Phase
4.       Multiload Application Phase
5.       Multiload Cleanup Phase

5 tables can be loaded using a single multiload script if they are using the same file.
If you need to restart a failed MLOAD job:
Drop the log tables:
drop table financial.log_staging_cust_present_1;
drop table financial.staging_cust_present_ET_1;
drop table financial.staging_cust_present_UV_1;
drop table ET_staging_cust_present_1;
drop table UV_staging_cust_present_1;

Release the mload lock on the table:

release mload staging_cust_present_1 ;
release mload staging_cust_present ;

Adding  Unique secondary indexes on a table and trying to load data using Multiload, gives an error as shown below:
**** 03:30:30 UTY0805 RDBMS failure, 3547: MLoad target table may not contain
     unique secondary indexes.

This is because USI is a 2-AMP operation.

Index support for Multiload is as below:

1.       NUPI
2.       UPI
3.       PPI
4.       NUSI




Please run the below steps to validate the index support.
CREATE MULTISET TABLE financial.staging_cust_present     (
      NAME VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      DOB DATE FORMAT 'yyyy-mm-dd')
PRIMARY INDEX ( NAME )
INDEX idx_staging_cust ( DOB );

CREATE MULTISET TABLE financial.staging_cust_present_1      (
      NAME VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      DOB DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( NAME ,DOB )
PARTITION BY RANGE_N(DOB  BETWEEN '2015-01-01' AND '2015-12-31' EACH INTERVAL '1' MONTH ,
 NO RANGE OR UNKNOWN);

CREATE MULTISET TABLE financial.staging_cust_present_2     (
      NAME VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      DOB DATE FORMAT 'YY/MM/DD')
UNIQUE PRIMARY INDEX ( NAME ,DOB )
PARTITION BY RANGE_N(DOB  BETWEEN '2015-01-01' AND '2015-12-31' EACH INTERVAL '1' MONTH ,
 NO RANGE OR UNKNOWN)
UNIQUE INDEX(DOB);

Execute the attached mload script as: mload < mload_cust_online_present_2.mload

Observation:  If you have USI, the mload will fail. When we remove the Unique secondary index, then the below log file will be generated



The 5 phases of MLOAD are highlighted in yellow:

TDExpress14.10.01_Sles10:~/Teradata/finance # mload < mload_cust_online_present_2.mload
     ========================================================================
     =                                                                      =
     =          MultiLoad Utility    Release MLOD.14.10.00.04               =
     =          Platform LINUX                                              =
     =                                                                      =
     ========================================================================
     =                                                                      =
     =     Copyright 1990-2013 Teradata Corporation. ALL RIGHTS RESERVED.   =
     =                                                                      =
     ========================================================================
**** 04:23:55 UTY2411 Processing start date: WED APR 01, 2015
     ========================================================================
     =                                                                      =
     =          Logon/Connection                                            =
     =                                                                      =
     ========================================================================
0001 .LOGTABLE financial.log_staging_cust_present_1;

0002 .LOGON 127.0.01/dbc,;
**** 04:23:55 UTY8400 Current RDBMS supports Enhanced Statement Status
**** 04:23:55 UTY8400 Teradata Database Release: 14.10.01.02
**** 04:23:55 UTY8400 Teradata Database Version: 14.10.01.01
**** 04:23:55 UTY8400 Default character set: ASCII
**** 04:23:55 UTY8400 Current RDBMS has interval support
**** 04:23:55 UTY8400 Current RDBMS has UDT support
**** 04:23:55 UTY8400 Current RDBMS has Large Decimal support
**** 04:23:55 UTY8400 Current RDBMS has TASM support
**** 04:23:55 UTY8400 Maximum supported buffer size: 1M
**** 04:23:55 UTY8400 Data Encryption supported by RDBMS server
**** 04:23:55 UTY6211 A successful connect was made to the RDBMS.
**** 04:23:55 UTY6210 Logtable 'financial.log_staging_cust_present_1' indicates
     that a restart is in progress.
     ========================================================================
     =                                                                      =
     =          Processing Control Statements                               =
     =                                                                      =
     ========================================================================
0003 sessions 5;
**** 04:23:55 UTY1014 Unsupported RDBMS/SQL statement: "SESSIONS"
0004 tenacity 1;
**** 04:23:55 UTY1014 Unsupported RDBMS/SQL statement: "TENACITY"

0005 database financial;
**** 04:23:55 UTY1012 A restart is in progress.  This request has already been
     executed.  The return code was: 0.
**** 04:23:55 UTY1024 Session modal request, 'DATABASE', re-executed.
0006 drop table financial.staging_cust_present_WK;
**** 04:23:55 UTY1012 A restart is in progress.  This request has already been
     executed.  The return code was: 0.
0007 drop table financial.staging_cust_present_ET_1;
**** 04:23:55 UTY1011 A restart is in progress.  This request has already been
     executed.  The return code was: 3807, accompanied by the following message
     text:
     Object 'financial.staging_cust_present_ET_1' does not exist.
0008 drop table financial.staging_cust_present_UV_1;
**** 04:23:55 UTY1011 A restart is in progress.  This request has already been
     executed.  The return code was: 3807, accompanied by the following message
     text:
     Object 'financial.staging_cust_present_UV_1' does not exist.

0009 .BEGIN IMPORT MLOAD
     TABLES financial.staging_cust_present,financial.staging_cust_present_1
     WORKTABLES
     financial.staging_cust_present_WK,financial.staging_cust_present_1_WK
     ERRORTABLES financial.staging_cust_present_ET_1
     financial.staging_cust_present_UV_1;
     ========================================================================
     =                                                                      =
     =          Processing MultiLoad Statements                             =
     =                                                                      =
     ========================================================================

0010 .LAYOUT DATAIN_LAYOUT;
0011 .FIELD c_name 1 VARCHAR(30);
0012 .FIELD DOB * varchar(10);

0013 .DML LABEL UPSERT_DML
     do insert for missing update rows;
0014 update financial.staging_cust_present
     set
     ---name = :c_name
     dob = :dob
     where name = :c_name;

0015 INSERT INTO financial.staging_cust_present
     (
     name = :c_name
     ,dob = :dob(date, format 'ddmmyyyy')
     );

0016 .dml label insert_dml;
0017 INSERT INTO financial.staging_cust_present_1
     (
     name = :c_name
     ,dob = :dob(date, format 'ddmmyyyy')
     );


0018 .IMPORT INFILE fld1.csv
     FORMAT vartext '|' LAYOUT DATAIN_LAYOUT
     apply upsert_dml
     APPLY INSERT_DML
     ;

0019 .END MLOAD;
     ========================================================================
     =                                                                      =
     =          MultiLoad Initial Phase                                     =
     =                                                                      =
     ========================================================================
**** 04:23:55 UTY0829 Options in effect for this MultiLoad import task:
     .       Sessions:    One session per available amp.
     .       Checkpoint:  15 minute(s).
     .       Tenacity:    4 hour limit to successfully connect load sessions.
     .       Errlimit:    No limit in effect.
     .       AmpCheck:    In effect for apply phase transitions.
**** 04:23:55 UTY0817 MultiLoad submitting the following request:
     Select NULL from financial.log_staging_cust_present_1 where (LogType = 125)
     and (Seq = 1) and (MloadSeq = 0);
**** 04:23:55 UTY0817 MultiLoad submitting the following request:
     Select NULL from financial.log_staging_cust_present_1 where (LogType = 120)
     and (Seq = 1);
**** 04:23:55 UTY0817 MultiLoad submitting the following request:
     SET QUERY_BAND='UTILITYNAME=MULTLOAD;' UPDATE FOR SESSION;
**** 04:23:55 UTY0817 MultiLoad submitting the following request:
     CHECK WORKLOAD FOR BEGIN MLOAD
     financial.staging_cust_present,financial.staging_cust_present_1;
**** 04:23:55 UTY0817 MultiLoad submitting the following request:
     CHECK WORKLOAD END;
**** 04:23:57 UTY0815 MLOAD session(s) connected: 2.
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     BEGIN MLOAD financial.staging_cust_present,financial.staging_cust_present_1
     WITH INTERVAL;
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     Select NULL from financial.log_staging_cust_present_1 where (LogType = 130)
     and (Seq = 1) and (MloadSeq = 10);
     ========================================================================
     =                                                                      =
     =          MultiLoad DML Transaction Phase                             =
     =                                                                      =
     ========================================================================
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     BT;
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     USING C_NAME(VARCHAR(30)), DOB(VARCHAR(10)) update
     financial.staging_cust_present
     set
     ---name = :c_name
     dob = :dob
     where name = :c_name;
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     USING C_NAME(VARCHAR(30)), DOB(VARCHAR(10)) INSERT INTO
     financial.staging_cust_present
     (
     name = :c_name
     ,dob = :dob(date, format 'ddmmyyyy')
     );
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     USING C_NAME(VARCHAR(30)), DOB(VARCHAR(10)) INSERT INTO
     financial.staging_cust_present_1
     (
     name = :c_name
     ,dob = :dob(date, format 'ddmmyyyy')
     );
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq,
     MLoadSeq)VALUES(130, 1, 20);
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     ET;
     ========================================================================
     =                                                                      =
     =          MultiLoad Acquisition Phase                                 =
     =                                                                      =
     ========================================================================
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0;
**** 04:23:57 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0;
**** 04:23:58 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0;
**** 04:23:59 UTY0817 MultiLoad submitting the following request:
     USING Ckpt(VARBYTE(1024)) INSERT financial.log_staging_cust_present_1
       (Logtype, Seq, MLoadSeq, MLoadImpSeq, MLoadSrcSeq,
     MiscInt1,MiscInt2,MiscInt3,MiscInt4,
     MiscInt5,MiscInt6,MiscInt7,MiscInt8,CkptInterval,byteflag,MLoadCkpt)
      VALUES (110, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, :Ckpt);
**** 04:23:59 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0;
**** 04:24:00 UTY0817 MultiLoad submitting the following request:
     USING Ckpt(VARBYTE(1024)) INSERT financial.log_staging_cust_present_1
       (Logtype, Seq, MLoadSeq, MLoadImpSeq, MLoadSrcSeq,
     MiscInt1,MiscInt2,MiscInt3,MiscInt4,
     MiscInt5,MiscInt6,MiscInt7,MiscInt8,CkptInterval,byteflag,MLoadCkpt)
      VALUES (110, 1, 1, 1, 0, 4, 4, 8, 8, 0, 0, 0, 0, 0, 1, :Ckpt);
**** 04:24:00 UTY0826 A checkpoint has been taken, recording that end of file
     has been reached for IMPORT 1 of this MultiLoad Import task.
**** 04:24:00 UTY1803 Import processing statistics
     .                                       IMPORT  1     Total thus far
     .                                       =========     ==============
     Candidate records considered:........           4.......           4
     Apply conditions satisfied:..........           8.......           8
     Candidate records not applied:.......           0.......           0
     Candidate records rejected:..........           0.......           0
**** 04:24:00 UTY1821 Aquisition Phase statistics
     Elapsed time:  00:00:03
     CPU time:      0.01 Seconds
     MB/sec:        4.86667e-05
     MB/cpusec:     0.0146
**** 04:24:00 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0;
**** 04:24:01 UTY0817 MultiLoad submitting the following request:
     CHECKPOINT LOADING INTERVAL 0 END;
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq,
     MLoadSeq)VALUES(130, 1, 30);
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     ET;
     ========================================================================
     =                                                                      =
     =          MultiLoad Application Phase                                 =
     =                                                                      =
     ========================================================================
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     EXEC MLOAD financial.staging_cust_present;EXEC MLOAD
     financial.staging_cust_present_1;
**** 04:24:02 UTY0818 Statistics for table financial.staging_cust_present:
        Inserts:                    0
        Updates:                    0
        Deletes:                    0
**** 04:24:02 UTY0818 Statistics for table financial.staging_cust_present_1:
        Inserts:                    0
        Updates:                    0
        Deletes:                    0
**** 04:24:02 UTY1822 Application Phase statistics
     Elapsed time:  00:00:00
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq) VALUES (115, 1)
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     END MLOAD;
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq) VALUES (120, 1)
**** 04:24:02 UTY0817 MultiLoad submitting the following request:
     USING Ckpt(VARBYTE(1024)) INS financial.log_staging_cust_present_1
     (LogType, Seq, MLoadCkpt)VALUES(135, 1, :Ckpt);
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     END TRANSACTION;
     ========================================================================
     =                                                                      =
     =          MultiLoad Task Cleanup                                      =
     =                                                                      =
     ========================================================================
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     SELECT COUNT(*) FROM financial.staging_cust_present_ET_1;
**** 04:24:03 UTY0846 Error table financial.staging_cust_present_ET_1 contains 4
     rows.
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     SELECT COUNT(*) FROM financial.staging_cust_present_UV_1;
**** 04:24:03 UTY0821 Error table financial.staging_cust_present_UV_1 is EMPTY,
     dropping table.
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq,
     MLoadSeq)VALUES(125, 1, 2)
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     DROP TABLE financial.staging_cust_present_UV_1;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     END TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq,
     MLoadSeq)VALUES(125, 1, 3)
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     DROP TABLE financial.staging_cust_present_WK;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     END TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     SELECT COUNT(*) FROM financial.ET_staging_cust_present_1;
**** 04:24:03 UTY0846 Error table financial.ET_staging_cust_present_1 contains 2
     rows.
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     SELECT COUNT(*) FROM financial.UV_staging_cust_present_1;
**** 04:24:03 UTY0846 Error table financial.UV_staging_cust_present_1 contains 2
     rows.
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq,
     MLoadSeq)VALUES(125, 1, 6)
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     DROP TABLE financial.staging_cust_present_1_WK;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     END TRANSACTION;
**** 04:24:03 UTY0825 Error table statistics for:

     Target table 1: financial.staging_cust_present
     Number of Rows  Error Table Name
     ==============  ========================================================
                  4  financial.staging_cust_present_ET_1
                  0  financial.staging_cust_present_UV_1

     Target table 2: financial.staging_cust_present_1
     Number of Rows  Error Table Name
     ==============  ========================================================
                  2  financial.ET_staging_cust_present_1
                  2  financial.UV_staging_cust_present_1

**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     BEGIN TRANSACTION;
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     USING Ckpt(VARBYTE(1024)) INS financial.log_staging_cust_present_1
     (LogType, Seq, MLoadCkpt)VALUES(140, 1, :Ckpt);
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     INS financial.log_staging_cust_present_1 (LogType, Seq) VALUES (125, 1)
**** 04:24:03 UTY0817 MultiLoad submitting the following request:
     END TRANSACTION;
**** 04:24:05 UTY0822 MultiLoad processing complete for this MultiLoad import
     task.
     ========================================================================
     =                                                                      =
     =          MultiLoad Task Complete                                     =
     =                                                                      =
     ========================================================================
**** 04:24:05 UTY1024 Session modal request, 'DATABASE', re-executed.
**** 04:24:05 UTY1024 Session modal request, 'SET
     QUERY_BAND='UTILITYNAME=MULTLOAD;' UPDATE FOR SESSION;', re-executed.
     ========================================================================
     =                                                                      =
     =          Processing Control Statements                               =
     =                                                                      =
     ========================================================================

0020 .LOGOFF;
     ========================================================================
     =                                                                      =
     =          Logoff/Disconnect                                           =
     =                                                                      =
     ========================================================================
**** 04:24:05 UTY6216 The restart log table has been dropped.
**** 04:24:05 UTY6212 A successful disconnect was made from the RDBMS.
**** 04:24:05 UTY2410 Total processor time used = '1.1 Seconds'
     .       Start : 04:23:55 - WED APR 01, 2015
     .       End   : 04:24:05 - WED APR 01, 2015
     .       Highest return code encountered = '0'.