Friday, 4 September 2015

CSUM and ROW_NUMBER comparison in Teradata - Example script with statistics and syntax - Surrogate key generation in Teradata

We keep getting questions about which one to use for sequence number generation in Teradata.

Many people use CSUM(1,1) as its easier and other databases support that. But, in Teradata , it is not recommended.


To test this out, we ran the below queries today on a table that contains 60K rows and another with just 1000 rows.


Example script:


--- Using CSUM

SELECT CSUM(1, SEED1.ID_MAX),  L_ORDERKEY
  ,ID_MAX
 FROM ITEMPPI AS E
 CROSS JOIN
 (SELECT 1 AS ID_MAX)AS SEED1;

--- Using Row Number 

 SELECT ROW_NUMBER() OVER(ORDER BY L_ORDERKEY) + SEED1.ID_MAX, L_ORDERKEY
  ,ID_MAX
 FROM ITEMPPI AS E
 CROSS JOIN
 (SELECT 0 AS ID_MAX)AS SEED1;


 SELECT * FROM DBC.DBQLogTbl 

 WHERE CAST(COLLECTTIMESTAMP AS DATE FORMAT 'YYYY-MM-DD')= CURRENT_DATE
 AND SESSIONID = 1624
 ORDER BY COLLECTTIMESTAMP DESC;

When you check the QueryLog after running a CSUM(1,1) you'll notice that a single AMP (usually vproc 0) processed all the data ,resulting in high cpu/io and spool.


For OLAP functions there is distribution of the data based on PARTTITION and ORDER and for CSUM(1,1) there's only 1 partition and no order. This can be seen in the diagram below:



Statistics and query log comparison between CSUM and ROW_NUMBER in Teradata
For a smaller table, the difference was not noteworthy, but for larger datasets, we see a big difference.

Let us know if you need more information. Like us on Twitter or Facebook.


No comments:

Post a Comment

Please share your thoughts and let us know the topics you want covered