We keep getting questions about which one to use for sequence number generation in Teradata.
Many people use CSUM(1,1) as its easier and other databases support that. But, in Teradata , it is not recommended.
To test this out, we ran the below queries today on a table that contains 60K rows and another with just 1000 rows.
Example script:
--- Using CSUM
SELECT CSUM(1, SEED1.ID_MAX), L_ORDERKEY
,ID_MAX
FROM ITEMPPI AS E
CROSS JOIN
(SELECT 1 AS ID_MAX)AS SEED1;
--- Using Row Number
SELECT ROW_NUMBER() OVER(ORDER BY L_ORDERKEY) + SEED1.ID_MAX, L_ORDERKEY
,ID_MAX
FROM ITEMPPI AS E
CROSS JOIN
(SELECT 0 AS ID_MAX)AS SEED1;
SELECT * FROM DBC.DBQLogTbl
WHERE CAST(COLLECTTIMESTAMP AS DATE FORMAT 'YYYY-MM-DD')= CURRENT_DATE
AND SESSIONID = 1624
ORDER BY COLLECTTIMESTAMP DESC;
When you check the QueryLog after running a CSUM(1,1) you'll notice that a single AMP (usually vproc 0) processed all the data ,resulting in high cpu/io and spool.
For OLAP functions there is distribution of the data based on PARTTITION and ORDER and for CSUM(1,1) there's only 1 partition and no order. This can be seen in the diagram below:
For a smaller table, the difference was not noteworthy, but for larger datasets, we see a big difference.
Let us know if you need more information. Like us on Twitter or Facebook.
Many people use CSUM(1,1) as its easier and other databases support that. But, in Teradata , it is not recommended.
To test this out, we ran the below queries today on a table that contains 60K rows and another with just 1000 rows.
Example script:
--- Using CSUM
SELECT CSUM(1, SEED1.ID_MAX), L_ORDERKEY
,ID_MAX
FROM ITEMPPI AS E
CROSS JOIN
(SELECT 1 AS ID_MAX)AS SEED1;
--- Using Row Number
SELECT ROW_NUMBER() OVER(ORDER BY L_ORDERKEY) + SEED1.ID_MAX, L_ORDERKEY
,ID_MAX
FROM ITEMPPI AS E
CROSS JOIN
(SELECT 0 AS ID_MAX)AS SEED1;
SELECT * FROM DBC.DBQLogTbl
WHERE CAST(COLLECTTIMESTAMP AS DATE FORMAT 'YYYY-MM-DD')= CURRENT_DATE
AND SESSIONID = 1624
ORDER BY COLLECTTIMESTAMP DESC;
When you check the QueryLog after running a CSUM(1,1) you'll notice that a single AMP (usually vproc 0) processed all the data ,resulting in high cpu/io and spool.
For OLAP functions there is distribution of the data based on PARTTITION and ORDER and for CSUM(1,1) there's only 1 partition and no order. This can be seen in the diagram below:
Statistics and query log comparison between CSUM and ROW_NUMBER in Teradata |
Let us know if you need more information. Like us on Twitter or Facebook.
No comments:
Post a Comment
Please share your thoughts and let us know the topics you want covered