Distinct and Group-by are both used to identify unique rows in Teradata.
Generally for Large Dataset, Group-by is preferred. Why?
Group-by distributes the task of filtering unique rows to individual AMPs. So, the operation is AMP local is less CPU consuming. Look at the below explain steps to understand better.
1) First, we lock a distinct PRD_DB."pseudo table" for
read on a RowHash to prevent global deadlock for
PRD_DB.FACT_SHC_DLY_OPR_INS.
2) Next, we lock PRD_DB.FACT_SHC_DLY_OPR_INS for read.
3) We do an all-AMPs SUM step to aggregate from a single partition of
PRD_DB.FACT_SHC_DLY_OPR_INS with a condition of (
"PRD_DB.FACT_SHC_DLY_OPR_INS.DAY_NBR = DATE
'2012-12-17'") with a residual condition of (
"PRD_DB.FACT_SHC_DLY_OPR_INS.DAY_NBR = DATE
'2012-12-17'") , grouping by field1 (
PRD_DB.FACT_SHC_DLY_OPR_INS.VEND_PACK_ID
,PRD_DB.FACT_SHC_DLY_OPR_INS.LOCN_NBR). Aggregate Intermediate
Results are computed locally, then placed in Spool 3. The
aggregate spool file will not be cached in memory. The size of
Spool 3 is estimated with high confidence to be 3,776,325,574 rows
(241,684,836,736 bytes). The estimated time for this step is
49.05 seconds.
When we do the same operation using distinct, data is first copied to the spool and then the filter operation is performed.
Distinct is feasible for small dataset whereas Group-by is better for Large data set.
So, what happens when your data is skewed
In this scenario, distinct works faster, since the filter operation is done at Spool Level.
For group-by, the skewed AMP will take relatively longer time to complete its share of operation. And all the participating AMPs will have to wait for the skewed AMP. Hence the performance degrades and TotalAmpCPU increases.
No comments:
Post a Comment
Please share your thoughts and let us know the topics you want covered