Tuesday 18 December 2012

Distinct and Group by in Teradata - What to use and when to use?


Distinct and Group-by are both used to identify unique rows in Teradata.

Generally for Large Dataset, Group-by is preferred. Why?


Group-by distributes the task of filtering unique rows to individual AMPs. So, the operation is AMP local is less CPU consuming. Look at the below explain steps to understand better.


  1) First, we lock a distinct PRD_DB."pseudo table" for
     read on a RowHash to prevent global deadlock for
     PRD_DB.FACT_SHC_DLY_OPR_INS.
  2) Next, we lock PRD_DB.FACT_SHC_DLY_OPR_INS for read.
  3) We do an all-AMPs SUM step to aggregate from a single partition of
     PRD_DB.FACT_SHC_DLY_OPR_INS
with a condition of (
     "PRD_DB.FACT_SHC_DLY_OPR_INS.DAY_NBR = DATE
     '2012-12-17'") with a residual condition of (
     "PRD_DB.FACT_SHC_DLY_OPR_INS.DAY_NBR = DATE
     '2012-12-17'") , grouping by field1 (
     PRD_DB.FACT_SHC_DLY_OPR_INS.VEND_PACK_ID
     ,PRD_DB.FACT_SHC_DLY_OPR_INS.LOCN_NBR).  Aggregate Intermediate
     Results are computed locally
, then placed in Spool 3.  The
     aggregate spool file will not be cached in memory.  The size of
     Spool 3 is estimated with high confidence to be 3,776,325,574 rows
     (241,684,836,736 bytes).  The estimated time for this step is
     49.05 seconds.


When we do the same operation using distinct, data is first copied to the spool and then the filter operation is performed.

Distinct is feasible for small dataset whereas Group-by is better for Large data set.

So, what happens when your data is skewed


In this scenario, distinct works faster, since the filter operation is done at Spool Level.
For group-by, the skewed AMP will take relatively longer time to complete its share of operation. And all the participating AMPs will have to wait for the skewed AMP. Hence the performance degrades and TotalAmpCPU increases.

No comments:

Post a Comment

Please share your thoughts and let us know the topics you want covered