Monday, 10 December 2012

The Teradata ANSI Merge - Implementing UPSERT in Teradata

Update and Insert - The Concept


In SQL, we have the concept of UPDATE-else-INSERT (commonly called UPSERT), which means:

If a row already exists in the table and its attributes have changed, update it.
Else, if the row is new, insert it into the table.
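
For comparison, this is the shape of the traditional Teradata atomic UPSERT: a single UPDATE ... ELSE INSERT statement. The table and column names below (SALES_TGT, ITEM_ID, DAY_NBR, TTL_UN_QT) are hypothetical placeholders, not objects from the example later in this post:

-- Traditional Teradata atomic UPSERT (pre-MERGE style).
-- The WHERE clause must fully specify the primary index of SALES_TGT.
UPDATE SALES_TGT
SET    TTL_UN_QT = 100
WHERE  ITEM_ID = 12345
AND    DAY_NBR = DATE '2012-12-10'
ELSE INSERT INTO SALES_TGT (ITEM_ID, DAY_NBR, TTL_UN_QT)
VALUES (12345, DATE '2012-12-10', 100);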

Since V12.0, Teradata has supported the ANSI-standard MERGE statement, known as the ANSI Merge, which performs this UPSERT in a single statement.

Key benefits/Advantages:

1. ANSI Merge is more portable across platforms, since MERGE is part of the SQL standard.
2. ANSI Merge decides which rows to update and which to insert based on the match condition in the ON clause (see the skeleton below).
3. ANSI Merge ignores columns that are not included in the ON condition or the SET column list; they are left untouched.
4. ANSI Merge improves query performance, because it is a single-table read, whereas the traditional UPSERT is a two-table read.
It uses a new step called MRM (Merge Row Multiple operations), which performs the updates and the inserts in a single step.
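
Before the full production example below, here is a minimal skeleton of an ANSI MERGE. TARGET_TBL, STAGING_TBL, KEY_COL and VAL_COL are hypothetical names used only for illustration. Note that in Teradata the ON clause must include an equality condition on the target table's primary index (and its partitioning column, if any):

-- Minimal ANSI MERGE skeleton (table and column names are placeholders).
MERGE INTO TARGET_TBL AS T
USING (SELECT KEY_COL, VAL_COL FROM STAGING_TBL) AS S
ON T.KEY_COL = S.KEY_COL               -- match on the target's primary index
WHEN MATCHED THEN UPDATE
    SET VAL_COL = S.VAL_COL            -- only columns in SET are modified
WHEN NOT MATCHED THEN INSERT (KEY_COL, VAL_COL)
    VALUES (S.KEY_COL, S.VAL_COL);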

Disadvantages:

1. The MERGE USING table (derived table S in the example) and the target table (ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS), when joined on the ON condition, must return unique rows.
If duplicates, i.e. multiple source rows, qualify to UPDATE the same target row, MERGE quite logically fails. One way to deduplicate the source is sketched below.
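
If the source can contain duplicates on the join keys, a common workaround is to deduplicate it inside the USING derived table before the merge. A sketch using Teradata's QUALIFY clause; TARGET_TBL, STAGING_TBL, KEY_COL, DAY_NBR, UN_QT and LOAD_TS are hypothetical names:

-- Keep only the latest source row per key, so each target row
-- is updated by at most one source row.
MERGE INTO TARGET_TBL
USING (
    SELECT KEY_COL, DAY_NBR, UN_QT
    FROM   STAGING_TBL
    QUALIFY ROW_NUMBER() OVER (PARTITION BY KEY_COL, DAY_NBR
                               ORDER BY LOAD_TS DESC) = 1
) S
ON  TARGET_TBL.KEY_COL = S.KEY_COL
AND TARGET_TBL.DAY_NBR = S.DAY_NBR
WHEN MATCHED THEN UPDATE SET UN_QT = S.UN_QT
WHEN NOT MATCHED THEN INSERT (KEY_COL, DAY_NBR, UN_QT)
VALUES (S.KEY_COL, S.DAY_NBR, S.UN_QT);

(The GROUP BY in the example below achieves the same thing, by aggregating the duplicates away.)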


Example:


-- The USING derived table aggregates the staging rows so that each
-- key combination appears exactly once (MERGE requires unique source rows).
MERGE INTO ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS
USING (
    SELECT VEND_PACK_OPR_ID, LOCN_OPR_NBR, DAY_NBR, MDS_STS, RPL_TYP_CD, INS_TYP_CD, INS_SUB_TYP_CD,
           SUM(UN_QT) AS UN_QT, SUM(CST_DLR) AS CST_DLR, SUM(SLL_DLR) AS SLL_DLR
    FROM STG_VIEWS_PRD.INCR_SHC_DLY_OPR_INS
    WHERE DAY_NBR <> '9999-09-09'
    GROUP BY VEND_PACK_OPR_ID, LOCN_OPR_NBR, DAY_NBR, MDS_STS, RPL_TYP_CD, INS_TYP_CD, INS_SUB_TYP_CD
) S
-- The ON clause defines the match condition; it must cover the target's primary index.
ON  ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.VEND_PACK_ID = S.VEND_PACK_OPR_ID
AND ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.LOCN_NBR = S.LOCN_OPR_NBR
AND ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.INS_TYP_CD = S.INS_TYP_CD
AND ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.INS_SUB_TYP_CD = S.INS_SUB_TYP_CD
AND ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.DAY_NBR = S.DAY_NBR

-- Matched rows: update only the columns listed in SET.
WHEN MATCHED THEN UPDATE
SET RPL_TYP_CD  = S.RPL_TYP_CD
,   TTL_UN_QT   = S.UN_QT
,   TTL_CST_DLR = S.CST_DLR
,   TTL_SLL_DLR = S.SLL_DLR
,   PVE_UN_QT   = CASE WHEN S.UN_QT > 0 THEN S.UN_QT   ELSE 0 END
,   PVE_CST_DLR = CASE WHEN S.UN_QT > 0 THEN S.CST_DLR ELSE 0 END
,   PVE_SLL_DLR = CASE WHEN S.UN_QT > 0 THEN S.SLL_DLR ELSE 0 END

-- Unmatched rows: insert them as new rows.
WHEN NOT MATCHED THEN INSERT
VALUES (
    S.VEND_PACK_OPR_ID,
    S.LOCN_OPR_NBR,
    S.DAY_NBR,
    S.MDS_STS,
    S.RPL_TYP_CD,
    S.INS_TYP_CD,
    S.INS_SUB_TYP_CD,
    S.UN_QT,
    S.CST_DLR,
    S.SLL_DLR,
    CASE WHEN S.UN_QT > 0 THEN S.UN_QT   ELSE 0 END,
    CASE WHEN S.UN_QT > 0 THEN S.CST_DLR ELSE 0 END,
    CASE WHEN S.UN_QT > 0 THEN S.SLL_DLR ELSE 0 END
);


Explain:

So, let's see what the explain plan says:

1) First, we lock a distinct ARP_TBLS_PRD."pseudo table" for write
     on a RowHash to prevent global deadlock for
     ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.
  2) Next, we lock STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS for access,
     and we lock ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS for write.
  3) We do an all-AMPs SUM step to aggregate from
     STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS by way of an all-rows scan
     with a condition of (
     "STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.DAY_NBR <> DATE
     '9999-09-09'") , grouping by field1 (
     STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.VEND_PACK_OPR_ID
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.LOCN_OPR_NBR
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.DAY_NBR
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.MDS_STS
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.RPL_TYP_CD
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.INS_TYP_CD
     ,STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.INS_SUB_TYP_CD).
     Aggregate Intermediate Results are computed locally, then placed
     in Spool 3.  The size of Spool 3 is estimated with no confidence
     to be 7,696,317 rows (692,668,530 bytes).  The estimated time for
     this step is 0.08 seconds.
  4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
     an all-rows scan into Spool 1 (used to materialize view, derived
     table or table function S) (all_amps) (compressed columns allowed),
     which is redistributed by the rowkey of (
     STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.VEND_PACK_OPR_ID,
     STG_TBLS_PRD.INCR_SHC_DLY_OPR_INS.DAY_NBR) to all AMPs.  Then
     we do a SORT to partition Spool 1 by rowkey.  The size of Spool 1
     is estimated with no confidence to be 7,696,317 rows (461,779,020
     bytes).  The estimated time for this step is 0.04 seconds. 



This is the portion referred to in Point 3: only the columns listed in the merge condition and the SET clause take part in the MERGE; any other column of the target table is ignored (left unchanged).

How does this help?

Suppose you have 50,000 rows, of which only 1,000 have changed. MERGE will automatically update just those 1,000 rows and not touch the remaining 49,000.
Also, try to make sure that the source and target tables have the same PI (primary index) and PPI (partitioned primary index), which makes the operation even faster; a hypothetical DDL sketch follows.
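
As an illustration, a target DDL consistent with the rowkey redistribution on (VEND_PACK_OPR_ID, DAY_NBR) seen in step 4 of the explain might look like this. The column types, the partition range, and the _DEMO table name are assumptions, not taken from the real table:

-- Hypothetical DDL: PI on VEND_PACK_ID, PPI on DAY_NBR, so that source
-- rows redistributed by rowkey land on the AMP that holds the target row.
CREATE TABLE ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS_DEMO (
    VEND_PACK_ID   INTEGER NOT NULL,
    LOCN_NBR       INTEGER NOT NULL,
    DAY_NBR        DATE    NOT NULL,
    TTL_UN_QT      DECIMAL(18,2)
)
PRIMARY INDEX (VEND_PACK_ID)
PARTITION BY RANGE_N(DAY_NBR BETWEEN DATE '2010-01-01'
                             AND     DATE '2015-12-31'
                             EACH INTERVAL '1' MONTH);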


  5) We do an all-AMPs merge with matched updates and unmatched inserts
     into ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS from Spool 1 (Last
     Use) with a condition of (
     "(ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.VEND_PACK_ID =
     VEND_PACK_OPR_ID) AND
     ((ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.LOCN_NBR =
     LOCN_OPR_NBR) AND
     ((ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.INS_TYP_CD =
     INS_TYP_CD) AND
     ((ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.INS_SUB_TYP_CD =
     INS_SUB_TYP_CD) AND
     (ARP_TBLS_PRD.WRK_FACT_SHC_DLY_OPR_INS.DAY_NBR = DAY_NBR ))))").
     The number of rows merged is estimated with no confidence to be
     7,696,317 rows.
  6) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> No rows are returned to the user as the result of statement 1.
