Saturday, 8 August 2015

Understand how partitioning improves query performance using Teradata EXPLAIN

Today, we will see if adding a partition-by clause improves our SELECT queries in Teradata.

Let us go through the example below:

DATABASE RETAIL;

SELECT * FROM CUSTOMER;


ALTER TABLE CUSTOMER ADD JOINING_DT DATE FORMAT 'YYYY-MM-DD';

UPDATE CUSTOMER SET JOINING_DT = '2015-02-01' WHERE CUST_ID = 1;
UPDATE CUSTOMER SET JOINING_DT = '2015-05-01' WHERE CUST_ID = 201;

/* lET US RUN A SELECT FOR A PARTICULAR MONTH OF THE YEAR */

SELECT * FROM CUSTOMER WHERE JOINING_DT BETWEEN '2015-01-01' AND '2015-01-31';

/* "  1) First, we lock a distinct RETAIL.""pseudo table"" for read on a"
     RowHash to prevent global deadlock for RETAIL.CUSTOMER.
"  2) Next, we lock RETAIL.CUSTOMER for read."
  3) We do an all-AMPs RETRIEVE step from RETAIL.CUSTOMER by way of an
     all-rows scan with a condition of ("(RETAIL.CUSTOMER.JOINING_DT <=
     DATE '2015-01-31') AND (RETAIL.CUSTOMER.JOINING_DT >= DATE
"     '2015-01-01')"") into Spool 1 (group_amps), which is built locally"
     on the AMPs.  The size of Spool 1 is estimated with no confidence
     to be 2 rows (102 bytes).  The estimated time for this step is
     0.03 seconds.
*/

As we see in STEP-3, the SELECT with range based WHERE results in a Full Table Scan. This is not a great way to query large tables.

Let us explore the alternate approach.

CREATE TABLE CUSTOMER_1
(
CUST_ID INTEGER
,CUST_NAME VARCHAR(30)
,JOINING_DT DATE FORMAT 'YYYY-MM-DD'
)
PRIMARY INDEX(JOINING_DT)
PARTITION BY (RANGE_N(
JOINING_DT BETWEEN '2014-01-01' AND '2014-12-31' EACH INTERVAL '1' YEAR
,'2015-01-01' AND '2015-12-31' EACH INTERVAL '1' MONTH
, NO RANGE OR UNKNOWN))
;

We insert the same data as was present in Customer table.

INSERT INTO CUSTOMER_1 SELECT * FROM CUSTOMER;

SELECT * FROM CUSTOMER_1 WHERE JOINING_DT BETWEEN '2015-01-01' AND '2015-01-31';

/* IF WE RERUN THE QUERY ON CUSTOMER TABLE ON THE NEW PARTIIONED TABLE THE EXPLAIN CHANGES AS BELOW

"  1) First, we lock a distinct RETAIL.""pseudo table"" for read on a"
     RowHash to prevent global deadlock for RETAIL.CUSTOMER_1.
"  2) Next, we lock RETAIL.CUSTOMER_1 for read."
  3) We do an all-AMPs RETRIEVE step from a single partition of
     RETAIL.CUSTOMER_1 with a condition of (
     "(RETAIL.CUSTOMER_1.JOINING_DT <= DATE '2015-01-31') AND
     (RETAIL.CUSTOMER_1.JOINING_DT >= DATE '2015-01-01')") into Spool 1
"     (group_amps), which is built locally on the AMPs.  The size of"
     Spool 1 is estimated with low confidence to be 4 rows (204 bytes).
     The estimated time for this step is 0.03 seconds.

*/

As we see now that the partition is leveraged when the partition column is used in the WHERE clause of a SELECT query. This results in faster query execution.

Join us in the classroom for more stimulating sessions on Teradata.


No comments:

Post a Comment

Please share your thoughts and let us know the topics you want covered