Data Analysis Example: Joins in Teradata - TD join strategies and the join confidences

Teradata internally joins all tables as AMP-local joins. So, even if the data lies in separate AMPs they will re-distributed based on some metadata information stored in the DBC tables. Redistribution is performed to make the joins behave like AMP-local joins.

Now, the join strategy can be defined by many factors. A few frequently occurring confidence levels can tell you about the join performances.

Index Join confidence: 2 tables are joined based on a column A, A being an index in TableA. If the join mentions A=B and column B is neither an index, nor has stats collected, we get a index join confidence. So the optimizer knows that one side is an index, but the confidence is low since information about the other column (B) is not available. This is better than a no confidence.
Index join confidence will lead to data being read using the index sub tables.

Low confidence: Teradata has partial information about the column demographics and hence the confidence is low. Collect stats on the column will lead to higher confidence.

High Confidence: If all the columns in the where clause or the join condition have updated statistics information, then Teradata optimizer can predict the space and time required to complete each step with confidence. This is the best confidence level.

The most common join strategies are:

Product Join
Merge Join
Exclusion Join
Hash Join
Nested Join

Each join strategy has its own pros and cons, and it's hard to say which one is the best, depending on different circumstances. The optimizer will choose the best join strategy based on data demographics, statistics and indexes if any of them are available. Using EXPLAIN can help find out what join strategies are to be adopted.

No matter which join strategy, it is always applied between two tables. The more tables, the more join steps. Rows must be on the same AMP to be joined. So row distribution or duplication is unavoidable for some join strategies.

1. Product Join

This is the most basic and straightforward join strategy. In order to find a match between two tables with a join condition which is not based on equality (>, <, <>), or join conditions are ORed together.

The reason why we call it "Product" join is that, the number of comparisons required is the "product" of the number of rows of both tables. For example, table t1 has 10 rows, and table t2 has 25 rows, then it would require 10x25=250 comparisons to find the matching rows.

When the WHERE clause is missing, it will cause a special product join, called Cartesian Join or Cross Join, which will return all the combination of rows from both tables. In the above example, 250 rows will be returned as the result.

It is referred to as Nested-loops Join by vendors like IBM and Oracle, which also makes sense, when mapping it to the algorithm.

2. Merge Join

This is a much more efficient join strategy. It is adopted when the join conditions are based on equality (=). There is a prerequisite though: the two tables must be sorted based on the join column in advance (actually it's sorted based on the join column row hash sequence). That's why Oracle calls it Sort-Merge Join. That brings a great advantage for this type of join: both tables only need to be scanned once, in an interleaved manner.

Merge join is not necessarily always better than product join, due to the fact that sorting is required. If both tables are huge, sorting can be a tremendous effort.

3. Exclusion Join

This join strategy is used to find non-matching rows. If the query contains "NOT IN" or "EXCEPT", exclusion join will be picked. As a matter of fact, this kind of join can be done as either Merge Join or Product Join.

One thing worth noticing: exclusion merge join is based on set subtraction operation, and a three-value logic (TRUE, FALSE, UNKNOWN) will be used when comparisons is done on nullable columns (or temporary result set).

4. Hash Join

Hash Join gets its name from the fact that one smaller table is built as "hash-table", and potential matching rows from the second table are searched by hashing against the smaller table.

Usually optimizer will first identify a smaller table, and then sort it by the join column row hash sequence. If the smaller table is really small and can fit in the memory, the performance will be best. Otherwise, the sorted smaller table will beduplicated to all the AMPs. Then the larger table is processed one row at a time by doing a binary search of the smaller table for a match.

Hash Join is also based on equality condition (=).

5. Nested Join

Don't get confused with "Nested-loops Join", which is the term used by Oracle, IBM and Microsoft. In Teradata, Product Join is the counterpart of "Nested-loops Join" in other RDBMS.

However, Nested Join can be seen as an enhanced version of the common "Nested-loops Join", where Teradata takes advantage of its index structure. In order to make Nested Join picked, the following conditions must be satisfied:
1) The join condition is based on equality;
2) The join column is a unique index on one table;
3) The join column is any index on another table.

Based on conditions above it is not hard to infer how Nested Join works. First only one single row will be retrieved from one table with the help of the unique index, and then based on the row hash of that row, another table is accessed by some index.

Nested Join is the most efficient join method in Teradata. It is also the only join method that don't always use all the AMPs.

Data Analysis Example

Pages

Tuesday, 14 April 2015

Joins in Teradata - TD join strategies and the join confidences

No comments:

Post a Comment

Knowledge Archive