Data Analysis Example: April 2015

Monday, 20 April 2015

Find the combination of strings in all orders and delete the duplicate records - Amazing Interview Question

This was a question posted to me in the in-the premises round of a leading E-commerce company.

Problem; Find the count of strings and their combinations in a table, and count all the combination of letters as duplicates.

If AB is a string, then BA is a duplicate string. And count(AB) should be 2.

The Solution attempted by me is as follows:

create table table1
(
value1 varchar(30));

Now , in the list below, AB and BA are treated as duplicates.
Similarly, 'CD' and 'DC' are duplicates.
We need to find the count of all combinations possible for 'AB' and 'CD' .

insert into table1 values ('AB');
insert into table1 values ('BA');
insert into table1 values ('CD');
insert into table1 values ('DC');

select * from table1;

select A.value1,count(*) from table1 as t1
inner join
(
select
case when value1 > value_ref then value_ref else value1 end as value1
, value_ref
, value1 as val1

from
(
select value1, concat(substr(value1,2,1),substr(value1,1,1)) as value_ref from table1
group by value_ref
) as A) as A
on t1.value1 = A.value_ref
group by A.value1;

The trick is to create a lookup with all the values in the table.
The inner query does this trick as shown below:

Lookup table of values and their combination

Result set will be:

Count of all combinations possible

Write your solution in the comment section.

Tuesday, 14 April 2015

Joins in Teradata - TD join strategies and the join confidences

Teradata internally joins all tables as AMP-local joins. So, even if the data lies in separate AMPs they will re-distributed based on some metadata information stored in the DBC tables. Redistribution is performed to make the joins behave like AMP-local joins.

Now, the join strategy can be defined by many factors. A few frequently occurring confidence levels can tell you about the join performances.

Index Join confidence: 2 tables are joined based on a column A, A being an index in TableA. If the join mentions A=B and column B is neither an index, nor has stats collected, we get a index join confidence. So the optimizer knows that one side is an index, but the confidence is low since information about the other column (B) is not available. This is better than a no confidence.
Index join confidence will lead to data being read using the index sub tables.

Low confidence: Teradata has partial information about the column demographics and hence the confidence is low. Collect stats on the column will lead to higher confidence.

High Confidence: If all the columns in the where clause or the join condition have updated statistics information, then Teradata optimizer can predict the space and time required to complete each step with confidence. This is the best confidence level.

The most common join strategies are:

Product Join
Merge Join
Exclusion Join
Hash Join
Nested Join

Each join strategy has its own pros and cons, and it's hard to say which one is the best, depending on different circumstances. The optimizer will choose the best join strategy based on data demographics, statistics and indexes if any of them are available. Using EXPLAIN can help find out what join strategies are to be adopted.

No matter which join strategy, it is always applied between two tables. The more tables, the more join steps. Rows must be on the same AMP to be joined. So row distribution or duplication is unavoidable for some join strategies.

1. Product Join

This is the most basic and straightforward join strategy. In order to find a match between two tables with a join condition which is not based on equality (>, <, <>), or join conditions are ORed together.

The reason why we call it "Product" join is that, the number of comparisons required is the "product" of the number of rows of both tables. For example, table t1 has 10 rows, and table t2 has 25 rows, then it would require 10x25=250 comparisons to find the matching rows.

When the WHERE clause is missing, it will cause a special product join, called Cartesian Join or Cross Join, which will return all the combination of rows from both tables. In the above example, 250 rows will be returned as the result.

It is referred to as Nested-loops Join by vendors like IBM and Oracle, which also makes sense, when mapping it to the algorithm.

2. Merge Join

This is a much more efficient join strategy. It is adopted when the join conditions are based on equality (=). There is a prerequisite though: the two tables must be sorted based on the join column in advance (actually it's sorted based on the join column row hash sequence). That's why Oracle calls it Sort-Merge Join. That brings a great advantage for this type of join: both tables only need to be scanned once, in an interleaved manner.

Merge join is not necessarily always better than product join, due to the fact that sorting is required. If both tables are huge, sorting can be a tremendous effort.

3. Exclusion Join

This join strategy is used to find non-matching rows. If the query contains "NOT IN" or "EXCEPT", exclusion join will be picked. As a matter of fact, this kind of join can be done as either Merge Join or Product Join.

One thing worth noticing: exclusion merge join is based on set subtraction operation, and a three-value logic (TRUE, FALSE, UNKNOWN) will be used when comparisons is done on nullable columns (or temporary result set).

4. Hash Join

Hash Join gets its name from the fact that one smaller table is built as "hash-table", and potential matching rows from the second table are searched by hashing against the smaller table.

Usually optimizer will first identify a smaller table, and then sort it by the join column row hash sequence. If the smaller table is really small and can fit in the memory, the performance will be best. Otherwise, the sorted smaller table will beduplicated to all the AMPs. Then the larger table is processed one row at a time by doing a binary search of the smaller table for a match.

Hash Join is also based on equality condition (=).

5. Nested Join

Don't get confused with "Nested-loops Join", which is the term used by Oracle, IBM and Microsoft. In Teradata, Product Join is the counterpart of "Nested-loops Join" in other RDBMS.

However, Nested Join can be seen as an enhanced version of the common "Nested-loops Join", where Teradata takes advantage of its index structure. In order to make Nested Join picked, the following conditions must be satisfied:
1) The join condition is based on equality;
2) The join column is a unique index on one table;
3) The join column is any index on another table.

Based on conditions above it is not hard to infer how Nested Join works. First only one single row will be retrieved from one table with the help of the unique index, and then based on the row hash of that row, another table is accessed by some index.

Nested Join is the most efficient join method in Teradata. It is also the only join method that don't always use all the AMPs.

Recursive query in Teradata - Definition and Example using the WITH RECURSIVE keyword

Recursive Queries use the SEED query to iterate over the RECURSIVE block until the block is empty.

Please go through the example below to understand the implementation.

/* Create a table to hold the data */

create multiset table financial.employee

(emp_id integer,

mgr_id integer,

ename varchar(50)

)primary index(emp_id);

insert into financial.employee(2000,,'Oliver');
insert into financial.employee(2001,2000,'tom');
insert into financial.employee(2002,2001,'santosh');
insert into financial.employee(2003,2001,'preeti');
insert into financial.employee(2004,2001,'sree');
insert into financial.employee(2005,2001,'tapas');
insert into financial.employee(2006,2001,'mani');
insert into financial.employee(2007,2002,'prabhu');

select * from financial.employee;

database financial;

As we see in the above data, each employee has a manager assigned to him/her. If the employee with id 2007 wants to query the table to know his managers (Level 1, Level 2 and so on), he should use a recursive query.

The RECURSIVE block (not in bold) will be executed and a row will be inserted in mgr_tbl( the recursive table). The seed is for employee 'Prabhu'. The iteration will continue till the time all the depth are not covered.

/* Let us now write a recursive query to find all the employee manager ladders */

/* The SEED part is marked in bold. The recursive part is executed till the time it returns rows */

WITH RECURSIVE mgr_tbl(emp_id, mgr_id, mgr_name, depth) AS

(

SELECT a.emp_id, a.mgr_id, a1.ename as mgr_name, 1 as depth

FROM employee a

inner join employee a1

on a.mgr_id = a1.emp_id

where a.emp_id = 2007

UNION ALL

SELECT mgr_tbl.emp_id, a.mgr_id, a1.ename

,mgr_tbl.depth+1

FROM mgr_tbl inner join employee a

on mgr_tbl.mgr_id = a.emp_id

inner join

employee a1

on a.mgr_id = a1.emp_id

---WHERE a.mgr_id = b.emp_id

)

select * from mgr_tbl;

Result will be as follows:

The Depth represents Level of each Manager for employee id 2007 using recursive query

Please like us on Facebook or Google+ if you like our post. Leave your comments below.

Wednesday, 8 April 2015

Constraints in Teradata - With detailed example and work logs and Syntax

Constraints can be grouped as Column level and Table level constraints.

Column constraints apply to single columns as a part of the column definition. Column constraints include:

•CHECK constraint definition clause on a single column

•PRIMARY KEY constraint definition clause on a single column

•REFERENCES constraint definition clause on a single column

•UNIQUE constraint definition clause

Table constraints apply to multiple columns. Table-level constraints include:

•CHECK constraint definition clause on multiple columns

•REFERENCES constraint definition clause on multiple columns

•PRIMARY KEY constraint definition clause on multiple columns

•UNIQUE constraint definition clause on multiple columns

•FOREIGN KEY constraint definition clause

FOREIGN KEY constraint definitions must also specify a REFERENCES clause.

Please go through the below work log to understand more about constraints in Teradata.

/* CONSTRAINTS IN TERADATA */

database financial;

/* The base table should exists and a primary key should be defined already */

/* Only then Foreign key/referential integrity creation will be allowed */

create table custonline_constraint_1

(

cust_id integer not null primary key

) ;

/* Add the integrity constraints on the child table */

drop table constraint_test;

create multiset table constraint_test

(

cust1_id integer not null unique check (cust1_id > 0)

, cust_name varchar(50)

, c_id integer references custonline_constraint_1(cust_id)

,check (char_length(cust_name) > 1)

)

primary index(cust1_id, cust_name);

/* The below insert will give "CHECK CONSTRAINT VIOLATION" as the value we are inserting for cust1_id is 0

* The check constraint allows value which are greater than zero

insert into constraint_test values(0,'prabhu', 11);

/* we will need to have the foreign key value for c_id column in the parent table (custonline_constraint_1)

* if the value doesnt exists, we will get an error "REFERENTAIL CONSTRAINT VIOLATION"

insert into constraint_test values(1,'prabhu', 11);

/* Let us now check the with check option of views

* If any row is inserted in the table which will not be displayed in the view, then those rows are discarded

* STEP-1 will be to insert records in the base table custonline_constraint_1

* STEP-2 will insert the FOREIGN KEY into the table constraint_test

insert into custonline_constraint_1 values(11);

insert into constraint_test values(1,'prabhu', 11);

/* Create the view with check option

* Then try to insert rows having cust_id > 10 using the view

replace view v_constraint_test as select * from constraint_test where cust1_id < 10 with check option;

insert into v_constraint_test values(13,'prabhu', 11);

/* If you insert rows directly into the table, the rows with cust1_id > 10 will be inserted

* The check constraint only restricsts inserts or updates using the view

insert into constraint_test values(13,'prabhu', 11);

/* The below insert statement fails as there is check constraint that requires values of cust_name

* to have length larger than one character

insert into constraint_test values(12,'A', 11);

--- successfully inserts 1 row into the table

select * from v_constraint_test;

Please go through the above queries to understand each concept.

The next blog will contain a few additional implementation of constraints.

Data Analysis Example

Pages