Wednesday 21 October 2015

Minimum or Min function in Hive - Use of min() over(partition by ) in Hiveql

 Say , we have this table products in Hive.
+----------+-------+-------+
|   name   | price | notes |
+----------+-------+-------+
| product1 |   100 |       |
| product1 |   200 | note1 |
| product2 |    10 | note2 |
| product2 |     5 | note2 |
+----------+-------+-------+
and I expect to get this result (distinct of products with minimum price)
+----------+-------+-------+
|   name   | price | notes |
+----------+-------+-------+
| product1 |   100 |       |
| product2 |     5 | note2 |
+----------+-------+-------+
How do we go about it:

1. The subquery approach:

select a.name,a.price,b.notes from (select name,min(price) as price from products group by name) as a inner join (select name,price,notes from products) as b on a.name = b.name and a.price = b.price;

2. The alternate and better approach is to use window partitioning:

select name,price,notes from (select *, min(price)over(partition by name) as min_price from products) as 
where a.price = a.min_price;


The 2nd approach is better as it invokes a single-map task to read the table while the first approach will invoke 2 map tasks even on a single node system.