Sunday 2 October 2016

All about Greenplum Partitioning - The system catalog tables at your service

I have been looking for better ways to identify how a Greenplum table is partitioned.

The queries below will help you find all the partitioning details for a Greenplum table.

In the next post, we will see how to leverage this information and automate some of your data engineering activities using Python. Finally, we will create a pipeline to move the data to HDFS using Apache Sqoop.


DROP TABLE db_retail.sales;
CREATE TABLE db_retail.sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( START (date '2008-01-01') INCLUSIVE
   END (date '2009-01-01') EXCLUSIVE EVERY (INTERVAL '1 month')
  ,START (date '2009-01-01') INCLUSIVE
   END (date '2010-01-01') EXCLUSIVE
   EVERY (INTERVAL '1 day') );

DROP TABLE db_retail.sales_amt;
CREATE TABLE db_retail.sales_amt (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (amt)
( START (1) INCLUSIVE
   END (10000) EXCLUSIVE EVERY (10)
 );

--- check all the partitions for the table
select * from pg_partitions where schemaname = 'db_retail' and tablename = 'sales';

select * from pg_partitions where schemaname = 'db_retail' and tablename = 'sales_amt';


--- query to list each inherited (child) partition table

select partitiontablename from pg_partitions where schemaname = 'db_retail' and tablename = 'sales';

select * from pg_inherits
join pg_class c on c.oid = inhrelid
where inhparent = 'db_retail.sales'::regclass;

--- fetch the interval-wise min and max ranges for each partition

select min(partitionrangestart),max(partitionrangeend)
, current_date 
, partitioneveryclause
, partitiontype 
from pg_partitions where schemaname = 'db_retail' and tablename = 'sales'
group by 3,4,5;

select * from information_schema.tables limit 10;

--- fetch the column definitions (useful when recreating the table with the desired partitioning)
select column_name, data_type, character_maximum_length
from INFORMATION_SCHEMA.COLUMNS where table_name = 'sales' and table_schema = 'db_retail';

--- find the partition column list
select columnname, position_in_partition_key from pg_partition_columns where schemaname = 'db_retail' and tablename = 'sales';
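
As a small preview of the Python automation, here is a minimal sketch that pulls the child-partition details from pg_partitions so they can be fed into downstream jobs. It assumes the psycopg2 driver and placeholder connection settings (host, database, user and password), so adjust them for your own cluster.

import psycopg2

def get_partitions(schema, table):
    # connection settings are placeholders - change them for your cluster
    conn = psycopg2.connect(host='localhost', dbname='gpdb',
                            user='gpadmin', password='changeme')
    try:
        cur = conn.cursor()
        # same catalog view used in the queries above
        cur.execute("""
            select partitiontablename, partitionrangestart, partitionrangeend
            from pg_partitions
            where schemaname = %s and tablename = %s
        """, (schema, table))
        return cur.fetchall()
    finally:
        conn.close()

for name, start, end in get_partitions('db_retail', 'sales'):
    print("%s | %s | %s" % (name, start, end))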


Sunday 15 May 2016

HortonWorks Sandbox Hive error - How to resolve the issue

If you have recently downloaded the HDP VM and tried to launch Hive, you may have encountered the error below:

[root@sandbox ~]# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.2.0-2950/spark/lib/spark-assembly-1.4.1.2.3.2.0-2950-hadoop2.7.1.2.3.2.0-2950.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties

Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user/root":hdfs:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:292)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:213)
        at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:300)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)

The reason is the permission on the HDFS directory /user, which is drwxr-xr-x and owned by hdfs:hdfs.

The Linux root user has no write access there and, since root is not the HDFS superuser, it cannot change the permissions either, as the session below shows.



Directory permission in HDFS



[root@sandbox ~]# hadoop fs -mkdir /user/root
mkdir: Permission denied: user=root, access=WRITE, inode="/user/root":hdfs:hdfs:drwxr-xr-x
[root@sandbox ~]# who am i
root     pts/0        2016-05-15 07:50 (192.168.10.1)
[root@sandbox ~]# hadoop fs -chmod 777 /user
chmod: changing permissions of '/user': Permission denied. user=root is not the owner of inode=user

The solution is to switch to a user that has the required permissions in HDFS - for example, the hdfs superuser - or to have that user create a home directory for root first.

Change User and login to Hive - Successful login
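
One way to do this (a sketch assuming the default HDP sandbox accounts) is to let the hdfs superuser create a home directory for root, hand it over, and then launch Hive again as root:

[root@sandbox ~]# su - hdfs
[hdfs@sandbox ~]$ hdfs dfs -mkdir -p /user/root
[hdfs@sandbox ~]$ hdfs dfs -chown root:root /user/root
[hdfs@sandbox ~]$ exit
[root@sandbox ~]# hive

Alternatively, simply run hive after switching to the hdfs user.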

Now, you are connected to Hive.

hive> show databases;
OK
default
xademo
Time taken: 1.32 seconds, Fetched: 2 row(s)
hive>

Like us on Facebook and Google if you liked the post. We provide live support for your project. Just fill in the Inquiry form on the home page.

Or click on the link and fill in your details:   https://www.surveymonkey.com/r/zz962jr


Thursday 21 April 2016

Basic Operations on Arrays - Python vs. Java


Below, we have listed the differences in basic array operations between Python and Java.
Python - declaring an array of 2 integers:

>>> arr = [0] * 2
>>> arr[0]
0
>>> arr[0] = 10
>>> arr[1] = 2
>>> arr[2] = 9
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
>>> arr.append(2) # this feature is not available in Java primitive arrays
>>>
Java - declaring an array of 2 integers:

public class hello_world {

 public static void main(String[] args) {
  int[] arr = new int[2];
  arr[0] = 10;
  arr[1] = 9;
  arr[2] = 5;
  System.out.println(arr[0] + " " + arr[1]);
 
 }
}

>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2
 at com.dmac.analytics.spark.hello_world.main(hello_world.java:9)


We can append values to an existing list in Python.
In Java, a traditional array is fixed in size, so it has to be copied into a new array with larger space allocated.

Java also provides library classes that give you a mutable, growable list.

Read up on java.util.List and its common implementation java.util.ArrayList to know more about these Java data structures.

Please let us know if you like our posts. Like us on Google+ and Facebook.  

Thursday 14 April 2016

Advanced structures - Dictionary of tuples in Python

Today, I want to show you how you can build dictionary entries that use a dictionary as part of the key and another dictionary as the value.

So, it is effectively a dict of dicts, and the entries are finally collected in a tuple.

Remember the rule of thumb that any key of a Python dictionary must be hashable. A plain dict is not, so we will use the third-party frozendict package.

A frozendict is hashable and can therefore be used inside a dictionary key in Python.
You can use the program below to search for a key in this dictionary of dictionaries in Python.
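
As a quick illustration (a small sketch assuming the frozendict package is installed, e.g. via pip), a plain dict cannot be hashed but a frozendict can, which is exactly what lets it sit inside a key:

import frozendict as util

plain = {'switch': 1}
frozen = util.frozendict(plain)
print(hash(frozen))      # works - frozendict is hashable
# print(hash(plain))     # would raise TypeError: unhashable type: 'dict'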

Let me know if you need any details. Like us on Google+ and Facebook if you like our posts.

import frozendict as util

### Create the first entry using a hashable frozendict as part of the key
match = util.frozendict({'switch': 1, 'dstmac': '00:00:00:00:00:01', 'srcmac': '00:00:00:00:00:01'})
match_array = tuple([match, 60000, 5])
count_bucket2 = dict(CountBucket1 = 140271057099664, CountBucket2 = 140271056501008)

### Create the second entry using hashable frozendict
match_entry = util.frozendict(switch= 5, dstmac= '00:00:00:00:00:00', srcmac= '00:00:00:00:00:01')
match_array1 = tuple([match_entry, 59999, 7])

count_bucket1 = dict(CountBucket1 = 140271056467472, CountBucket2 = 140271056411280)

# Initialize the dictionary of tuples
dict_of_tuples = ({tuple(match_array) : count_bucket2},{tuple(match_array1) : count_bucket1})

####### The entry you want to search for (a plain dict here, compared against the stored keys)
search_entry = [{'switch': 1, 'dstmac': '00:00:00:00:00:01', 'srcmac': '00:00:00:00:00:01'}, 60000, 5]

#Treating the final structure as a tuple. Each element of the tuple is a #dictionary.
k = 0
while k < len(dict_of_tuples):
    key = dict_of_tuples[k].iterkeys()
    val = dict_of_tuples[k].itervalues()
    if key.next() == tuple(search_entry):
        print ('Has key','corresponding value',val.next())
    else:
        print "Key not present"
    k+= 1

Wednesday 6 April 2016

Binary Tree implementation using Python - Python recursion usage

Below is a complete implementation of a BST (Binary Search Tree) in Python. A BST has the following properties:

1. Each parent can have at most 2 children.
2. The left child is always smaller than its parent.
3. The right child is always greater than (or, in this implementation, equal to) its parent.



## Implementation of a binary search tree in Python (Python 2 syntax)
# this is not a self-balancing tree, so insertion order can leave it skewed

class Node():
    def __init__(self,val,parent= None):
        self.val = val
        self.parent = parent
        self.left = None
        self.right = None
class Tree():
    def __init__(self):
        self.root = None
    def getRoot(self):
        return self.root
    def add(self,val):
        if self.root == None:
            self.root = Node(val)
        else:
            self._add(val,self.root)
    def _add(self,val,node):
        if val < node.val:
            if node.left is not None:
                self._add(val,node.left)
            else:
                node.left = Node(val,node)
        else:
            if node.right is not None:
                self._add(val,node.right)
            else:
                node.right = Node(val,node)
   
    def find(self, val):
        # Search complexity when balanced tree is O(logn)
        if self.root is not None:
            return self._find(val,self.root)
        else:
            return None
    def _find(self,val,node):
        #print val,node.val
        if val == node.val:
            #print "matched"
            return node
        elif val < node.val and node.left is not None:
            return self._find(val,node.left)
        elif val > node.val and node.right is not None:
            return self._find(val,node.right)
        else:
            print val,"value not found"
            return None
    def deleteTree(self):
        self.root = None
       
    def delete_node(self,val):
        node = self.find(val)
        if node is None:
            return None
        else:
            self._delete_node(val,node)
   
    def _delete_node(self,val,node):
        if node == self.root:
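            # note: this simplified implementation just drops the whole tree when the root is deleted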
            self.root = None
        else:
            # Case 1 when the node is a left node and a right node exists
            #if node == node.parent.left and node.left is not None:
            if node.left is not None:
                replacement = self.find(self.maximum(node.left.val))
                node.parent.left = replacement
                self.delete_node(replacement.val)
            #elif node == node.parent.right and node.right is not None:
            elif node.right is not None:
                r_val = node.right
                print "right node val",r_val.val
                print "minimum", self.minimum(r_val.val)
                replacement = self.find(self.minimum(node.right.val))
                # recursively delete the replacement from original position
                self.delete_node(replacement.val)
                node.parent.right = replacement
                replacement.parent = node.parent
                print "new right child",node.parent.right.val
               
                #removing all links for deleted node
                node.parent = None
                node.left = None
                node.right = None
            elif node.left is None and node.right is None:
                # Case 3 : Its a leaf node
                node.parent = None
               
   
   
   
    def printTree(self):
        if self.root is not None:
            self._printTree(self.root)
   
    # The _printTree helper performs an inorder tree walk
    def _printTree(self, node):
        #print "current node",node.val
        if node is not None:
            #print "called by",node.val
            self._printTree(node.left)
            print str(node.val) + ' '
           # print "right child of",node.val,node.right.val
            self._printTree(node.right)
           
    def minimum(self,val):
        node = self.find(val)
        #print node.val
        #minval = 0
        if node is not None:
            #print "recursive call"
            return self._minimum(val,node)
            #print "came back from child process",minval
        else:
            #"in final else"
            return node
    def _minimum(self,val,node):
        minval = 0
        if node.left is not None:
            #print  "parent node value",node.val, node.left.val
            return self._minimum(val,node.left)
        else:
            #print "In else", "parent node value",node.val
            minval = node.val
            #print "minval",minval
            return minval
        #print "outside now",minval
    def maximum(self,val):
        #print "find max"
        node = self.find(val)
        if node.right is not None:
            return self._maximum(val,node)
        else:
            #print "no right node",node.val
            return node.val
    def _maximum(self,val,node):
        if node.right is not None:
            return self._maximum(val,node.right)
        else:
            return node.val
    def successor(self,val):
        node = self.find(val)
        if node is None:
            return None
        elif node is not None and node.right is not None:
            return self._minimum(val,node.right)
        elif node.right is None:
            return self._successor(val,node,node.parent)
    def _successor(self,val,node,parent):
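        # climb towards the root until the current node is its parent's left child;
        # that parent holds the in-order successor value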
       
        if parent is None:
            return None
        elif parent.left is None :
            return self._successor(parent.val,parent,parent.parent)
        elif parent.left == node :
            #print "matched with left node"
            return parent.val
        elif parent.left != node:
            return self._successor(parent.val,parent,parent.parent)
    def predecessor(self,val):
        node = self.find(val)
        if node is None:
            return None
        elif node.left is not None:
            return self._maximum(node.left.val,node.left)
        elif node.left is None and node != node.parent.left:
            return node.parent.val
        else:
            return None
           
    #def transplant(T,u,v):
   
    def level_order_traversal(self,key=None):
        from copy import deepcopy
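        # breadth-first walk: p_node acts as a growing queue of visited nodes and
        # level[] records each node's depth so nodes on the same level are printed together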
        if key is None:
            node = self.root
        else:
            node = Node(key)
        p_node = [node]
        level = [0]
        i = 0
        for x in p_node:
            if x is not None:
                if x.parent is not None:
                    #print "chekcing the traversal",x.val,"parent",x.parent.val
                    idx = p_node.index(x.parent) + 1
                    #print "parent info",x.parent.val,idx
                    nxtlevel = idx + 1
                else:
                    idx = 0
                    nxtlevel = 1
                    print "at root", idx, p_node[idx].val
                    temp = []
                if x.left is not None:
                    #print "left node exists"
                    p_node.append(x.left)
                    level.append(nxtlevel)
                if x.right is not None:
                    #print "right node exists","nextlevel",nxtlevel
                    p_node.append(x.right)
                    level.append(nxtlevel)
                # i keeps track of the element position in p_node
                #print "index",i, "level[i]",level[i],"curr node",x.val
                print level
                try:
                    if level[i] == level[i+1]:
                        #print "append"
                        temp.append(x.val)
                    else:
                        temp.append(x.val)
                        print [element for element in temp]
                        temp = []
                except IndexError:
                    temp.append(x.val)
                    print [element for element in temp]
                    temp = []
                    print "no more elements"
            i += 1
    def pre_order_traversal(self):
        if self.root is not None:
            self._pre_order_traversal(self.root)
   
    # The _pre_order_traversal helper performs a preorder tree walk
    def _pre_order_traversal(self, node):
        #print "current node",node.val
        if node is not None:
            #print "called by",node.val
            print str(node.val) + ' '
            self._pre_order_traversal(node.left)
           
           # print "right child of",node.val,node.right.val
            self._pre_order_traversal(node.right)

"""
the graph is as below

            3
           / \
          0   4
         / \   \
       -1  2    8
          / \
         1   2.5
       
"""

tree = Tree()
tree.add(3)
tree.add(4)
tree.add(0)
tree.add(-1)
tree.add(8)
tree.add(2)
tree.add(2.5)
tree.add(1)
tree.printTree()
print tree.root.val
#print (tree.find(-1)).val
print "find 2",(tree.find(2)).parent.val
#print tree.minimum(4)
#print tree.maximum(0)
print "predecessor",tree.predecessor(-1)
tree.delete_node(4)
#tree.printTree()
tree.level_order_traversal()
tree.find(100)
tree.pre_order_traversal()
#print "successor",tree.successor(8)
       



Tuesday 5 April 2016

How do you run a Scala script from the Scala command line

You might be using the interactive mode of Scala to look at your data. But if you have written the steps in a .scala file, how do you execute it? You probably know by now that the .scala extension is not mandatory, but it is good practice to follow the convention. We can use the :load option to execute such a script from a Scala REPL.


This is what you do:

1. Write the Scala commands in a script, CountExample_shell.scala:

sc;
val pagecounts = sc.textFile("/home/training/pagecounts/");

// take the first 10 lines and print them
pagecounts.take(10).foreach(println);

pagecounts.count;

// filter only lines that have 'en' (english) for 2nd value of the array
val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache;

enPages.count;

//Create key value pairs in scala
val enTuples = enPages.map(line => line.split(" "));

val enKeyValuePairs = enTuples.map(line => (line(0).substring(0,8) , line(3).toInt));

enKeyValuePairs.reduceByKey(_+_, 1).collect;

enPages.map(l => l.split(" ")).map(l => (l(2), l(3).toInt)).reduceByKey(_+_ , 40).filter( x => x._2 > 200000).map (x => (x._2 , x._1)).collect.foreach(println);

2. Now, to execute the script, use

scala> :load /home/training/CountExample_shell.scala

3. The script will execute and display the lines below.

Loading /home/training/CountExample_shell.scala...
res24: String = /home/training/spark-1.6.0-bin-hadoop2.6/bin/spark-shell
res25: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1fa98a22
pagecounts: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:32
20090505-000000 aa Main_Page 2 9980
20090505-000000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 465
20090505-000000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 1 16086
20090505-000000 af.b Tuisblad 1 36236
20090505-000000 af.d Tuisblad 4 189738
20090505-000000 af.q Tuisblad 2 56143
20090505-000000 af Afrika 1 46833
20090505-000000 af Afrikaans 2 53577
20090505-000000 af Australi%C3%AB 1 132432
20090505-000000 af Barack_Obama 1 23368
res27: Long = 1398882                                                          
enPages: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at filter at <console>:34
res28: Long = 970545                                                          
enTuples: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:36
enKeyValuePairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:38
res29: Array[(String, Int)] = Array((20090507,6175726), (20090505,7076855))    
(468159,Special:Search)                                                        
(451126,Main_Page)
(1066734,404_error/)
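
Side note: depending on your Scala/Spark version, you can usually also pass the script when starting the shell, for example spark-shell -i /home/training/CountExample_shell.scala, but :load is convenient when a session is already running.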

Please post your queries on Scala and let us know your thoughts.

Saturday 12 March 2016

TPT export script - How to use the TPT export operator and understand data streams

TPT (Teradata Parallel Transporter) is a standard utility that can be used to export data from a Teradata system to a file.

The below diagram explains how TPT operates:

TPT Export in Teradata
Data is generated by the Export operator (the producer). Once posted onto the data stream, it is consumed by the consumer operator.

This is a widely used pattern. It is called Source-Sink in Spring-XD, and a similar architecture is called Publisher-Subscriber in Kafka (although Kafka is a message broker service).
The underlying idea remains the same.

Below, I have given a sample TPT export script for your reference. As you can see, once the operators are defined, the producer's output is applied to the consumer.

/*************************************+
* Script Name - export_data.tpt
* Usage - tbuild -f export_data.tpt
* Author - Tduser 
* Description - exports data from 
*               retail.item to 
*               export_tpt_item.txt file
* Version - Release 1 - 12-Mar-2016
**************************************/

Define Job export_to_file
description 'export records from item table'
(
define schema item_schema
(
item_id char(30),
country_code char(30)
);

/* Consumer operator since in an export job the final target is the file */

define operator file_writer
type dataconnector consumer
schema item_schema
attributes
(
varchar filename = 'export_tpt_item.txt',
varchar format = 'text',
varchar openmode = 'write',
varchar indicatormode='N'
);


/* Producer operator reads data from a teradata table */

define operator export_item
type export
schema item_schema
attributes
(
varchar username = 'dbc',
varchar userpassword = 'dbc',
varchar tdpid = '127.0.0.1',
integer maxsessions = 8,
varchar selectstmt = '
select cast(l_orderkey as char(30)),
cast(l_partkey as char(30))
from retail.item;');


/* apply the producer output to the consumer */

apply to operator(file_writer[1])
select item_id,country_code
from operator(export_item[1]);

);

Execution command: use a wrapper to call the script, as shown below:


$> tbuild -f export_data.tpt | tee export_data.tpt.log
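
If you want the wrapper itself, a minimal bash sketch (the file names follow the script header above; PIPESTATUS is used so the tbuild exit code is not hidden by the pipe to tee) could look like this:

#!/bin/bash
# minimal wrapper sketch: run the TPT job, keep a log, and fail loudly on error
tbuild -f export_data.tpt | tee export_data.tpt.log
if [ ${PIPESTATUS[0]} -ne 0 ]; then
    echo "TPT export failed - check export_data.tpt.log" >&2
    exit 1
fi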

Log entries will be as follows:

Teradata Parallel Transporter Version 15.00.00.02
Job log: /opt/teradata/client/15.00/tbuild/logs/root-38.out
Job id is root-38, running on TDExpress150001_Sles10
Teradata Parallel Transporter file_writer[1]: TPT19006 Version 15.00.00.02
Teradata Parallel Transporter SQL Selector Operator Version 15.00.00.02
export_item: private log not specified
file_writer[1]: TPT19010 Instance 1 directing private log report to 'dtacop-root-1447-1'.
file_writer[1]: TPT19007 DataConnector Consumer operator Instances: 1
file_writer[1]: TPT19003 ECI operator ID: 'file_writer-1447'
file_writer[1]: TPT19222 Operator instance 1 processing file 'export_tpt_item.txt'.
export_item: connecting sessions
export_item: sending SELECT request
export_item: retrieving data
export_item: Total Rows Exported:  60175
export_item: finished retrieving data
file_writer[1]: TPT19221 Total files processed: 1.
export_item: disconnecting sessions
export_item: Total processor time used = '0.07 Second(s)'
export_item: Start : Sat Mar 12 11:05:33 2016
export_item: End   : Sat Mar 12 11:05:38 2016
Job step MAIN_STEP completed successfully
Job root completed successfully
Job start: Sat Mar 12 11:05:27 2016
Job end:   Sat Mar 12 11:05:38 2016

Reach out to me if you have any doubts. 
Special Offer: register for the online 7-video course for just USD 15.