Wednesday, 21 October 2015

Minimum or Min function in Hive - Use of min() over(partition by ) in Hiveql

Say we have this products table in Hive:

+----------+-------+-------+
|   name   | price | notes |
+----------+-------+-------+
| product1 |   100 |       |
| product1 |   200 | note1 |
| product2 |    10 | note2 |
| product2 |     5 | note2 |
+----------+-------+-------+

and we expect to get this result (one row per product, at its minimum price):

+----------+-------+-------+
|   name   | price | notes |
+----------+-------+-------+
| product1 |   100 |       |
| product2 |     5 | note2 |
+----------+-------+-------+

How do we go about it:

1. The subquery approach:

select a.name, a.price, b.notes
from (select name, min(price) as price from products group by name) as a
inner join (select name, price, notes from products) as b
on a.name = b.name and a.price = b.price;

2. The alternative, and better, approach is to use window partitioning:

select name, price, notes
from (select *, min(price) over (partition by name) as min_price from products) as a
where a.price = a.min_price;

The second approach is better because it invokes a single map task to read the table, while the first approach invokes two map tasks, even on a single-node system. Note that both queries return every row that ties for the minimum price of a product.
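As a quick sanity check, the two queries above can be run side by side on the sample table. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption; it needs SQLite 3.25 or later for window functions), so it confirms that the results match but says nothing about map tasks.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Hive; only the query logic is checked.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price INTEGER, notes TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("product1", 100, ""),
        ("product1", 200, "note1"),
        ("product2", 10, "note2"),
        ("product2", 5, "note2"),
    ],
)

# Approach 1: aggregate per name, then join back to fetch the notes column.
join_query = """
SELECT a.name, a.price, b.notes
FROM (SELECT name, MIN(price) AS price FROM products GROUP BY name) AS a
INNER JOIN (SELECT name, price, notes FROM products) AS b
  ON a.name = b.name AND a.price = b.price
ORDER BY a.name
"""

# Approach 2: compute the per-name minimum with a window function, then filter.
window_query = """
SELECT name, price, notes
FROM (SELECT *, MIN(price) OVER (PARTITION BY name) AS min_price
      FROM products) AS a
WHERE a.price = a.min_price
ORDER BY name
"""

join_rows = conn.execute(join_query).fetchall()
window_rows = conn.execute(window_query).fetchall()
print(join_rows)
print(window_rows)
```

Both lists should contain the same two rows, one per product.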

Friday, 25 September 2015

Hortonworks HIVE metastore path - find the HDP Hive path to check for database.db files

Since the direction is towards the Open Data Platform, we are using Hortonworks Hadoop for our project.

The metastore path on the HDP box is slightly different.

Let's find the path using the commands below:

[root@sandbox /]# cd /etc/hive

[root@sandbox hive]# ls
2.3.0.0-2557 conf conf.install

[root@sandbox hive]# cd conf.install

Open hive-site.xml and search for the warehouse directory:

<property>
<name>hive.metastore.warehouse.dir</name>
<value>/apps/hive/warehouse</value>
</property>
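If you would rather not search the file by hand, the same lookup can be scripted. The sketch below is an illustration only: the helper name `get_property` and the inline sample are made up for this example, and a real hive-site.xml contains many more properties inside its <configuration> element.

```python
import xml.etree.ElementTree as ET

# Sample content mirroring the property shown above (not a full hive-site.xml).
SAMPLE_HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/hive/warehouse</value>
  </property>
</configuration>"""


def get_property(xml_text, wanted):
    """Return the <value> of the named property, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == wanted:
            return prop.findtext("value")
    return None


warehouse_dir = get_property(SAMPLE_HIVE_SITE, "hive.metastore.warehouse.dir")
print(warehouse_dir)
```

To use it on the sandbox, read the text of hive-site.xml from the conf.install directory and pass it in place of the sample string.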

Once we have that, the next step is to list the path using the `hadoop fs` command. Each database has a separate directory of the form <databasename>.db

[root@sandbox conf.install]# hadoop fs -ls /apps/hive/warehouse/
Found 5 items
drwxrwxrwx - root hdfs 0 2015-09-25 06:08 /apps/hive/warehouse/employees
drwxrwxrwx - root hdfs 0 2015-09-15 07:07 /apps/hive/warehouse/financials.db
drwxrwxrwx - hive hdfs 0 2015-08-20 09:05 /apps/hive/warehouse/sample_07
drwxrwxrwx - hive hdfs 0 2015-08-20 09:05 /apps/hive/warehouse/sample_08
drwxrwxrwx - hive hdfs 0 2015-08-20 08:58 /apps/hive/warehouse/xademo.db
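To make the naming rule concrete, here is a small sketch that splits such a listing into databases (the entries ending in .db) and everything else, which are the tables of the default database. The function name `split_warehouse_entries` is hypothetical, and the input is simply the listing above pasted in as text.

```python
LISTING = """\
Found 5 items
drwxrwxrwx - root hdfs 0 2015-09-25 06:08 /apps/hive/warehouse/employees
drwxrwxrwx - root hdfs 0 2015-09-15 07:07 /apps/hive/warehouse/financials.db
drwxrwxrwx - hive hdfs 0 2015-08-20 09:05 /apps/hive/warehouse/sample_07
drwxrwxrwx - hive hdfs 0 2015-08-20 09:05 /apps/hive/warehouse/sample_08
drwxrwxrwx - hive hdfs 0 2015-08-20 08:58 /apps/hive/warehouse/xademo.db
"""


def split_warehouse_entries(listing):
    """Split `hadoop fs -ls` output into (databases, default-database tables)."""
    databases, default_tables = [], []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the "Found N items" header.
        if not fields or not fields[-1].startswith("/"):
            continue
        entry = fields[-1].rsplit("/", 1)[-1]
        if entry.endswith(".db"):
            databases.append(entry[:-3])  # drop the ".db" suffix
        else:
            default_tables.append(entry)
    return databases, default_tables


databases, default_tables = split_warehouse_entries(LISTING)
print(databases)
print(default_tables)
```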

You can read the data for the employees table using the below command:

[root@sandbox conf.install]# hadoop fs -cat /apps/hive/warehouse/employees/employees.txt

The steps above, shown in an example screenshot:

[Image: hive metastore in hortonworks hadoop]

A quick-and-dirty way to find where a table is stored is to use the describe extended command and look for the location field:

hive> describe extended employees;
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>

Detailed Table Information Table(tableName:employees, dbName:default, owner:root, createTime:1443161279, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:salary, type:float, comment:null), FieldSchema(name:subordinates, type:array<string>, comment:null), FieldSchema(name:deductions, type:map<string,float>, comment:null), FieldSchema(name:address, type:struct<street:string,city:string,state:string,zip:int>, comment:null)], location:hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/employees, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{colelction.delim=, mapkey.delim=, serialization.format=, line.delim=
, field.delim=}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, transient_lastDdlTime=1443161299, COLUMN_STATS_ACCURATE=true, totalSize=185}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.668 seconds, Fetched: 8 row(s)
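The location is buried in a long line, so a short script can pull it out. This is a rough sketch that assumes the output format shown above; the function name `table_location` is made up for the example, and the sample string is a shortened copy of the real output.

```python
import re
from urllib.parse import urlparse

# Shortened copy of the "Detailed Table Information" line shown above.
SAMPLE_OUTPUT = (
    "Detailed Table Information Table(tableName:employees, dbName:default, "
    "owner:root, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, "
    "comment:null)], "
    "location:hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/employees, "
    "inputFormat:org.apache.hadoop.mapred.TextInputFormat, compressed:false), "
    "tableType:MANAGED_TABLE)"
)


def table_location(describe_output):
    """Return the value of the location: field, or None if it is not present."""
    match = re.search(r"\blocation:([^,\s)]+)", describe_output)
    return match.group(1) if match else None


location = table_location(SAMPLE_OUTPUT)
hdfs_path = urlparse(location).path  # path part only, without scheme and host
print(location)
print(hdfs_path)
```

The path part is what you would pass to `hadoop fs -ls`.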

Friday, 11 September 2015

Informatica Training in Bangalore - Classroom and Online in Marathalli Bangalore

Informatica Online Training Course

Data Warehouse Concepts:

Introduction to Data warehouse
What is Data warehouse and why we need Data warehouse
Dimensional modeling
Star schema/Snowflake schema/Galaxy schema
Dimensions / Facts tables.
Slowly Changing Dimensions and their types.
Data Staging Area
Different types of Dimensions and Facts.
Data Mart vs Data warehouse

Informatica Power Center 9:

Software Installation:

Informatica 9 Server/Client Installation on Windows/Unix.

Power Center Architecture and Components:

Introduction to Informatica Power Center
Difference Between Power Center and Power Mart
PowerCenter 9 architecture
PowerCenter 7 architecture vs Power Center 8 and 9 architecture
Extraction, Transformation and loading process

Power Center tools: Designer, Workflow Manager, Workflow Monitor, Repository Manager, Informatica Administration Console.

Repository Server and agent
Repository maintenance
Repository Server Administration Console
Security, Repository, privileges and folder permissions
Metadata extensions

Power Center Developer Topics:

Lab 1- Create a Folder.

How to provide Privileges
Source Object Definitions

Source types

- Relational Tables (Oracle, Teradata)
- Flat Files (fixed width, delimited files)
- XML Files
- COBOL Files
- Salesforce

Source properties

Lab 2- Analyze Source Data, Import Source.

Target Object Definitions

- Target types
- Target properties

Lab 3- Import Targets

Transformation Concepts
Transformation types and views
Transformation features and ports
Informatica functions and data types

Mappings

Mapping components
Source Qualifier transformation
Pre SQL and Post SQL
Mapping validation
Data flow rules

Lab 4 – Create a Mapping, session, and workflow

Workflows

Workflow Tools
Workflow Structure and configuration
Workflow Tasks
Workflow Design and properties

Session Tasks

Session Task properties
Session components
Transformation overrides
Session partitions

Workflow Monitoring

Workflow Monitor views
Monitoring a Server
Actions initiated from the workflow Monitor
Gantt Chart view and Task view.

Lab 6 – Start and Monitor a Workflow

Debugger

Debugger features
Debugger windows
Tips for using the Debugger

Lab 7 – The Debugger

Expression transformation

Expression, variable ports, storing previous record values.

Different types of Ports

Input/ output / Variable ports and Port Evaluation
Filter transformation
Filter properties

Lab 8- Expression and Filter

Aggregator transformation

Aggregation function and expressions
Aggregator properties
Using sorted data
Incremental Aggregation

Joiner transformation

Joiner types
Joiner conditions and properties
Joiner usage and Nested joins

Lab 9 – Aggregator, Heterogeneous join

Working with Flat files
Importing and editing flat file sources & Targets

Lab Session – Use Flat file as source.

Sorter transformation

Sorter properties
Sorter limitations

Lab 10 – Sorter

Propagate Attributes.
Shared Folder and Working with shortcuts.
Informatica built in functions.

Lookup transformation

Lookup principles
Lookup properties
Lookup techniques
Connected and unconnected lookup.
Lookup Caches

Lab 11 – Basic and Advanced Lookup

Target options

Row type indicators
Row loading operations
Constraint-based loading
Rejected row handling options

Lab 12 – Deleting Rows

Update Strategy transformation
Update strategy expressions

Lab 13 – Data Driven Inserts and Rejects

Router transformation

Using a router

Router groups

Lab 14 – Router

Conditional Lookup

Usage and techniques

Advantage

Functionality

Lab 15 – Straight Load
Lab 16 – Conditional Lookup

Heterogeneous Targets

Heterogeneous target types
Target type conversions and limitations

Lab 17 – Heterogeneous Targets

Mapplet

Functionality and Advantages
Mapplet types and structure
Mapplet limitations

Lab 18 – Mapplet

Reusable transformations
Advantages
Limitations
Promoting and copying transformations

Lab 19 – Reusable transformations

Sequence Generator transformation
Using a sequence Generator

Sequence Generator properties

Dynamic Lookup
Dynamic Lookup theory
Usage and functionality
Advantages

Lab 20 – Dynamic Lookup

Concurrent and sequential Workflows
Stopping, Starting and suspending tasks and workflows
- Concurrent Workflows
- Sequential Workflows

Lab 21 – Sequential Workflow

Additional Transformations
- Union Transformation
- Rank transformation
- Normalizer transformation
- Custom Transformation
- Transaction Control transformation
- XML Transformation
- SQL Transformation
- Stored Procedure Transformation
- External procedure Transformation

Lab Sessions - For above transformations

Error Handling
Overview of Error Handling Topics

Lab 22 – Error handling fatal and non Fatal

Workflow Tasks:
- Command
- Email
- Decision
- Timer
- Control
- Event Raise and Event Wait

Sequential Batch Processing
Parallel Batch Processing

Lab Sessions – With Workflow tasks
Link Conditions
Team Based Development

Version Control

Checking out and checking in objects.

Performance Tuning
Overview of System Environment
Identifying Bottlenecks.

Optimizing Source, Target, mapping, Transformation, session.

Mapping Parameters and Variables

Introduction to Mapping Variables and Parameters

Creating Mapping Variables and Updating Variables
Creating Parameter File and associating file to a Session
System Variables
Variables functions

Lab 26 – Override Mapping Variable with Parameter Files
Lab 27 – Dynamically Updating a Source Qualifier with Mapping Variable

Slowly Changing Dimensions Type 1, Type 2, Type 3
Incremental Loading

Lab 28– SCD 1, 2, 3

Reusable Workflow Tasks
Worklets
Worklet Limitations
Sessions
Reusable Sessions

Lab 29 – Create Worklet using Tasks

Command Line Interface
Overview of pmrep and its functions.

Informatica Migrations:

Copying Objects

Objects export and import (XML)
Deployment groups

Workflows Scheduling:
Using Informatica
Unix crontab, third-party tools.

Lab 30: Informatica Project- Case Study

Sales Data mart.
Loading Dimensions and Facts.

ETL Best Practices and methodologies
Review the Industry best practices in ETL Development
Review Real time project experiences of trainer
Discuss which of the techniques learned are useful in the real world
How to design effective ETL process
Important considerations in designing ETL process
Discuss real world production issues and support
Discuss various roles in ETL world
Business Analyst, System Analyst
System Architect
Technical Architect, ETL Lead
Stakeholders, Business users
Effective ways of using Data warehouse
Review various BI Reporting methods
- Q/A, interview preparation, placements
- Answer students' questions
- Tips for interview preparation
- How we can assist in placement and future growth
- Discuss other related technologies like Business Intelligence (BI)
- Advancing career options
