Tuesday 23 December 2014

MySQL function to reject invalid email addresses from websites

Most data analysts face the challenge of accepting valid data and rejecting invalid data. Below is an email address validation function in MySQL.

The function returns 0 for invalid and 1 for valid emails.

Example:

DELIMITER $$
CREATE FUNCTION fn_email_test(column_value VARCHAR(200)) RETURNS TINYINT(1)
    NO SQL
    DETERMINISTIC
BEGIN
    DECLARE is_valid_email BOOLEAN;

    SET is_valid_email = 0;

    # Readable equivalent of the hex character codes:
    # \x41-\x5a (A-Z), \x61-\x7a (a-z), \x30-\x39 (0-9),
    # \x2e (.), \x2d (-), \x5f (_), \x25 (%) and \x40 (@)
    IF column_value NOT REGEXP '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,4}$' THEN
        SET is_valid_email = 0;
    ELSE
        SET is_valid_email = 1;
    END IF;

    RETURN is_valid_email;
END$$
DELIMITER ;
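A quick sanity check (the sample addresses are illustrative):

select fn_email_test('john.doe@example.com'); # 1 - valid format
select fn_email_test('not-an-email');         # 0 - no @ or domain part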

Please write your comments to help us improve. Like us on Google+ or Facebook.


Sunday 21 December 2014

Understanding implicit commits in MySQL when using ALTER TABLE or CREATE INDEX on a temporary table

As the MySQL documentation explains, an implicit commit happens when we run an ALTER TABLE or ADD INDEX statement inside a stored procedure.

This means that any DDL change is committed to the database immediately. So, if you run an ALTER TABLE ... ADD COLUMN, the column is added and committed implicitly; a ROLLBACK will not restore the old table structure. The new column or index stays committed.

Any DML not committed since the last successful commit will be committed into the database as well.

As per the MySQL documentation:

CREATE TABLE and DROP TABLE statements do not commit a transaction if the TEMPORARY keyword is used. (This does not apply to other operations on temporary tables such as ALTER TABLE and CREATE INDEX, which do cause a commit.) However, although no implicit commit occurs, neither can the statement be rolled back, which means that the use of such statements causes transactional atomicity to be violated. For example, if you use CREATE TEMPORARY TABLE and then roll back the transaction, the table remains in existence.

Example Procedure:

drop procedure if exists testproc_2;
DELIMITER $$
CREATE PROCEDURE testproc_2(user_enter varchar(500))
BEGIN

    DECLARE v_error BIGINT;
    DECLARE error_msg CHAR(100);

    # On any SQL error, report the message and attempt a rollback
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        SELECT error_msg;
        ROLLBACK;
    END;

    SET v_error = 0;

    START TRANSACTION;

    SELECT user_enter;
    INSERT INTO int_marks
    VALUES (user_enter);

    SELECT * FROM int_marks;

    DROP TEMPORARY TABLE IF EXISTS temp_int_marks;
    CREATE TEMPORARY TABLE temp_int_marks SELECT * FROM int_marks WHERE 1 = 2;
    # The ALTER TABLE below causes an implicit commit of the INSERT above
    ALTER TABLE temp_int_marks ADD INDEX tim_index (member_first_name(20));

    SELECT * FROM int_marks;
    SELECT 'Executing';

    SET error_msg = 'sql failure';
    # Deliberately fails (the temporary table already exists), firing the exit handler
    CREATE TEMPORARY TABLE temp_int_marks SELECT * FROM int_marks WHERE 1 = 2;

END$$
DELIMITER ;

Verifying the DDL:

show create table temp_int_marks;

CREATE TEMPORARY TABLE `temp_int_marks` (
  `member_first_name` varchar(50) COLLATE utf8_unicode_ci NOT NULL COMMENT 'Member First Name',
  KEY `tim_index` (`member_first_name`(20))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


Before the procedure, the data in the int_marks table:

[screenshot: int_marks rows before the call]

Call the procedure:

call testproc_2('none');

From within the procedure:

[screenshot: result sets returned during the procedure call]

The 'none' value passed to the procedure for insertion is shown. On any SQL failure, this should be rolled back and the data not committed to the table.

Outside the procedure:

[screenshot: int_marks rows after the call, including the newly inserted row]

So, the data has been committed to the database even though there was no explicit commit.
The index addition, being an implicit commit, is applied to the database, and it also commits any DML that preceded it.

We validated this by opening a new session; the new row has been added to the database.
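A quick check from the new session (using the table and value from the example above):

select * from int_marks where member_first_name = 'none';

The row is returned even though the procedure never issued an explicit COMMIT.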

Please let us know if this helps. Like us on Google Plus or Facebook.

Sunday 14 December 2014

RAID architecture of Teradata

In this post, we will discuss the space requirements of the RAID-1 and RAID-5 architectures.

RAID-1 follows the concept of mirroring, where the data on one disk is also replicated on another disk.
This provides an ideal fallback option, as the entire data set is available on the second disk. Downtime for any disk failure is minimal, and so is data loss.
The only drawback is space: a RAID-1 implementation needs double the raw capacity.



RAID-5 stores parity information for each stripe of data, distributed across the disks. So, out of, say, 4 disks, the equivalent of one disk is used for parity. But only 25% of the space is overhead, as against the 50% of RAID-1.
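For example, with four 1 TB disks, RAID-1 gives 4 x 1 TB / 2 = 2 TB of usable space (50% overhead), while RAID-5 gives (4 - 1) x 1 TB = 3 TB of usable space (25% overhead).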


Wednesday 10 December 2014

Downloading data from a website using the urllib2 library of Python

As many of you will be integrating web logs into your data warehouse or analytics platform, the urllib2 library of Python is here to help.

An easy-to-use library (should I say Pythonic?), it is the answer for data extraction from websites. There is also a json library, which we will discuss in upcoming posts.

The example code below sets the base address of the website (you need to identify this manually). The counter then increments the page number as you read each page, starting from the home page.

Please let us know your experiences and learning.

import urllib2

count = 100
limit = 105

# Truncate the output file before appending page contents
f1 = open("C:\\Users\\admin\\Documents\\Federal_1.txt", "wb")
f1.close()

for count in range(count, limit):
    static_add = "https://www.federalregister.gov/articles/text/raw_text/201/423/" + str(count) + ".txt"

    try:
        page = urllib2.urlopen(static_add)
        print "Successfully connected to page"
        html = page.read()
        f3 = open("C:\\Users\\admin\\Documents\\Federal_1.txt", "a")
        f3.write(static_add + '\n' + '\n')
        f3.write(html)
        f3.close()
    except urllib2.HTTPError, err:
        if err.code == 404:
            print "Page not found!"
        elif err.code == 403:
            print "Access denied!"
        else:
            print "Something happened! Error code", err.code
    except urllib2.URLError, err:
        print "Some other error happened:", err.reason

The files created will contain the raw text of the pages.
You can then tokenize the text and build a bag-of-words representation, as sketched below.
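A minimal bag-of-words count over the saved file might look like this (a sketch, assuming the output path used above):

import re
from collections import Counter

with open("C:\\Users\\admin\\Documents\\Federal_1.txt") as f:
    words = re.findall(r"[a-z]+", f.read().lower())

# The ten most frequent words across the downloaded pages
print Counter(words).most_common(10)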


Wednesday 12 November 2014

Problems transforming dates using the str_to_date() function - NULL returned for a valid date input

The str_to_date() function was returning NULL in MySQL for what looked like perfectly valid date values. Unlike Teradata, where the CAST function can also format dates, we were using str_to_date() to format our dates in MySQL. The cause turns out to be the format string rather than the function: in MySQL date formats, %m is the month specifier and %i is minutes, so '%H:%m:%s' asks the parser to read '30' as a month, and str_to_date(), which validates its input, returns NULL.
date_format() uses the format string for output rather than for parsing, so a wrong specifier prints the wrong field instead of failing with NULL.

A comparison of the values returned by both functions for the same valid date input is below:

Query: 
select str_to_date(trim('2013-09-10 02:30:01  '), '%Y-%m-%d %H:%m:%s'), date_format(trim('2013-09-10 02:30:01  '), '%Y-%m-%d %H:%m:%s');

str_to_date() returns NULL, while date_format() returns a value (with the month printed in place of the minutes).
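With the corrected minutes specifier, str_to_date() parses the value as expected:

select str_to_date(trim('2013-09-10 02:30:01  '), '%Y-%m-%d %H:%i:%s');
# returns 2013-09-10 02:30:01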

Please share your comments or suggestions with us.

Wednesday 15 October 2014

Accepting or rejecting international accent marks in MySQL and comparing them using regex

After fighting with this for the last couple of hours, I have nailed it at last: storing and filtering international accent marks in MySQL, one of the most common problems for any portal's underlying database.

So, my approach was to first pass the column value to the function and store it as a BLOB. Please note that if you store it as VARCHAR, the byte-range regular expressions below will not work.

Accent marks span multiple bytes; refer to the collation and character-set pages in the MySQL manual.

Example:

select char_length('À'); # 1 - the total number of characters
select length('À');      # 2 - the total number of bytes

Now, an example function that rejects any special characters and numbers, and accepts all alphabetic characters and international accent marks.

Example function:

drop function if exists fn_alphabetic_with_acc_hyphen;
DELIMITER $$
CREATE FUNCTION fn_alphabetic_with_acc_hyphen(column_value VARCHAR(200)) RETURNS TINYINT(1)
BEGIN
    DECLARE is_alpha_acc_hyphen BOOLEAN;

    SET is_alpha_acc_hyphen = 0;

    # Return 1 if the input contains no special characters or digits, else 0.
    # concat() with hex literals preserves the exact byte codes:
    # \x21-\x2C and \x2E-\x40 cover punctuation and the digits, \x5B-\x60 and
    # \x7B-\x7E cover the remaining ASCII symbols, and \xA1-\xBF covers
    # Latin-1 supplement punctuation. The hyphen \x2D is deliberately
    # left out of the rejected ranges.
    IF column_value NOT REGEXP concat('[', x'21', '-', x'2C', x'2E', '-', x'40', x'5B', '-', x'60', x'7B', '-', x'7E', x'A1', '-', x'BF', ']') THEN
        SET is_alpha_acc_hyphen = 1;   # 1 means true
    ELSE
        SET is_alpha_acc_hyphen = 0;   # 0 means false
    END IF;

    RETURN is_alpha_acc_hyphen;
END$$
DELIMITER ;

If we run the select below, it returns 1 for strings containing international accent marks.

select fn_alphabetic_with_acc_hyphen('A Àcb'); # results in 1 (TRUE) - international accent marks are accepted
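Conversely, strings containing digits or ASCII special characters are rejected (illustrative inputs):

select fn_alphabetic_with_acc_hyphen('Abc123'); # results in 0 - digits fall in the rejected \x2E-\x40 range
select fn_alphabetic_with_acc_hyphen('Abc!');   # results in 0 - '!' falls in the rejected \x21-\x2C range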

Please let us know your suggestion/advice. Like us if you were helped.


Friday 26 September 2014

User-defined functions in MySQL - checking the date format and leap years in a MySQL UDF

The function below can be used to check the date format (MMDDYYYY) of any input column. It also checks for a valid date: if the input is, for example, 02302014 (30th Feb 2014), it returns 0.

0 = False (Not a valid date)
1 = True (Is a valid date)

While compiling, please change DEFINER=`dataanalysis` to a valid user on your system. Otherwise, you will get a definer-not-present error.

The error message will be:
0 row(s) affected, 1 warning(s): 1449 The user specified as a definer ('dataanalysis'@'%') does not exist


Example code for Date validation:

DELIMITER $$
CREATE FUNCTION fn_str_date_chk(column_value VARCHAR(200)) RETURNS TINYINT(1)
BEGIN
    DECLARE is_valid BOOLEAN DEFAULT 0;

    IF char_length(column_value) <> 8 THEN
        SET is_valid = 0;
    ELSEIF date(str_to_date(column_value, '%m%d%Y')) IS NULL THEN
        # str_to_date() returns NULL for impossible dates such as 02302014
        SET is_valid = 0;
    ELSE
        SET is_valid = 1;
    END IF;

    RETURN is_valid;
END$$
DELIMITER ;

Usage:

select fn_str_date_chk(date_of_birth) from employee;

Result:

0 = for all invalid dates of birth
1 = for all valid dates of birth
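A couple of quick checks (illustrative values):

select fn_str_date_chk('02302014'); # 0 - 30th February does not exist
select fn_str_date_chk('02292012'); # 1 - 2012 is a leap year, so 29th February is valid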

Please like us if you appreciate our blog. And make sure to chip in with your suggestions.


Saturday 30 August 2014

Sample procedure in MySQL with the use of an OUT parameter

Starting with the development of a new operational data store, integrated with the portal.

The most handy feature is a MySQL procedure along with OUT parameters, used to pass values on to a function or the front end.

Example:

drop procedure if exists rec_cnt;


delimiter //
create procedure rec_cnt(out param1 int)
begin
select count(*) into param1 from employee;
end//
delimiter ;

call rec_cnt(@param1);

select @param1;

Results:

20097 (the row count of the employee table, returned in @param1)
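The OUT value can also be captured inside another stored program. A minimal sketch, using a hypothetical wrapper procedure use_rec_cnt:

delimiter //
create procedure use_rec_cnt()
begin
declare v_cnt int;
call rec_cnt(v_cnt); # the local variable receives the OUT value
select v_cnt;
end//
delimiter ;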

Please like us on Google if you like our posts!! And feel free to leave your comments/suggestion.

Thursday 24 July 2014

Using Group_Concat() function in MySQL - Handy function to extract all values of a column

The Group_Concat() function in MySQL saved the day for me this week, so here is a post thanking the authors. :)

I was loading some basic geographical data about users who logged in to a website. The data looked as follows:

Row_id  Language  Name
42914   ES        Reader1
42915   ES        Reader1
44623   EN        Reader1
44624   EN        Reader1
44625   EN        Reader1

The dataset is actually pretty heavy, with lots of other columns. My objective was to have a single row per Name, with all the languages the user uses in a delimited format.

For example, the result output should become:

Name     Languages
Reader1  ES | EN
Reader2  EN | AR

And the group_concat() function in MySQL was just what I was looking for.

Example query:

select
  Name
, group_concat(distinct Language separator ' | ') as Languages
from tweet_data
group by Name
;

The query groups the dataset by Name and creates a pipe-separated string of all the values in the Language column.
Please remember to use distinct. Otherwise, there will be one entry for each row of Name, and the result becomes a really long string.
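One caveat: group_concat() truncates its result at the group_concat_max_len system variable, which defaults to 1024 bytes. If a name can have many distinct values to concatenate, raise the limit for the session before running the query:

SET SESSION group_concat_max_len = 100000;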

Suggestions or comments are welcome as always !!


Sunday 20 July 2014

Random Forest implementation using Python Sklearn library

Python is a wonderful programming language. The more I explore it, the more I am amazed by its power.

Below is a recent Random Forest implementation using the pandas and sklearn libraries.

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a Random Forest implementation.
"""

import csv as csv
import numpy as np
import pandas as pd

print "import successful"

csv_object = csv.reader(open("C:\\train.csv"), delimiter=",")

# Skip the header row
header = csv_object.next()

# Read the raw rows into a numpy array. This illustrates plain csv reading;
# the pandas DataFrame read below is what the model actually uses.
data = []
count = 0
for row in csv_object:
    count += 1
    data.append(row)
data = np.array(data)

# Use pandas to handle missing values and feature engineering
df = pd.read_csv("C:\\train.csv", header=0)

# Extract the salutation from the Name column and encode it as an integer.
# Note that str.contains() treats '.' as a regex wildcard, so 'Mr.' also
# matches 'Mrs.'; patterns later in the list overwrite earlier matches.
Salutation = ['Rev.', 'Mrs.', 'Miss.', 'Master.', 'Mr.']

df['Salutation'] = 100   # 100 = no recognised salutation

for i in xrange(0, len(Salutation)):
    pattern = Salutation[i]
    print "pattern is", pattern
    df.loc[df['Name'].str.contains(pattern), 'Salutation'] = i

print df.head(2)

print df[(df.Salutation != 100)][['Name', 'Salutation']]


# Encode Sex as an integer Gender column
df['Gender'] = df.Sex.map({'female': 0, 'male': 1}).astype(int)

# Compute the median age for each (Gender, Pclass) cell to fill missing ages
median_ages = np.zeros((2, 3))

for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = df[(df.Gender == i) & (df.Pclass == j + 1)]['Age'].dropna().median()

df['AgeFill'] = df['Age']

for i in range(0, 2):
    for j in range(0, 3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j + 1), 'AgeFill'] = median_ages[i, j]

# Encode the port of embarkation as an integer
df['EmbarkedCode'] = df.Embarked.map({'C': 0, 'S': 1, 'Q': 2})

# Drop the non-numeric columns before converting the DataFrame to a numpy array
df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

# Default missing embarkation codes to 1 ('S'), the most common port
df.loc[df.EmbarkedCode.isnull(), 'EmbarkedCode'] = 1

# Put the Survived target first, followed by the feature columns
df = df[['Survived', 'PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Gender', 'AgeFill', 'EmbarkedCode', 'Salutation']]
print df.head(2)

train_data = df.values
print "Train data sample"



# Start with the Random Forest implementation
from sklearn.ensemble import RandomForestClassifier

# Initiate the random forest object with 50 trees
forest = RandomForestClassifier(n_estimators=50)

# Fit the model: column 0 is the Survived target, the rest are features
print "Fit the model"
forest = forest.fit(train_data[0::, 1::], train_data[0::, 0])

# Import the test data and apply the same transformations as the training set
testdata = pd.read_csv("C:\\test.csv", header=0)

# Extract the salutation from Name, as for the training data
Salutation = ['Rev.', 'Mrs.', 'Miss.', 'Master.', 'Mr.']

testdata['Salutation'] = 100

for i in xrange(0, len(Salutation)):
    pattern = Salutation[i]
    print "pattern is", pattern
    testdata.loc[testdata['Name'].str.contains(pattern), 'Salutation'] = i

print testdata.head(2)

print testdata[(testdata.Salutation != 100)][['Name', 'Salutation']]

# Median fare per class for passengers embarked at 'S'; the third-class
# median (8.05) is used below to fill the one missing Fare value
for i in range(1, 4):
    print "Class wise median Fare", i, testdata[(testdata.Pclass == i) & (testdata.Embarked == 'S')]['Fare'].dropna().median()

testdata['Gender'] = testdata.Sex.map({'female': 0, 'male': 1}).astype(int)

median_ages = np.zeros((2, 3))

for i in range(0, 2):
    for j in range(0, 3):
        median_ages[i, j] = testdata[(testdata.Gender == i) & (testdata.Pclass == j + 1)]['Age'].dropna().median()

testdata['AgeFill'] = testdata['Age']

for i in range(0, 2):
    for j in range(0, 3):
        testdata.loc[(testdata.Age.isnull()) & (testdata.Gender == i) & (testdata.Pclass == j + 1), 'AgeFill'] = median_ages[i, j]

testdata['EmbarkedCode'] = testdata.Embarked.map({'C': 0, 'S': 1, 'Q': 2})

testdata = testdata.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis=1)

testdata.loc[testdata.EmbarkedCode.isnull(), 'EmbarkedCode'] = 1
testdata.loc[testdata.Fare.isnull(), 'Fare'] = 8.05

# Align the test columns with the training feature order before predicting
testdata = testdata[['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Gender', 'AgeFill', 'EmbarkedCode', 'Salutation']]

test_data = testdata.values

# Predict survival for the test set
output = forest.predict(test_data)
print forest

# Pair each PassengerId (column 0 of test_data) with its prediction
Final_output = np.column_stack((test_data[:, 0], output))

np.savetxt("C:\\test_rf.csv", Final_output, delimiter=",")
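To get a rough accuracy estimate before generating the submission file, a quick cross-validation sketch (cross_val_score lives in sklearn.cross_validation in the older sklearn this code targets; newer versions moved it to sklearn.model_selection):

from sklearn.cross_validation import cross_val_score

scores = cross_val_score(forest, train_data[0::, 1::], train_data[0::, 0], cv=5)
print "CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2)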



Please let us know if you want to discuss any advanced analytics problem in more detail.

Monday 7 July 2014

Full outer join sample in MySQL - Replace a string in the dataset

Below is a sample query for doing a full outer join in MySQL:

The data is joined from the output_1, output_2, output_3 and output_4 tables based on the actor_link column.
We also use the Replace() function of MySQL to replace a string in the resulting dataset.
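MySQL has no FULL OUTER JOIN keyword, so the standard workaround is a LEFT OUTER JOIN unioned with a RIGHT OUTER JOIN. A minimal sketch of the pattern, with two hypothetical tables t1 and t2 keyed on k:

select t1.k, t1.val as val1, t2.val as val2
from t1 left outer join t2 on t1.k = t2.k
union
select t2.k, t1.val as val1, t2.val as val2
from t1 right outer join t2 on t1.k = t2.k;

The query below applies the same pattern across four tables.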

Example Query:

select Replace(coalesce(out1.actor_link, out2.actor_link, out3.actor_link, out4.actor_link), 'http://www.twitter.com/', '') as actor_handle,
       coalesce(cnt1, 0) + coalesce(cnt2, 0) + coalesce(cnt3, 0) + coalesce(cnt4, 0) as total_cnt
from
    (select actor_link, count(gen_key) as cnt1 from output_1 group by actor_link) out1
left outer join
    (select actor_link, count(gen_key) as cnt2 from output_2 group by actor_link) out2
    on out1.actor_link = out2.actor_link
left outer join
    (select actor_link, count(gen_key) as cnt3 from output_3 group by actor_link) out3
    on out1.actor_link = out3.actor_link
left outer join
    (select actor_link, count(gen_key) as cnt4 from output_4 group by actor_link) out4
    on out1.actor_link = out4.actor_link
UNION
select Replace(coalesce(out1.actor_link, out2.actor_link, out3.actor_link, out4.actor_link), 'http://www.twitter.com/', '') as actor_handle,
       coalesce(cnt1, 0) + coalesce(cnt2, 0) + coalesce(cnt3, 0) + coalesce(cnt4, 0) as total_cnt
from
    (select actor_link, count(gen_key) as cnt1 from output_1 group by actor_link) out1
right outer join
    (select actor_link, count(gen_key) as cnt2 from output_2 group by actor_link) out2
    on out1.actor_link = out2.actor_link
right outer join
    (select actor_link, count(gen_key) as cnt3 from output_3 group by actor_link) out3
    on out1.actor_link = out3.actor_link
right outer join
    (select actor_link, count(gen_key) as cnt4 from output_4 group by actor_link) out4
    on out1.actor_link = out4.actor_link
# the IS NULL test keeps the right-join-only rows, which the filter would otherwise drop
where out1.actor_link is null or out1.actor_link <> 'actor__link'
;


Please let us know if you need any help in your Analytics exploits.

Thursday 26 June 2014

Loading a text file into MySQL and joining it with a table - the line delimiter may cause an issue

A post after a long time :)

Today, we will learn how to load a data file or text file directly into a MySQL table.

Step 1: We create a table using MySQL Workbench

drop table if exists response_values;

create table response_values(
respondent_serial varchar(10),
response_ID decimal(16,0),
D13new varchar(255),
Load_Date varchar(1),
primary key (respondent_serial)
);

Step 2: We run the command below to load the file into the table

load data local infile 'C:\\Data files\\Raw_Data_2013.txt' into table response_values
fields terminated by '|'
lines terminated by '\n'
(respondent_serial, response_ID,D13new, Load_date);

Step 3: We now join the data from this file with another table using a LEFT OUTER JOIN. MySQL follows the standard SQL format.

Tips: As a best practice, don't keep the joining columns at the very end of the text file. Files produced on Windows end lines with \r\n, so with lines terminated by '\n' the last column silently picks up a trailing \r, and joins on it will fail. If having the key as the last column is unavoidable, either match the line terminator (see the variant below) or append an additional technical column as a delimiter.
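A variant of the load for Windows-style line endings (a sketch using the same file and table as above):

load data local infile 'C:\\Data files\\Raw_Data_2013.txt' into table response_values
fields terminated by '|'
lines terminated by '\r\n'
(respondent_serial, response_ID, D13new, Load_Date);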

Example:

select RV.*, TD.*
from response_values as RV left outer join TotFeed_Data as TD
on upper(trim(RV.D13new)) = upper(trim(TD.actor_link))
where RV.D13new is not null
;

The Upper() function in MySQL converts a string to upper case.

Feel free to send any questions. Like us if this helped you.