Thursday 24 July 2014

Using Group_Concat() function in MySQL - Handy function to extract all values of a column

The GROUP_CONCAT() function in MySQL saved the day for me this week, so here is a post thanking its authors. :)

I was loading some basic geographic data about users who had logged in to a website. The data looked as follows:

Row_id  Language  Name
42914   ES        Reader1
42915   ES        Reader1
44623   EN        Reader1
44624   EN        Reader1
44625   EN        Reader1

The dataset is actually pretty heavy, with lots of other columns. My objective was to get a single row per Name, listing all the languages that user uses in a delimited format.

For example, the result output should become:

Name     Languages
Reader1  ES | EN
Reader2  EN | AR

The GROUP_CONCAT() function in MySQL was just what I was looking for.

Example query:

select
  Name
, group_concat(distinct actor_languages separator ' | ') as Languages
from tweet_data
group by Name
;

The query groups the dataset by Name and builds a pipe-separated string of all the values in the language column for each group.
Remember to use DISTINCT; otherwise you get one entry for every row of that Name, and the string becomes very long. Also note that GROUP_CONCAT() truncates its result at the group_concat_max_len system variable (1024 characters by default), which you may need to raise for large groups.
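If you want to experiment with this pattern without a MySQL server, SQLite's group_concat() behaves similarly. Here is a small sketch using Python's built-in sqlite3 module; the table and column names below are made up for the illustration:

```python
import sqlite3

# In-memory database with a toy version of the login data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (row_id INTEGER, language TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?, ?)",
    [
        (42914, "ES", "Reader1"),
        (42915, "ES", "Reader1"),
        (44623, "EN", "Reader1"),
        (44624, "EN", "Reader1"),
        (44625, "EN", "Reader1"),
    ],
)

# One row per name, with a de-duplicated, pipe-separated language list.
# Note: SQLite's group_concat(DISTINCT ...) does not accept a custom
# separator, so we concatenate with the default comma and replace it.
rows = conn.execute(
    """
    SELECT name,
           replace(group_concat(DISTINCT language), ',', ' | ') AS languages
    FROM logins
    GROUP BY name
    """
).fetchall()
print(rows)
```

The same GROUP BY / GROUP CONCAT shape carries straight over to MySQL, where the SEPARATOR clause makes the replace() workaround unnecessary.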

Suggestions or comments are welcome as always !!


Sunday 20 July 2014

Random Forest implementation using Python Sklearn library

Python is a wonderful programming language. The more I explore, the more I am amazed by the power of this language.

Below is a recent Random Forest implementation using the pandas and scikit-learn libraries.

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is Random forest implementation.
"""


import numpy as np
import csv as csv
import scipy as sp
import pandas as pd
import matplotlib as mpl
import re


print "import successful"

csv_object = csv.reader(open("C:\\train.csv"), delimiter = ",")

# skipping the header
header = csv_object.next()

data = []
count = 0
for row in csv_object:
    count += 1
    data.append(row)
data = np.array(data)

#print data[0:15,5]
#print type(data[0::,5])

# Importing the pandas module to replace missing values

df = pd.read_csv("C:\\train.csv", header = 0)


# Extracting the salutation from Names

Salutation = ['Rev.', 'Mrs.', 'Miss.', 'Master.', 'Mr.']

# 100 marks rows whose name carries none of the salutations above
df['Salutation'] = 100

for i in xrange(0, len(Salutation)):
    pattern = Salutation[i]
    print "pattern is", pattern
    # Escape the pattern so the trailing '.' is matched literally;
    # otherwise the regex 'Mr.' would also match names containing 'Mrs'
    df.loc[df['Name'].str.contains(re.escape(pattern)), 'Salutation'] = i
print df.head(2)

print df[(df.Salutation != 100)][['Name', 'Salutation']]


df['Gender'] = df.Sex.map({'female': 0, 'male': 1}).astype(int)
#print df[(df.Gender == 1) & (df.Pclass == 1)]

# Replace the ages in the dataset

median_ages = np.zeros((2,3))

for i in range(0,2):
    for j in range(0,3):
        median_ages[i,j] = df[(df.Gender == i) & (df.Pclass == j+1)]['Age'].dropna().median()

#print median_ages

df['AgeFill'] = df['Age']
#print df[df.Age.isnull()][['Age', 'AgeFill', 'Pclass', 'Sex', 'Survived']]

for i in range(0,2):
    for j in range(0,3):
        df.loc[(df.Age.isnull()) & (df.Gender == i) & (df.Pclass == j+1), 'AgeFill'] = median_ages[i,j]

df['EmbarkedCode'] = df.Embarked.map({'C':0, 'S' : 1, 'Q':2})

#print df[df.Fare.isnull()][['Age', 'AgeFill', 'Pclass', 'Sex', 'Survived']]
#print df[df.Age.isnull()][['Age', 'AgeFill', 'Pclass', 'Sex', 'Survived', 'EmbarkedCode']]

# Convert the integer dataframe into a numpy array

df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis = 1)
#for i in range(0,3):
# print i, len(df[(df.EmbarkedCode == i)])

# Default missing embarkation codes to 1 ('S').
# Selecting the 'EmbarkedCode' column here is important; without it,
# df.loc[mask] = 1 would overwrite every column of the matching rows.
df.loc[df.EmbarkedCode.isnull(), 'EmbarkedCode'] = 1
#print df[df.EmbarkedCode.isnull()][['PassengerId']]
#print df.head(1)
#df = df.reindex(columns = pandas.Index(['Survived']).append(df.columns - ['Survived']))
df = df[['Survived', 'PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Gender', 'AgeFill', 'EmbarkedCode', 'Salutation']]
print df.head(2)
cols = df.columns.tolist()


train_data = df.values
print "Train data sample"
#print train_data[0::,1::]
#print train_data[0::,0]



# Start with Random Forest implementation

from sklearn.ensemble import RandomForestClassifier

# Initiate the random forest object

forest = RandomForestClassifier(n_estimators = 50)


# Fit the model
print "Fit the model"
forest = forest.fit(train_data[0::,1::],train_data[0::,0])

#Import the test data and run on the forest
testdata = pd.read_csv("C:\\test.csv",header = 0)

# Extracting the salutation from Name

Salutation = ['Rev.', 'Mrs.', 'Miss.', 'Master.', 'Mr.']
#Salutation = ['Rev.']

testdata['Salutation'] = 100


pattern = ''

for i in xrange(0, len(Salutation)):
    pattern = Salutation[i]
    print "pattern is", pattern
    # Escape so the trailing '.' is matched literally, as for the training set
    testdata.loc[testdata['Name'].str.contains(re.escape(pattern)), 'Salutation'] = i
print testdata.head(2)

print testdata[(testdata.Salutation != 100)][['Name', 'Salutation']]

for i in range(1,4):
    print "Class wise median Fare", i, testdata[(testdata.Pclass == i) & (testdata.Embarked == 'S')]['Fare'].dropna().median()

testdata['Gender'] = testdata.Sex.map({'female': 0, 'male': 1}).astype(int)

median_ages = np.zeros((2,3))

for i in range(0,2):
    for j in range(0,3):
        median_ages[i,j] = testdata[(testdata.Gender == i) & (testdata.Pclass == j+1)]['Age'].dropna().median()

#print median_ages

testdata['AgeFill'] = testdata['Age']
#print testdata[testdata.Age.isnull()][['Age', 'AgeFill', 'Pclass', 'Sex']]

for i in range(0,2):
    for j in range(0,3):
        testdata.loc[(testdata.Age.isnull()) & (testdata.Gender == i) & (testdata.Pclass == j+1), 'AgeFill'] = median_ages[i,j]

testdata['EmbarkedCode'] = testdata.Embarked.map({'C':0, 'S' : 1, 'Q':2})

testdata = testdata.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age'], axis = 1)


testdata.loc[(testdata.EmbarkedCode.isnull()), 'EmbarkedCode'] = 1
testdata.loc[(testdata.Fare.isnull()), 'Fare'] = 8.05

# Reorder the test columns to match the training feature order; otherwise
# the model would see Salutation, Gender etc. in different positions
testdata = testdata[['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fare', 'Gender', 'AgeFill', 'EmbarkedCode', 'Salutation']]
test_data = testdata.values

output = forest.predict(test_data)
print forest
#print output[12]
#print test_data[152,0]

# Build the (PassengerId, prediction) pairs for the submission file
Final_output = np.zeros((418,2))
Final_output[:,0] = test_data[:,0]   # PassengerId column
Final_output[:,1] = output           # predicted Survived values


np.savetxt("C:\\test_rf.csv",Final_output,delimiter = ",")
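The script above follows the standard scikit-learn fit/predict workflow. Here is a compact, self-contained sketch of that same workflow on synthetic data rather than the Titanic CSVs, so it runs without the files; the feature construction below is made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in for the Titanic features: 200 rows, 4 numeric columns,
# with a label that depends on the first column plus a little noise
X = rng.rand(200, 4)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

# Same estimator and settings as in the post
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X[:150], y[:150])           # train on the first 150 rows
predictions = forest.predict(X[150:])  # predict the held-out 50 rows

accuracy = (predictions == y[150:]).mean()
print("held-out accuracy:", accuracy)
```

Holding out part of the training data like this gives a quick sanity check on the model before generating the Kaggle submission file.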



Please let us know if you would like to discuss any advanced analytics problem.













Monday 7 July 2014

Full outer join sample in MySQL - Replace a string in the dataset

Below is a sample query for emulating a full outer join in MySQL, which does not support FULL OUTER JOIN natively:

The data is joined from the output_1, output_2, output_3 and output_4 tables on the actor_link column.
We also use MySQL's REPLACE() function to strip a string from the resultant dataset.

Example Query:

select Replace(coalesce(out1.actor_link, out2.actor_link, out3.actor_link, out4.actor_link), 'http://www.twitter.com/', '') as actor,
       coalesce(cnt1,0) + coalesce(cnt2,0) + coalesce(cnt3,0) + coalesce(cnt4,0) as total_cnt
from
(select actor_link, count(gen_key) as cnt1 from output_1 group by actor_link) out1
left outer join
(select actor_link, count(gen_key) as cnt2 from output_2 group by actor_link) out2
on out1.actor_link = out2.actor_link
left outer join
(select actor_link , count(gen_key) as cnt3 from output_3 group by actor_link) out3
on out1.actor_link = out3.actor_link
left outer join
(select actor_link , count(gen_key) as cnt4 from output_4 group by actor_link) out4
on out1.actor_link = out4.actor_link
UNION
select Replace(coalesce(out1.actor_link, out2.actor_link, out3.actor_link, out4.actor_link), 'http://www.twitter.com/', '') as actor,
       coalesce(cnt1,0) + coalesce(cnt2,0) + coalesce(cnt3,0) + coalesce(cnt4,0) as total_cnt
from
(select actor_link, count(gen_key) as cnt1 from output_1 group by actor_link) out1
right outer join
(select actor_link, count(gen_key) as cnt2 from output_2 group by actor_link) out2
on out1.actor_link = out2.actor_link
right outer join
(select actor_link , count(gen_key) as cnt3 from output_3 group by actor_link) out3
on out1.actor_link = out3.actor_link
right outer join
(select actor_link , count(gen_key) as cnt4 from output_4 group by actor_link) out4
on out1.actor_link = out4.actor_link
where out1.actor_link <> 'actor__link'
;
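The same LEFT JOIN / UNION idea can be tried out with Python's built-in sqlite3 module, since SQLite historically lacks FULL OUTER JOIN as well. SQLite has no RIGHT JOIN either, so the second branch of the UNION simply swaps the table order. Two toy tables stand in for output_1 and output_2 here, with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE output_1 (actor_link TEXT, gen_key INTEGER);
    CREATE TABLE output_2 (actor_link TEXT, gen_key INTEGER);
    INSERT INTO output_1 VALUES
        ('http://www.twitter.com/alice', 1),
        ('http://www.twitter.com/alice', 2),
        ('http://www.twitter.com/bob', 3);
    INSERT INTO output_2 VALUES
        ('http://www.twitter.com/alice', 4),
        ('http://www.twitter.com/carol', 5);
    """
)

# FULL OUTER JOIN emulated as (a LEFT JOIN b) UNION (b LEFT JOIN a).
# REPLACE() strips the URL prefix; COALESCE() defaults missing counts to 0.
query = """
SELECT REPLACE(COALESCE(o1.actor_link, o2.actor_link),
               'http://www.twitter.com/', '') AS actor,
       COALESCE(cnt1, 0) + COALESCE(cnt2, 0) AS total
FROM (SELECT actor_link, COUNT(gen_key) AS cnt1 FROM output_1 GROUP BY actor_link) o1
LEFT JOIN (SELECT actor_link, COUNT(gen_key) AS cnt2 FROM output_2 GROUP BY actor_link) o2
     ON o1.actor_link = o2.actor_link
UNION
SELECT REPLACE(COALESCE(o1.actor_link, o2.actor_link),
               'http://www.twitter.com/', '') AS actor,
       COALESCE(cnt1, 0) + COALESCE(cnt2, 0) AS total
FROM (SELECT actor_link, COUNT(gen_key) AS cnt2 FROM output_2 GROUP BY actor_link) o2
LEFT JOIN (SELECT actor_link, COUNT(gen_key) AS cnt1 FROM output_1 GROUP BY actor_link) o1
     ON o1.actor_link = o2.actor_link
"""
rows = dict(conn.execute(query).fetchall())
print(rows)
```

UNION (as opposed to UNION ALL) also de-duplicates the rows that appear in both branches, which is what makes this two-branch emulation behave like a single full outer join.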


Please let us know if you need any help in your Analytics exploits.