Data Analysis Example: python example

Tuesday, 5 April 2016

How do you run a Scala script from the Scala command line?

You might be using Scala's interactive mode to look at your data. But if you have written the steps in a .scala file, how do you execute them? The .scala extension is not mandatory, but it is good practice to follow the convention. You can use the :load command to execute such a script from the Scala REPL. The example here runs in spark-shell, which provides the SparkContext sc.

This is what you do:

1. Write the Scala commands in a script, CountExample_shell.scala:

sc;
val pagecounts = sc.textFile("/home/training/pagecounts/");

// take the first 10 lines and print them
pagecounts.take(10).foreach(println);

pagecounts.count;

// filter only lines that have 'en' (english) for 2nd value of the array
val enPages = pagecounts.filter(_.split(" ")(1) == "en").cache;

enPages.count;

//Create key value pairs in scala
val enTuples = enPages.map(line => line.split(" "));

val enKeyValuePairs = enTuples.map(line => (line(0).substring(0,8) , line(3).toInt));

enKeyValuePairs.reduceByKey(_+_, 1).collect;

enPages.map(l => l.split(" ")).map(l => (l(2), l(3).toInt)).reduceByKey(_+_ , 40).filter( x => x._2 > 200000).map (x => (x._2 , x._1)).collect.foreach(println);

2. Now, to execute the script, use :load at the REPL prompt:

scala> :load /home/training/CountExample_shell.scala

3. The script executes and displays the lines below.

Loading /home/training/CountExample_shell.scala...
res24: String = /home/training/spark-1.6.0-bin-hadoop2.6/bin/spark-shell
res25: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1fa98a22
pagecounts: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at textFile at <console>:32
20090505-000000 aa Main_Page 2 9980
20090505-000000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 465
20090505-000000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 1 16086
20090505-000000 af.b Tuisblad 1 36236
20090505-000000 af.d Tuisblad 4 189738
20090505-000000 af.q Tuisblad 2 56143
20090505-000000 af Afrika 1 46833
20090505-000000 af Afrikaans 2 53577
20090505-000000 af Australi%C3%AB 1 132432
20090505-000000 af Barack_Obama 1 23368
res27: Long = 1398882
enPages: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at filter at <console>:34
res28: Long = 970545
enTuples: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[14] at map at <console>:36
enKeyValuePairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[15] at map at <console>:38
res29: Array[(String, Int)] = Array((20090507,6175726), (20090505,7076855))
(468159,Special:Search)
(451126,Main_Page)
(1066734,404_error/)
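
If you want to check what the filter, map and reduceByKey steps compute without starting Spark, here is a rough plain-Python sketch of the same logic. The sample lines are made up (only the layout is copied from the output above), the function names are my own, and it runs on a small in-memory list rather than an RDD, so treat it as an illustration and not as a replacement for the script.

```python
from collections import defaultdict

# Made-up sample lines in the same layout as the output above:
# timestamp, project code, page title, count (the field the script sums), bytes
lines = [
    "20090505-000000 en Main_Page 5 1000",
    "20090505-010000 en Barack_Obama 3 2000",
    "20090507-000000 en Main_Page 4 1500",
    "20090505-000000 af Afrika 1 46833",
]

def requests_per_day(lines, project="en"):
    """Sum the fourth field per date for one project code."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split(" ")
        if fields[1] == project:                     # same test as the filter step
            totals[fields[0][:8]] += int(fields[3])  # key: yyyymmdd, value: count
    return dict(totals)

def requests_per_page(lines, project="en", threshold=0):
    """Sum the fourth field per page title and keep totals above threshold."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split(" ")
        if fields[1] == project:
            totals[fields[2]] += int(fields[3])
    # (count, title) pairs, mirroring the final map that swaps key and value
    return [(count, title) for title, count in totals.items() if count > threshold]

print(requests_per_day(lines))
print(requests_per_page(lines, threshold=4))
```

The first function corresponds to the enKeyValuePairs.reduceByKey step, and the second to the long one-liner at the end of the script, with the threshold of 200000 replaced by a parameter.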

Please post your queries on Scala and let us know your thoughts.

Friday, 16 August 2013

1st Step towards Text-mining - A Python solution to find combination variables in a Community detection problem

With so much text being generated every day, it is becoming very important to be able to read text logs.
Mostly, chats will have lines separated by a newline ('\n'). But we may want to choose a different separator as necessary.
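
As a quick illustration of the separator idea, here is a minimal sketch. The function name text_between and the sample strings are made up for this example, and stripping whitespace and dropping empty pieces are my own choices.

```python
def text_between(text, sep="\n"):
    """Return the non-empty pieces of text found between occurrences of sep."""
    return [piece.strip() for piece in text.split(sep) if piece.strip()]

# Hypothetical chat text, one message per line
chat = "Hi all\nSaturday Meetup at 5?\nSure, Downtown works\n"

print(text_between(chat))              # newline-separated chat lines
print(text_between("a | b | c", "|"))  # same function, different separator
```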

So, I wrote the two classes below. The first reads a file and extracts the text between separators. The second searches for that text in other files.

For example:

Suppose we search all the chat conversations for "Saturday Meetup", combined with the location "Downtown".
We then look for these two variables occurring together in other chats. Users whose chats contain both can be placed in the same community in a community model, so they form similar groups for targeted interaction (sending gift passes, coupons, etc.).
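
The grouping idea can be sketched in a few lines. This is only an illustration under my own assumptions: the chats are held in a dictionary keyed by user name, the user names and messages are invented, and matching is case-insensitive (the classes in this post match case-sensitively).

```python
def users_mentioning_all(chats, keywords):
    """Return, sorted, the users whose chat text contains every keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    return sorted(
        user for user, text in chats.items()
        if all(keyword in text.lower() for keyword in lowered)
    )

# Hypothetical chat logs keyed by user name
chats = {
    "asha": "Anyone up for the Saturday Meetup? I am Downtown anyway.",
    "ben": "Saturday Meetup sounds good.",
    "chen": "See you Downtown at the Saturday meetup!",
}

community = users_mentioning_all(chats, ["Saturday Meetup", "Downtown"])
print(community)
```

Users returned together would be candidates for the same community. A real model would need more than substring matching, but this shows the co-occurrence test.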

Sample Methods for the task:

#!/usr/bin/env python3

#----------------------------------------------------#
# Project : Python text-modification library         #
# Team: Data Sciences                                #
# Author :                                           #
# Date: 23 June, 2013                                #
# Description: Generic classes based on Python       #
# Each class has the usage and function              #
# defined in header section                          #
#----------------------------------------------------#

#------------------------------------------------------#
# Generic class to read all fields separated           #
# by a separator, for example: comma(,)                #
# Initiate the class using below syntax:               #
#   x = DataBetweenSeparator(Script1, Separator)       #
# Use the method using:                                #
#   x.split(Script1, Separator)                        #
#------------------------------------------------------#

class DataBetweenSeparator(object):

    def __init__(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        print("DataBetweenSeparator class initiated")

    def split(self, scriptname, sep):
        self.scriptname = scriptname
        self.sep = sep
        tmp_name = scriptname + "_tmp"

        print("The source file for parsing is", scriptname)
        print("The new file is", tmp_name)

        # Text mode ("w"), so that strings can be written on Python 3
        with open(scriptname, "r") as sourcefile, open(tmp_name, "w") as runfile:
            for block in sourcefile:
                fields = block.split(sep)
                # Write every field that is followed by a separator, one per
                # line, with spaces removed. The text after the last separator
                # on a line is not written.
                for field in fields[:-1]:
                    cleaned = (field + "\n").replace(' ', '')
                    runfile.write(cleaned)
                    print(cleaned)
        return 0

#----------------------------------------#
# Find the variables in the script       #
#----------------------------------------#

class FindKeyword(object):

    def __init__(self, variable, script):
        self.variable = variable
        self.script = script
        print("FindKeyword class initiated")

    def search(self, keyword, script):
        self.keyword = keyword
        self.script = script
        # Drop newlines so that a keyword read from a file can match in a line
        pattern = keyword.replace('\n', '')

        print("Pattern being searched", pattern)
        print("Filename is", script)

        # Append one "keyword|True" or "keyword|False" line to Report.txt
        with open(script) as sourcefile, open("Report.txt", "a") as reportfile:
            found = any(pattern in line for line in sourcefile)
            reportfile.write(pattern + "|" + str(found) + "\n")
        return found
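
Once the search has been run for several keywords, Report.txt holds lines of the form keyword|True or keyword|False. Here is a small sketch of how such a report could be read back into a dictionary. The function name read_report is my own, the demo writes hypothetical report lines to a temporary file instead of the real Report.txt, and treating a keyword as found when any of its lines says True is an assumption.

```python
import os
import tempfile

def read_report(path):
    """Parse 'keyword|True' / 'keyword|False' lines into a dict of keyword -> bool."""
    results = {}
    with open(path) as report:
        for line in report:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last '|' so that keywords containing '|' survive
            keyword, _, flag = line.rpartition("|")
            # A keyword counts as found if any of its lines says True
            results[keyword] = results.get(keyword, False) or flag == "True"
    return results

# Demo with a throwaway file holding hypothetical report lines
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("SaturdayMeetup|True\nDowntown|False\nDowntown|True\n")
demo_path = tmp.name

demo_results = read_report(demo_path)
os.remove(demo_path)
print(demo_results)
```

Chats where every keyword maps to True are the ones that would place a user in the community.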