Data Analysis Example: January 2015

Thursday, 22 January 2015

Find_in_set feature of MySQL to find elements in an array or list of Values - Example and Diagrams

In an earlier post , we have seen how using group_concat() function we can group all the values of a column into comma separated array/list of values.

Please see the picture below:

Array elements distributed over 5 different rows and needs to be merged

On doing a group_concat(eligibility_file_staging_complete_id), we get the values in all the rows into a single row.

Example query:

select group_concat(eligibility_file_staging_complete_id)into @v_staging_id_list from temp_multiple_mc_complete;

select @v_staging_id_list;

Merge all columns into a single array using group_concat() function

Now, our requirement is to find each ID from another table and do some operation if the match is found.
We can use the FIND_IN_SET() function for this.

Example query:

select efsc.eligibility_file_staging_complete_id as reference_id, efsc.* from
eligibility_file_staging_complete as efsc
where find_in_set(efsc.eligibility_file_staging_complete_id , @v_staging_id_list)
and file_id = v_file_id;

The above query will pick only those rows that are in the array @v_staging_id_list.

Please like us on Facebook or LinkedIn if you find our blog helpful.

Viewpoint in Teradata - Sample screen to explain the new changes in Teradata monitor

The Teradata performance monitor is now Viewpoint. So, instead of being an application, it can be hosted and shared with developers.

Welcome to the new era of performance tuning, where we see live how the queries are performing in Teradata.
As we see in the diagram below, the skewness, CPU consumption, duration of the query, the PE status and all the other vital statistics are represented in one screen.

Thanks to #Teradata for coming up with this wonderful idea.

Teradata Viewpoint example. Vital stats about the query in one screen

Please let us know how we can help you in improving your Teradata journey. Like us on Google+ or Facebook by using the buttons below.

Wednesday, 21 January 2015

Example script to extract one item from a list of values in MySQL - Use Prepare and Execute - Dynamic SQL

If you have a list of ids in a string and then you want to extract each element from the array of elements, then you can use the below sample script to extract each item.

We are using a while loop and the substring_index function to extract each item.
We have passed the table name as the parameter to the procedure. The rows will be concatenated and then each element from the array is extracted.

The data in my table looks as follows: I need each ID field stored in the last column. The below stored will accomplish this.

The values in column are to be merged using group_concat() to form an array

Example script:

drop procedure if exists eligibilityfile_str_to_column;
DELIMITER $$
CREATE PROCEDURE eligibilityfile_str_to_column(v_tbl_name varchar(200))
begin

DECLARE v_data_extract, v_str_len, v_counter bigint;
DECLARE v_file_type, v_staging_id_list,v_stmnt varchar(500);

set v_data_extract = 0;
set v_str_len = 0;
set v_counter = 1;

select v_tbl_name into @a;
set @v_stmnt = concat('select group_concat(eligibility_file_staging_complete_id) into @v_staging_id_list from ', v_tbl_name);

PREPARE stmt2 from @v_stmnt;
select concat('select group_concat(eligibility_file_staging_complete_id) into @v_staging_id_list from ', v_tbl_name);

EXECUTE stmt2;

# find the total no of delimiters
select char_length(@v_staging_id_list) - char_length(replace(@v_staging_id_list,',','')) into v_str_len;

select concat('length is',v_str_len);

drop temporary table if exists temp_staging_list;
create temporary table temp_staging_list
(staging_id bigint(20) unsigned
);

select 'table created';

# loop to read till all the elements are not read
while v_counter <= v_str_len do
#select 'loop started';
select substring_index(substring_index(@v_staging_id_list,',',v_counter),',',-1) into v_data_extract;
#insert into temp_staging_list select v_data_extract;
select v_data_extract;
set v_counter = v_counter + 1;
end while;

end$$
DELIMITER ;

Insert only unqiue rows in MySQL-reject duplicate rows silently without error

If you use INSERT IGNORE, then the row won't actually be inserted if it results in a duplicate key. But the statement won't generate an error. It generates a warning instead. These cases include:

Inserting a duplicate key in columns with PRIMARY KEY or UNIQUE constraints.
Inserting a NULL into a column with a NOT NULL constraint.
Inserting a row to a partitioned table, but the values you insert don't map to a partition.
If you use REPLACE, MySQL actually does a DELETE followed by an INSERT internally, which has some unexpected side effects:
A new auto-increment ID is allocated.
Dependent rows with foreign keys may be deleted (if you use cascading foreign keys) or else prevent the REPLACE.
Triggers that fire on DELETE are executed unnecessarily.
Side effects are propagated to replication slaves too.

Both REPLACE and INSERT...ON DUPLICATE KEY UPDATE are non-standard, proprietary inventions specific to MySQL. ANSI SQL 2003 defines a MERGE statement that can solve the same need (and more), but MySQL does not support the MERGE statement.

Thursday, 15 January 2015

Date comparison confusion in MySQL - Read the example below

MySQL is pretty straight forward in the documentation. All valid dates should be in 'YYYY-MM-DD' format or 'yy-mm-dd' format.

But developers like us will continue making mistakes and hence learn the hard way.

As we see in the below example, how the format matters in date comparisons.

Example code:

SELECT

fn_str_date_chk(efer.eligibility_end_date)

,date_format(str_to_date(efer.eligibility_end_date,'%m%d%Y'),'%Y-%m-%d') as end_Date

,date_format(str_to_date('11012014','%m%d%Y'),'%Y-%m-%d') as file_date

, case when date_format(str_to_date(efer.eligibility_end_date,'%m%d%Y'),'%Y-%m-%d') < date_format(str_to_date('11012014','%m%d%Y'),'%Y-%m-%d') then 1 else 0 end as newresult

,date_format(str_to_date(efer.eligibility_end_date,'%m%d%Y'),'%m%d%Y') as end_Date

,date_format(str_to_date('11012014','%m%d%Y'),'%m%d%Y') as file_date

, case when date_format(str_to_date(efer.eligibility_end_date,'%m%d%Y'),'%m%d%Y') < date_format(str_to_date('11012014','%m%d%Y'),'%m%d%Y') then 1 else 0 end as oldresult

,efer.*

FROM eligibility_file_error_row as efer

WHERE (fn_str_date_chk(eligibility_end_date) = 1 AND fn_str_date_chk(eligibility_effective_date) = 1)

AND (date_format(str_to_date(eligibility_end_date,'%m%d%Y'),'%m%d%Y') < (select date_format(str_to_date(fn_header_date_retrieve(122228),'%Y-%m-%d'),'%m%d%Y'))

OR str_to_date(eligibility_end_date,'%m%d%Y') < str_to_date(eligibility_effective_date,'%m%d%Y'))

AND file_id = 122228;

The code marked in bold will give you incorrect results as the format is '%m%d%Y'.

Please see the results from the query below in newresult and oldresult column:

Invalid dates or dates with space conversion in MySQL

Tuesday, 13 January 2015

Removing leading hashtags from twitter data - Sample python script

In a recent study, we were trying to remove the leading hashtags from the twitter dumps.

The below python script proved handy. Sharing it with you all.
Please reuse and let us know if you have any comments.

Sample script:

import urlparse
import re
import csv

f1 = open("C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\Tweets_Corrected.csv","wb")
f1.close

csv_object = csv.reader(open("C:\\Users\\testuser\\Documents\\Data Sciences\\Social BI\\twitter.csv"), delimiter = "~");

data = []
count = 0
for row in csv_object:
count +=1
new_string = ''
print row[2]
var = row[2]
for i in var.split():
s, n, p, pa, q, f = urlparse.urlparse(i)
#print i
if s and n:
pass
elif i[:1] == '@':
pass
elif i[:2] == '"@' or i[:2] == '.@':
pass
elif i[:1] == '#':
new_string = new_string.strip() + ' ' + i[1:]
# removing the leading hashtags from twitter data
else:
new_string = new_string.strip() + ' ' + i

wr_str = ''
wr_str = row[0] + '~'+ row[1] + '~' + new_string + '\n'

print new_string

Like us on Google+ or Facebook if you like our posts.

Monday, 12 January 2015

Detecting and validating email address in weblogs/tweets - MySQL example function

We regularly dump data onto Mysql from different sources. And the hardest part is to cleanse and filter out the desirable inforamation.

One of the key uses of the below function in MySql is to extract valid email addresses and reject the invalid ones.

MySql Example function:

DELIMITER $$

CREATE FUNCTION fn_email_address_test(column_value Varchar(200)) RETURNS tinyint(1)

READS SQL DATA

DETERMINISTIC

BEGIN

DECLARE is_valid_email BOOLEAN;

SET is_valid_email = 0;

IF column_value NOT REGEXP '^[A-Za-z0-9._%-]+@[A-Za-z0-9]+\.[A-Za-z]+[A-Za-z]{2}$'

then

SET is_valid_email = 0;

else

set is_valid_email = 1;

END IF;

RETURN is_valid_email;

END$$

DELIMITER ;

The above function will successfully identify 98% email addresses.

Please let us know if the above post helped.

Give us your valuable feedback in the comment section.

Data Analysis Example

Pages