Data Analysis Example: December 2014

Tuesday, 23 December 2014

MySQL function to reject invalid email address from websites

Most of us data analysts are facing the challenge of accepting valid data and rejecting invalid data. Please see the below email address validation function in MySql.

The function returns 0 for invalid and 1 for valid emails.

Example:

DELIMITER $$
CREATE FUNCTION fn_email_test(column_value Varchar(200)) RETURNS tinyint(1)
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE is_valid_email BOOLEAN;

SET is_valid_email = 0;

IF column_value NOT REGEXP '^[\x41-\x5a\x61-\x7a\x30-\x39\x2e\x2d\x5f\x25]+\x40[\x41-\x5a\x61-\x7a\x30-\x39\x2e\x2d]+\x2e[\x41-\x5a\x61-\x7a]{2,4}$';
then
SET is_valid_email = 0;
else
set is_valid_email = 1;
END IF;
RETURN is_valid_email;
END$$
DELIMITER ;

Please write your comments to help us improve. Like us on Google+ or Facebook.

Sunday, 21 December 2014

Understanding Implicit Commits in MySql on using Alter table Create Index on temporary table

As we understand from the MySql documents, implicit commits happen when we do a Alter Table or Add Index query within a stored procedure.

So, this means that any DDL changes will be committed to the Database. So, if you do a Alter Table Add Column, the column will be added implicitly. A Rollback will not change the table structure back to its old DDL. The new column or index will be committed.

Any DML not committed since the last successful commit will be committed into the Database as well.

As per the MySQL documentation:

CREATE TABLE and DROP TABLE statements do not commit a transaction if the TEMPORARY keyword is used. (This does not apply to other operations on temporary tables such as ALTER TABLE and CREATE INDEX, which do cause a commit.) However, although no implicit commit occurs, neither can the statement be rolled back, which means that the use of such statements causes transactional atomicity to be violated. For example, if you useCREATE TEMPORARY TABLE and then roll back the transaction, the table remains in existence.

Example Procedure:

drop procedure if exists testproc_2;
DELIMITER $$
CREATE PROCEDURE testproc_2(user_enter varchar(500))
BEGIN

DECLARE v_error bigint;
DECLARE v_file_type, error_msg char(100);

DECLARE EXIT HANDLER FOR sqlexception
BEGIN
SELECT error_msg;
ROLLBACK;
end;

set v_error = 0;

START TRANSACTION;

select user_enter;
insert into int_marks
values (user_enter);

select * from int_marks;

drop temporary table if exists temp_int_marks;
CREATE temporary table temp_int_marks select * from int_marks where 1=2;
ALTER TABLE temp_int_marks ADD INDEX tim_index (member_first_name(20));

select * from int_marks;
select 'Executing';

set error_msg = 'sql failure';
create temporary table temp_int_marks select * from int_marks where 1=2;

END$$
DELIMITER ;

Verifying the DDL:

show create table temp_int_marks;

CREATE TEMPORARY TABLE `temp_int_marks` ( `member_first_name` varchar(50) COLLATE utf8_unicode_ci NOT NULL COMMENT 'Member First Name', KEY `tim_index` (`member_first_name`(20)) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Before the procedure the data in the int_marks table:

Call the procedure:

call testproc_2('none');

From within the procedure:

The 'none' value passed to the procedure for insertion is shown. On, any sql failure, this should be rolled back and data not committed into the table.

Outside the procedure:

So, the data has been committed to the DB though there was no explicit commit.
The index addition, being an implicit commit is applied to the Database. This also commits any DML that preceded the commit.

We validated the same by opening a new session. The new row has been added to the Database.

Please let us know if this helps. Like us on Google Plus or Facebook.

Sunday, 14 December 2014

RAID architecture of Teradata

In this post, we will discuss the space requirements of RAID 1 and RAID 5 architecture.

RAID-1 follows the concept of mirroring where data in one disk is also replicated on another disk.
This provides as an ideal fallback option as the entire data is available in another disk. Down time for any disk failure is minimal and so is data loss.
The only drawback is space. RAID-1 implementation would need double the space.

RAID-5 architecture stores the information about each row in a parity bit. So, out of say 4 disks, 1 disk is used up for parity bit. But, only 25% space is used as against RAID-1.

Wednesday, 10 December 2014

Downloading data from a website using urllib2 library of python

As many of you would be integrating web logs into your data warehouse or analytics platforms, the urllib2 library of python is here to help.

Easy to use library (should i say pylike) , this is the answer for data extraction from the websites. There is a json library also, which we will discuss in the next posts.

The below example code, sets the base address of the website (this you need to manually identify). The counter increments the page numbers , as you read each page, starting from the home page.

Please let us know your experiences and learning.

@author: admin
"""
import urllib2
import re

count = 100
limit = 105

f1 = open("C:\\Users\\admin\\Documents\\Federal_1.txt","wb")
f1.close

for count in range(count,limit):
static_add = "https://www.federalregister.gov/articles/text/raw_text/201/423/" + str(count) + ".txt"

try:
page = urllib2.urlopen(static_add)
print "Successfully connected to page"
html = page.read()
f3 = open("C:\\Users\\admin\\Documents\\Federal_1.txt","a")
f3.write(static_add + '\n' + '\n')
f3.write(html)
f3.close
except urllib2.HTTPError, err:
if err.code == 404:
print "Page not found!"
elif err.code == 403:
print "Access denied!"
else:
print "Something happened! Error code", err.code
except urllib2.URLError, err:
print "Some other error happened:", err.reason

count +=1

The files created will contain the raw text of the website.
You can then implement a bagging algorithm to bag words etc.

Data Analysis Example

Pages