Wednesday, 10 December 2014

Downloading data from a website using Python's urllib2 library

If you are integrating web content into your data warehouse or analytics platform, Python's urllib2 library is here to help.

An easy-to-use library (should I say Pythonic?), it is a good answer for extracting data from websites. There is also a json library, which we will discuss in an upcoming post.
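As a small preview of that json library, here is a minimal sketch of parsing a JSON payload. The payload string below is purely illustrative; in practice it would be the body of a web response:

```python
import json

# Illustrative payload -- in a real script this string would come
# from reading an HTTP response body
raw = '{"articles": [{"title": "Notice 2014-423", "pages": 3}]}'

data = json.loads(raw)          # parse JSON text into Python objects
for article in data["articles"]:
    print(article["title"])
```

`json.loads` returns ordinary Python dictionaries and lists, so the parsed data can be walked with normal indexing and loops.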

The example code below sets the base address of the website (you need to identify this manually). A counter increments the page number as each page is read, starting from the first page of interest.

Please let us know your experiences and learning.

import urllib2

count = 100
limit = 105

# Open the output file in write mode once, to truncate any previous contents
f1 = open("C:\\Users\\admin\\Documents\\Federal_1.txt", "wb")
f1.close()

for count in range(count, limit):
    # Build the URL for the current page number
    static_add = "https://www.federalregister.gov/articles/text/raw_text/201/423/" + str(count) + ".txt"

    try:
        page = urllib2.urlopen(static_add)
        print "Successfully connected to page"
        html = page.read()
        # Append the URL and the raw page text to the output file
        f3 = open("C:\\Users\\admin\\Documents\\Federal_1.txt", "a")
        f3.write(static_add + '\n\n')
        f3.write(html)
        f3.close()
    except urllib2.HTTPError, err:
        if err.code == 404:
            print "Page not found!"
        elif err.code == 403:
            print "Access denied!"
        else:
            print "Something happened! Error code", err.code
    except urllib2.URLError, err:
        print "Some other error happened:", err.reason

The file created will contain the raw text of each downloaded page.
You can then apply a bag-of-words model to tokenize and count the words.
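A minimal bag-of-words sketch using only the standard library is shown below; the sample text stands in for the raw page text saved by the script above:

```python
import re
from collections import Counter

# Sample text standing in for the downloaded raw page text
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# Lowercase the text and extract alphabetic tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Counter builds the bag: each distinct word mapped to its frequency
bag = Counter(tokens)

print(bag.most_common(3))
```

From here the counts can feed directly into term-frequency features for downstream analytics.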

