Sunday 9 January 2011

Extracting pure data from HTML using Python

I am writing a crawler in Python for a university project, and I am playing about with NLTK to build an inverted index over multiple pages. Once you have read a page's HTML into a string (url_contents below), a simple way to get the plain text out of it is:

import nltk
pure = nltk.clean_html(url_contents)
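
One caveat: in NLTK 3 and later, clean_html is no longer implemented and raises an error that points you to BeautifulSoup's get_text() instead. If you would rather not add a dependency, here is a minimal sketch of the same idea using only the standard library. The names TextExtractor and extract_text are my own and are not part of NLTK, and the output will not match clean_html exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a page, skipping script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # text fragments in document order
        self.skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Hypothetical stand-in for nltk.clean_html: returns the visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the fragments and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


url_contents = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>pure <b>data</b></p></body></html>"
)
pure = extract_text(url_contents)
print(pure)
```

Note that the fragments are joined with spaces, so a word split across inline tags comes out as two words. For a crawler feeding an index that is usually acceptable, but it is a simplification.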

To tokenize the text and store all the words in a list, use:

tokens = nltk.word_tokenize(pure)
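
Once each page is a list of tokens, the inverted index itself is just a mapping from each word to the set of pages that contain it. Here is a rough sketch of that step. To keep it runnable without NLTK, it uses a small regex tokenizer (simple_tokenize, my own helper) as a stand-in for nltk.word_tokenize, and the URLs and page texts are made-up examples.

```python
import re
from collections import defaultdict


def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: lowercase runs of letters,
    # digits and apostrophes. The real tokenizer is more careful.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_inverted_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in simple_tokenize(text):
            index[token].add(url)
    return index


# Hypothetical crawl results: URL -> plain text of the page.
pages = {
    "http://example.com/a": "Python crawler for the university project",
    "http://example.com/b": "An inverted index built with Python",
}
index = build_inverted_index(pages)
print(sorted(index["python"]))
print(sorted(index["crawler"]))
```

Using a set per word means a page is listed once no matter how often the word appears in it. If you need term frequencies for ranking later, a dict of counts per word would be the natural next step.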

From there, you can iterate through the tokens to tag each word and build a tree data structure.