I am creating a crawler in Python for a university project and experimenting with NLTK to build an inverted index over multiple pages. A simple way to extract the plain text from a page, once you have read its contents into a string, is:
pure = nltk.clean_html(url_contents)
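One caveat: `nltk.clean_html` was removed in NLTK 3.x (calling it raises `NotImplementedError` and points you at BeautifulSoup instead). If you want to avoid the extra dependency, a minimal stand-in using only the standard library's `html.parser` might look like this (a sketch, not NLTK's own implementation):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping tags
    and the contents of <script>/<style> elements."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def strip_html(url_contents):
    # Feed the raw HTML through the parser and normalize whitespace.
    parser = TextExtractor()
    parser.feed(url_contents)
    return " ".join(" ".join(parser.chunks).split())

pure = strip_html("<html><body><p>Hello <b>world</b></p></body></html>")
# pure -> "Hello world"
```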
If you want to tokenize that text and store all the words in a list, you can use:
tokens=nltk.word_tokenize(pure)
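To illustrate what tokenization produces without pulling in NLTK's `punkt` data, here is a rough regex-based stand-in for `nltk.word_tokenize` (it only approximates NLTK's behavior, which also handles contractions and other special cases):

```python
import re

def simple_tokenize(text):
    # Rough stand-in for nltk.word_tokenize: runs of word characters
    # become tokens, and each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The crawler indexes pages, one by one.")
# tokens -> ['The', 'crawler', 'indexes', 'pages', ',', 'one', 'by', 'one', '.']
```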
Finally, iterate through the tokens to tag each word (for example with nltk.pos_tag) and build a tree data structure from the tagged words.
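Since the goal is an inverted index over multiple pages, the tokens can then be folded into a mapping from each word to the set of pages that contain it. A minimal sketch (using a plain `split()` as a stand-in for `nltk.word_tokenize`, and hypothetical example URLs):

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: mapping of URL -> already-cleaned page text.
    Returns a dict mapping each token to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lowercase so lookups are case-insensitive; a real crawler
        # would tokenize with nltk.word_tokenize here instead.
        for token in text.lower().split():
            index[token].add(url)
    return index

index = build_inverted_index({
    "http://example.com/a": "python crawler index",
    "http://example.com/b": "python nltk tokens",
})
# index["python"]  -> both URLs
# index["crawler"] -> {"http://example.com/a"}
```

Using sets for the posting lists keeps insertion idempotent, so re-crawling a page does not produce duplicate entries.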