Thursday, March 19, 2009

HTML Screen Scraping

For a small side project I found myself screen scraping a few HTML sites for information. My first idea was to fetch the pages with the URL class and then parse them with TagSoup (see this blog entry for an example). This in fact worked quite well, and using XPath from Scala was a blast.
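
To give an idea of what the URL plus TagSoup plus XPath combination can look like, here is a minimal sketch, not the code from the project. It assumes TagSoup is on the classpath; the object name TagSoupScrape, the helper names, and the example address are made up for illustration. TagSoup's namespace handling is switched off so that plain XPath expressions such as //a/@href match.

```scala
import java.net.URL
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMResult
import javax.xml.transform.sax.SAXSource
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.ccil.cowan.tagsoup.Parser
import org.w3c.dom.{Node, NodeList}
import org.xml.sax.InputSource

// Hypothetical helper object, named for this sketch only.
object TagSoupScrape {

  // Let TagSoup turn (possibly malformed) HTML into a W3C DOM tree.
  def parse(source: InputSource): Node = {
    val reader = new Parser
    // Without namespaces, XPath can address elements by their plain names.
    reader.setFeature(Parser.namespacesFeature, false)
    reader.setFeature(Parser.namespacePrefixesFeature, false)
    val result = new DOMResult
    // An identity transform copies the SAX events into the DOM result.
    TransformerFactory.newInstance.newTransformer
      .transform(new SAXSource(reader, source), result)
    result.getNode
  }

  // Fetch a page with the URL class and parse it.
  def fetch(address: String): Node = {
    val in = new URL(address).openStream()
    try parse(new InputSource(in)) finally in.close()
  }

  // Collect the href attribute of every anchor via XPath.
  def links(doc: Node): Seq[String] = {
    val xpath = XPathFactory.newInstance.newXPath
    val nodes = xpath
      .evaluate("//a/@href", doc, XPathConstants.NODESET)
      .asInstanceOf[NodeList]
    (0 until nodes.getLength).map(i => nodes.item(i).getNodeValue)
  }
}
```

Something like TagSoupScrape.links(TagSoupScrape.fetch("http://example.com/")) would then list the link targets of a page, with example.com standing in for the real site.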

Nevertheless, the scraping sometimes failed because, for some weird reason, the site I was scraping demanded a JavaScript-enabled browser (and submitting forms is no real fun with that approach). So I turned to HtmlUnit, which seems to be an even better screen scraping tool.
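
For comparison, a rough sketch of the same link extraction with HtmlUnit follows. It assumes an HtmlUnit 2.x jar on the classpath (newer 3.x releases moved the classes to the org.htmlunit package); the object and method names are again invented for this example. HtmlUnit's WebClient behaves like a headless browser and has JavaScript enabled by default, which is what the plain URL approach was missing.

```scala
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.HtmlPage

// Hypothetical helper object, named for this sketch only.
object HtmlUnitScrape {

  // Load a page in HtmlUnit's simulated browser and return all link targets.
  def linkTargets(address: String): Seq[String] = {
    val client = new WebClient // JavaScript support is on by default
    // Assumes the address serves HTML; other content types are not HtmlPages.
    val page = client.getPage[HtmlPage](address)
    val anchors = page.getAnchors // java.util.List of HtmlAnchor
    // Read the raw href attribute of each anchor, in document order.
    (0 until anchors.size).map(i => anchors.get(i).getHrefAttribute)
  }
}
```

The returned HtmlPage also offers form lookup and XPath queries, so filling in and submitting a form does not require building the request by hand. A long-running program should also release the client when done; the method for that differs between HtmlUnit versions, so it is left out here.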

Now what we really need is an HtmlUnit that gives us simple access to a TagSoup view of the content...

