Martin Abrahams Team : Web Development

Scraping HTML Content with .NET


In theory, parsing HTML content should be quite simple. It's just XML at the end of the day, right?

The most common approach to this problem is to read the content with a simple regular expression or an XML parser. For a small, controlled example this will probably work just fine. If you need something robust that can handle anything a third party may throw at you, including the reality of invalid XHTML, then you will need a specialised HTML parser.
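To see why an XML parser falls short, here is a minimal sketch using only System.Xml from the base class library. The class and method names (StrictXmlDemo, IsWellFormedXml) are made up for illustration. The idea is that markup a browser would render happily, such as an unclosed br tag or an &nbsp; entity, is rejected outright by a strict XML parser.

```csharp
using System;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class StrictXmlDemo
{
    // Returns true if the markup loads as well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            var doc = new XmlDocument();
            doc.LoadXml(markup); // throws XmlException on malformed input
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling IsWellFormedXml with "<p>Hello<br/>world</p>" succeeds, while "<p>Hello<br>world</p>" and "<p>Fish &nbsp; chips</p>" both fail, even though they are perfectly ordinary HTML.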

I've heard many people mention HTML Agility Pack (HAP) over the years and have been looking for a chance to try it out. Last week I needed to extract only the text content from a block of HTML. The HTML comes from a variety of different sources, so the structure could contain anything, but here we are only interested in the plain text. I decided to try out HAP, even though it may seem like overkill for the job. I was amazed at how easy it was to implement. It ships with the ability to disable strict adherence to XHTML and the ability to read only the text. I was able to do what I needed in five lines of code and rest easy knowing that the support for all the extreme edge cases was there. It also has good community support and is available as a NuGet package, which is always a bonus.
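For anyone wanting to try the same thing, a minimal sketch of text extraction with HAP might look like the following. It is not the exact code from my project: the class and method names (HtmlTextExtractor, ExtractText) are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is installed. It leans on HAP's default tolerant parsing rather than setting any options explicitly.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, named for illustration only.
public static class HtmlTextExtractor
{
    // Returns the text content of an HTML fragment with all tags removed.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();

        // LoadHtml accepts markup that is not well-formed XML,
        // such as unclosed tags.
        doc.LoadHtml(html);

        // InnerText concatenates the text of all descendant nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Note that InnerText joins text nodes without adding whitespace between block elements, and it may include the contents of script or style elements, so real-world input may need some extra filtering or tidying afterwards.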