7 May 2014

Martin Abrahams, Team: Web Development

Scraping HTML Content with .NET

In theory, parsing HTML content should be quite simple. It's just XML at the end of the day, right?

The most common approach to this problem is to read the content with a simple regular expression or an XML parser. For a very basic, controlled example this will probably work just fine. If you are looking for something robust that can handle anything a third party may throw at you, including the reality of invalid XHTML, then you will need to look at a specialised HTML parser.
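To illustrate why an XML parser falls short, here is a minimal sketch using only the .NET base class library. The class and method names (`XmlParserDemo`, `IsWellFormedXml`) are made up for this example. It feeds three fragments to `XDocument.Parse`: one that is valid XHTML, and two that are perfectly ordinary HTML but not well-formed XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XmlParserDemo
{
    // Hypothetical helper: returns true only if the markup is well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            XDocument.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // The XML parser rejects the whole document on the first error.
            return false;
        }
    }

    public static void Main()
    {
        // Strict XHTML: every element is closed, so this parses.
        Console.WriteLine(IsWellFormedXml("<p>Hello<br/>world</p>"));

        // Ordinary HTML: <br> is never closed, so the XML parser throws.
        Console.WriteLine(IsWellFormedXml("<p>Hello<br>world</p>"));

        // Ordinary HTML: &nbsp; is not an entity that plain XML knows about.
        Console.WriteLine(IsWellFormedXml("<p>Fish&nbsp;and chips</p>"));
    }
}
```

Browsers render all three fragments without complaint, which is why third-party content so often contains markup like the last two.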

I've heard many people mention HTML Agility Pack (HAP) over the years and have been looking for a chance to try it out. Last week I needed to extract only the text content from a block of HTML. The HTML comes from a variety of sources, so its structure could contain anything, but in this case we are only interested in the plain text.

I decided to try out HAP, even though it may seem like overkill. I was amazed at how easy it was to implement. It ships with the ability to disable strict adherence to XHTML and the ability to read only the text. I was able to do what I needed in 5 lines of code and rest easy knowing that support for all the extreme edge cases was there. It also has good community support and is available as a NuGet package, which is always a bonus.
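For readers who want a starting point, text extraction with HAP can look roughly like the sketch below. This is not necessarily the code used in the project described above. It assumes the `HtmlAgilityPack` NuGet package is installed, and the class and method names (`HtmlTextExtractor`, `ExtractText`) are made up for this example.

```csharp
using System;
using HtmlAgilityPack; // Assumption: the HtmlAgilityPack NuGet package is referenced.

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();

        // LoadHtml accepts markup that is not well-formed XML.
        doc.LoadHtml(html);

        // InnerText concatenates the text of all descendant nodes;
        // DeEntitize turns entities such as &amp; back into characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ExtractText("<p>Hello</p>"));
    }
}
```

Depending on the input, you may also want to remove `script` and `style` nodes before reading `InnerText`, and to normalise whitespace afterwards.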
