Easy XML Parsing in C#/.NET With the HTML Agility Pack

In 2006, the newspaper USA Today included the Internet in its list of the "New Seven Wonders of the World." The Internet has democratized access to vast stores of information in an unprecedented way and has transformed many aspects of education, industry, and society. This glut of information can give tremendous insight into everything from tracking trends in the price of chopsticks to predicting regional flu outbreaks across the globe - if you know how to efficiently access it.

I recently started a personal project to monitor the real estate market in Houston to gain insight into trends in pricing, neighborhood composition, construction, etc. One vital piece of the puzzle was reliably and efficiently gathering detailed information on specific properties to monitor over time. Using Zillow.com's Developer APIs and the HTML Agility Pack, I was able to spin up a prototype over coffee one Sunday morning, and I wanted to share my experience using this powerful C# library.

The Zillow API

The majority of modern websites use JSON as their data exchange format, since it is easily consumed by the widely used JavaScript language, but a significant number of sites (including Zillow.com) use XML instead. There are advantages and disadvantages to each, but both formats are straightforward to parse and are human-readable.

Zillow provides several useful APIs, and I chose GetDeepSearchResults because it encapsulates most of the data I want to track in my application. The documentation is refreshingly complete and useful: you can see details about each field, the XML schema, error codes, and an example usage. After obtaining my developer API key, I scratched out a few lines of code to load the results.

 // Requires: using System; using System.Diagnostics; using System.Net.Http;
 private string LoadResults(string property)
 {
     // Argument "property" is a URL-encoded string containing a single property address, city, and state
     string addr = "http://www.zillow.com/webservice/GetDeepSearchResults.htm";
     string key = "?[My_API_Key]&";
     string results = String.Empty;

     // Create a new HTTP client to communicate with the Zillow server
     HttpClient client = new HttpClient();
     client.BaseAddress = new Uri(addr);

     // Load the results of the query (.Result blocks the calling thread until the call completes)
     HttpResponseMessage response = client.GetAsync(key + property).Result;
     if (response.IsSuccessStatusCode)
     {
         // ReadAsStringAsync returns a Task<string>, so unwrap it before assigning
         results = response.Content.ReadAsStringAsync().Result;
     }
     else
     {
         Debug.WriteLine("Web call failed with code " + response.StatusCode);
     }
     return results;
 }

The LoadResults function takes in a well-formed URL-encoded property listing string that meets the requirements listed in the API documentation, and returns either an empty string (if the API call failed) or an XML-encoded string containing the property details.
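
One way to build the URL-encoded string that LoadResults expects is with Uri.EscapeDataString. This is a minimal sketch rather than code from my project: the helper name BuildPropertyQuery is hypothetical, and the parameter names "address" and "citystatezip" are assumptions that you should confirm against the API documentation.

```csharp
using System;

public static class ZillowQuery
{
    // Hypothetical helper: builds the "property" argument passed to LoadResults.
    // The parameter names "address" and "citystatezip" are assumptions here;
    // check them against the GetDeepSearchResults documentation.
    public static string BuildPropertyQuery(string streetAddress, string cityStateZip)
    {
        // Uri.EscapeDataString percent-encodes spaces, commas, '&', and other
        // reserved characters so they cannot break the query string.
        return "address=" + Uri.EscapeDataString(streetAddress)
             + "&citystatezip=" + Uri.EscapeDataString(cityStateZip);
    }
}
```

For example, ZillowQuery.BuildPropertyQuery("123 Main St", "Houston, TX") returns "address=123%20Main%20St&citystatezip=Houston%2C%20TX", which can be appended directly after the API key portion of the URL.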

Parsing with Agility

I could have spent some time brewing up a RegEx-powered parser for the XML results - after all, this is a well-documented API, and I know what to expect from each call. But why re-invent the wheel? It's also worth pointing out that not all APIs (or web pages, which you can also easily scrape with the HTML Agility Pack) are as consistent and well-formed, so your homebrewed parser might work for a while until it tries to handle a malformed response. 

Using the Agility Pack and the Zillow documentation, I can easily parse out the XML tags I care about without writing a single regex.

 // Requires: using System.Globalization; using System.Linq; using HtmlAgilityPack;
 private PropertyData ParseResults(string results)
 {
     // Create an instance of the PropertyData to store the results
     PropertyData data = new PropertyData();

     // Load the API result string
     HtmlDocument doc = new HtmlDocument();
     doc.LoadHtml(results);

     // Find the parameters that will be tracked
     // Note: FirstOrDefault() returns null when a tag is missing, so .InnerText will throw in that case
     var zpid = doc.DocumentNode.Descendants("zpid").FirstOrDefault().InnerText;
     var finSize = doc.DocumentNode.Descendants("finishedSqFt").FirstOrDefault().InnerText;
     var zestValue = doc.DocumentNode.Descendants("amount").FirstOrDefault().InnerText;
     var baths = doc.DocumentNode.Descendants("bathrooms").FirstOrDefault().InnerText;
     var beds = doc.DocumentNode.Descendants("bedrooms").FirstOrDefault().InnerText;

     // Parse these value strings into their expected data types
     // (the invariant culture keeps "2.5" from being misread on machines that use a decimal comma)
     data.Zpid = int.Parse(zpid, CultureInfo.InvariantCulture);
     data.FinishedSize = int.Parse(finSize, CultureInfo.InvariantCulture);
     data.ZestValue = double.Parse(zestValue, CultureInfo.InvariantCulture);
     data.Bathrooms = double.Parse(baths, CultureInfo.InvariantCulture);
     data.Bedrooms = double.Parse(beds, CultureInfo.InvariantCulture);

     return data;
 }

That's all there is to it! No bulky regular expressions or clunky homebrewed parsing code. For clarity, I stripped out a lot of input sanitization and error handling code, so don't forget to check for exception conditions!
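
To give a rough idea of what that error handling might look like, here is a minimal sketch. It is not the code I stripped out: the SafeParsing class and its method names are hypothetical, and it assumes the HtmlAgilityPack NuGet package is installed. Missing tags and unparseable values come back as null instead of throwing.

```csharp
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helpers for illustration only.
public static class SafeParsing
{
    // Returns the trimmed text of the first element with the given name,
    // or null if the document contains no such element.
    public static string FirstText(HtmlDocument doc, string elementName)
    {
        HtmlNode node = doc.DocumentNode.Descendants(elementName).FirstOrDefault();
        return node == null ? null : node.InnerText.Trim();
    }

    // Returns the element's value as an int, or null if it is missing or malformed.
    public static int? FirstInt(HtmlDocument doc, string elementName)
    {
        int value;
        bool ok = int.TryParse(FirstText(doc, elementName), NumberStyles.Integer,
                               CultureInfo.InvariantCulture, out value);
        return ok ? value : (int?)null;
    }

    // Returns the element's value as a double, or null if it is missing or malformed.
    public static double? FirstDouble(HtmlDocument doc, string elementName)
    {
        double value;
        bool ok = double.TryParse(FirstText(doc, elementName), NumberStyles.Float,
                                  CultureInfo.InvariantCulture, out value);
        return ok ? value : (double?)null;
    }
}
```

With helpers like these, the caller can decide per field whether a null means "skip this property" or "store a default", rather than letting one missing tag abort the whole parse.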

Other Applications

You may have an inkling that a parsing tool like the HTML Agility Pack can work for more than just parsing XML responses. You would be correct. Another key area where this library can come into play is data scraping (or web data extraction). If an API is not available or doesn't give you all of the information you need, you can load the web page and parse the whole thing for the data you need. 
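
As a small illustration of the scraping side, the sketch below pulls every link out of a chunk of HTML using the Agility Pack's XPath support. The LinkScraper class is a hypothetical name, the example assumes the HtmlAgilityPack NuGet package is installed, and it works on an HTML string you have already downloaded rather than making any web requests itself.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class LinkScraper
{
    // Returns the href value of every anchor tag that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();

        // By default SelectNodes returns null (not an empty collection)
        // when no nodes match the XPath expression.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The same pattern works for any element you can describe with XPath, such as table cells or elements with a particular class attribute.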

It's very important to note that web scraping may violate the Terms of Use for a site, and if your scraping tool misbehaves and loads the webpage at a high rate, you run the risk of compromising the site's resources and awakening the angry IT department kraken! Please be considerate, review the legality of your application, and scrape responsibly!
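
One simple way to keep a scraper from hammering a site is to enforce a minimum delay between requests. The sketch below is one possible approach, not a complete solution: the PoliteClient class is a hypothetical name, the interval is whatever you consider reasonable for the site in question, and the class is not thread-safe, so it assumes requests are made one at a time.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical wrapper for illustration only; not thread-safe.
public sealed class PoliteClient
{
    private readonly HttpClient _client = new HttpClient();
    private readonly TimeSpan _minInterval;
    private DateTime _lastRequestUtc = DateTime.MinValue;

    public PoliteClient(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // Computes how long to wait before the next request may be sent.
    public static TimeSpan DelayNeeded(DateTime lastRequestUtc, DateTime nowUtc, TimeSpan minInterval)
    {
        TimeSpan elapsed = nowUtc - lastRequestUtc;
        return elapsed >= minInterval ? TimeSpan.Zero : minInterval - elapsed;
    }

    // Waits out the remainder of the minimum interval, then fetches the page.
    public async Task<string> GetStringAsync(string url)
    {
        TimeSpan wait = DelayNeeded(_lastRequestUtc, DateTime.UtcNow, _minInterval);
        if (wait > TimeSpan.Zero)
        {
            await Task.Delay(wait);
        }
        _lastRequestUtc = DateTime.UtcNow;
        return await _client.GetStringAsync(url);
    }
}
```

A throttle like this does not make scraping permitted where the Terms of Use forbid it; it only keeps a well-intentioned tool from generating an unreasonable load.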

If you've used the HTML Agility Pack in an application or have questions about how I used it in mine, leave a comment below! Happy coding.

Learn more about DMC's web application development.
