Using Python to Work with HTML Pages in Data Science

By John Paul Mueller, Luca Massaron

HTML pages often contain information that data scientists need, and Python is a good tool for retrieving it. HTML pages store data in a hierarchical format. You usually find the content either as pure HTML or as XML.

The HTML form can present problems because it doesn’t always follow strict formatting rules. XML does follow strict formatting rules because of the standards used to define it, which makes it easier to parse. However, in both cases, you use similar techniques to parse a page. The first section that follows describes how to parse HTML pages in general.
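To see the difference in strictness, the following sketch feeds the same sloppy markup to a lenient HTML parser and to a strict XML parser. It uses only the Python 3 standard library (html.parser and xml.etree.ElementTree) rather than any particular third-party package, and the markup string and the TagCollector class are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Sloppy markup: the <p> and <b> elements are never closed.
markup = "<html><body><p>First<p>Second <b>bold</body></html>"


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The HTML parser tolerates the missing end tags.
collector = TagCollector()
collector.feed(markup)
collector.close()
print(collector.tags)

# The XML parser rejects the same text because it is not well formed.
try:
    ET.fromstring(markup)
    xml_ok = True
except ET.ParseError as err:
    xml_ok = False
    print("XML parse failed:", err)
```

The HTML parser reports every tag it meets, while the XML parser stops at the first mismatched tag.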

Sometimes you don’t need all the data on a page. Instead, you need specific data, which is where XPath comes into play. You can use XPath to locate specific data on the HTML page and extract only what you need.
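As a minimal sketch of that idea, the following code pulls only the prices out of a small page. The page fragment and its class names are invented, and the code assumes the markup is well formed so that the standard library's xml.etree.ElementTree can parse it. ElementTree supports only a subset of XPath, but that subset is enough here.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page fragment used only for illustration.
page = """
<html>
  <body>
    <h1>Price list</h1>
    <table>
      <tr><td class="item">Widget</td><td class="price">4.50</td></tr>
      <tr><td class="item">Gadget</td><td class="price">12.00</td></tr>
    </table>
    <p>Footer text you do not need.</p>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# ".//td[@class='price']" selects every <td> whose class attribute is
# "price", at any depth below the root, and skips the rest of the page.
prices = [float(td.text) for td in root.findall(".//td[@class='price']")]
items = [td.text for td in root.findall(".//td[@class='item']")]

print(dict(zip(items, prices)))
```

The heading and footer never enter the result because the path expressions match only the table cells of interest.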

Parsing XML and HTML

Simply extracting data from an XML file may not be enough, because the data may not be in the correct format. If you read every element as text, you end up with a DataFrame containing three columns of type str, and you can’t perform much data manipulation with strings. The following example shapes the XML data to create a new DataFrame containing just the <Number> and <Boolean> elements in the correct format.

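# Written for Python 2 and an early pandas release: distutils,
# DataFrame.append(), and .ix are no longer available in current versions.
from lxml import objectify
import pandas as pd
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()
df = pd.DataFrame(columns=('Number', 'Boolean'))

# Process the four records in the file.
for i in range(0, 4):
    # Child elements in order: <Number>, <String>, <Boolean>.
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'Boolean'],
                   [obj[0].pyval,
                    bool(util.strtobool(obj[2].text))]))
    row_s = pd.Series(row)
    row_s.name = obj[1].text  # the <String> value becomes the row label
    df = df.append(row_s)

print(type(df.ix['First']['Number']))
print(type(df.ix['First']['Boolean']))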

To obtain a numeric value from the <Number> element, you use the pyval output rather than the text output. The value stored in the DataFrame isn’t an int, but it is numeric.

The conversion of the <Boolean> element is a little harder. You must convert the string to a numeric value using the strtobool() function in distutils.util. The output is a 0 for False values and a 1 for True values. However, that’s still not a Boolean value. To create a Boolean value, you must convert the 0 or 1 using bool().
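Because distutils was removed from the standard library in Python 3.12, the two-step conversion can be sketched with a small stand-in. The helper name str_to_int_flag is hypothetical. It mirrors the spellings that strtobool() accepted and, as an added assumption, strips surrounding whitespace.

```python
# Hypothetical stand-in for distutils.util.strtobool(). It accepts the same
# spellings and also strips surrounding whitespace (an added assumption).
TRUE_STRINGS = {"y", "yes", "t", "true", "on", "1"}
FALSE_STRINGS = {"n", "no", "f", "false", "off", "0"}


def str_to_int_flag(text):
    """Return 1 for a true-like string and 0 for a false-like string."""
    value = text.strip().lower()
    if value in TRUE_STRINGS:
        return 1
    if value in FALSE_STRINGS:
        return 0
    raise ValueError("invalid truth value: %r" % (text,))


# Step 1: string to 0 or 1.
flag = str_to_int_flag("True")
print(flag, type(flag))

# Step 2: 0 or 1 to a real Boolean.
as_bool = bool(flag)
print(as_bool, type(as_bool))

# Calling bool() directly on the text does not work, because any
# nonempty string is truthy.
print(bool("False"))
```

The last line shows why the intermediate numeric step matters: bool("False") evaluates to True.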

This example also shows how to access individual values in the DataFrame. Notice that the name property now uses the <String> element value for easy access. You provide a row label using ix and then access the individual feature using a second index. Here’s the output from this example:

<type 'numpy.float64'>
<type 'bool'>
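The listing above targets Python 2 and an early pandas release. The following is a rough Python 3 sketch of the same shaping step, not a drop-in replacement. It swaps lxml.objectify for the standard library's xml.etree.ElementTree, builds the DataFrame in one step, and uses .loc for lookups. The inline XML is an assumed layout of XMLData.xml, and the root element name is a guess.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Assumed layout of XMLData.xml; the root element name is a guess.
xml_text = """<MyDataset>
  <Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>
  <Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>
  <Record><Number>3</Number><String>Third</String><Boolean>True</Boolean></Record>
  <Record><Number>4</Number><String>Fourth</String><Boolean>False</Boolean></Record>
</MyDataset>"""

root = ET.fromstring(xml_text)

labels = []
rows = []
for record in root.findall("Record"):
    labels.append(record.findtext("String"))
    rows.append({
        # ElementTree exposes only text, so convert each value explicitly.
        "Number": int(record.findtext("Number")),
        # Simplification: the text "true" (any case) becomes True and
        # everything else becomes False.
        "Boolean": record.findtext("Boolean").strip().lower() == "true",
    })

# Build the frame in one step instead of appending row by row.
df = pd.DataFrame(rows, index=labels, columns=["Number", "Boolean"])

print(df.dtypes)
# .loc[row_label, column_label] replaces the chained .ix lookup.
print(type(df.loc["First", "Number"]))
print(type(df.loc["First", "Boolean"]))
```

Because every row is supplied at once, pandas can infer a column type from the converted values, so the Number column in this sketch holds integers rather than floats.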

Using XPath for data extraction

Using XPath to extract data from your dataset can greatly reduce the complexity of your code and potentially make it faster as well. The following example shows an XPath version of the example above. Notice that this version is shorter and doesn’t require the use of a for loop.

# Written for Python 2 and an early pandas release: distutils and .ix
# are no longer available in current versions.
from lxml import objectify
import pandas as pd
from distutils import util

xml = objectify.parse(open('XMLData.xml'))
root = xml.getroot()

# Pair each record number with its Boolean value.
data = zip(map(int, root.xpath('Record/Number')),
           map(bool, map(util.strtobool,
                         map(str, root.xpath('Record/Boolean')))))

# The <String> values become the row labels.
df = pd.DataFrame(data,
                  columns=('Number', 'Boolean'),
                  index=map(str, root.xpath('Record/String')))

print(df)
print(type(df.ix['First']['Number']))
print(type(df.ix['First']['Boolean']))

The example begins just like the previous example, by importing the data and obtaining the root node. At this point, the example creates a data object that contains pairs of record numbers and Boolean values. Because the XML file entries are all strings, you must use the map() function to convert the strings to the appropriate values.

Working with the record number is straightforward: all you do is map it to an int. The xpath() function accepts a path from the root node to the data you need, which is Record/Number in this case.
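The following sketch shows how a path such as Record/Number is resolved relative to the root node. It uses the findall() method of the standard library's xml.etree.ElementTree, which accepts a subset of XPath, in place of lxml's xpath(). The inline XML is an assumed layout of the data file, and the root element name is a guess.

```python
import xml.etree.ElementTree as ET

# Assumed layout of the data file; the root element name is a guess.
xml_text = """<MyDataset>
  <Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>
  <Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>
  <Record><Number>3</Number><String>Third</String><Boolean>True</Boolean></Record>
  <Record><Number>4</Number><String>Fourth</String><Boolean>False</Boolean></Record>
</MyDataset>"""

root = ET.fromstring(xml_text)

# "Record/Number" walks from the root to each <Record>, then to its <Number>.
numbers = list(map(int, (node.text for node in root.findall("Record/Number"))))
print(numbers)

# <Number> is not a direct child of the root, so this path matches nothing.
direct = root.findall("Number")
print(len(direct))

# ".//Number" searches every level below the root and finds the same nodes.
anywhere = root.findall(".//Number")
print(len(anywhere))
```

Each step in the path names one level of the hierarchy, so leaving out the Record step returns an empty result rather than an error.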

Mapping the Boolean value is a little more difficult. You must use the util.strtobool() function to convert the string Boolean values to a number that bool() can convert to a Boolean equivalent. However, if you try to perform just a double mapping, you’ll encounter an error, because the objects that xpath() returns lack the lower() string method that strtobool() relies on. To overcome this obstacle, you perform a triple mapping and convert the data to a string using the str() function first.

Creating the DataFrame is different, too. Instead of adding individual rows, you add all the rows at one time by using data. Setting up the column names is the same as before. However, now you need some way of adding the row names, as in the previous example. This task is accomplished by setting the index parameter to a mapped version of the xpath() output for the Record/String path. Here’s the output you can expect:

        Number Boolean
First        1    True
Second       2   False
Third        3    True
Fourth       4   False
<type 'numpy.int64'>
<type 'numpy.bool_'>
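In Python 3, map() and zip() return lazy iterators rather than lists, so a port of this approach needs to materialize them before handing the data to pandas. The following is a hedged sketch of that port. It uses the standard library's xml.etree.ElementTree (whose findall() accepts a subset of XPath) in place of lxml, the texts() helper is hypothetical, and the inline XML is an assumed layout of XMLData.xml with a guessed root element name.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Assumed layout of XMLData.xml; the root element name is a guess.
xml_text = """<MyDataset>
  <Record><Number>1</Number><String>First</String><Boolean>True</Boolean></Record>
  <Record><Number>2</Number><String>Second</String><Boolean>False</Boolean></Record>
  <Record><Number>3</Number><String>Third</String><Boolean>True</Boolean></Record>
  <Record><Number>4</Number><String>Fourth</String><Boolean>False</Boolean></Record>
</MyDataset>"""

root = ET.fromstring(xml_text)


def texts(path):
    """Hypothetical helper: text of every element matched by the path."""
    return [node.text for node in root.findall(path)]


# Both conversions are lazy until zip() consumes them below.
numbers = map(int, texts("Record/Number"))
# Simplification: the text "true" (any case) becomes True, all else False.
booleans = (text.strip().lower() == "true" for text in texts("Record/Boolean"))

# list() materializes the (Number, Boolean) pairs for pandas.
data = list(zip(numbers, booleans))

df = pd.DataFrame(data,
                  columns=["Number", "Boolean"],
                  index=texts("Record/String"))
print(df)
```

As in the original listing, all rows are added at one time, and the Record/String values supply the row labels.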