How to Use Python to Access Data from the Web

By John Paul Mueller, Luca Massaron

It is sometimes necessary to use data from the web for data science. And Python can help. It would be incredibly difficult (perhaps impossible) to find an organization today that doesn’t rely on some sort of web-based data.

Most organizations use web services of some type. A web service is a kind of web application that provides a means to ask questions and receive answers. Web services usually host a number of input types. In fact, a particular web service may host entire groups of query inputs.

Another type of query system is the microservice. Unlike a web service, a microservice has a specific focus and provides only one specific query input and output. Microservices essentially work like tiny web services.

One of the most beneficial data access techniques to know when working with web data is accessing XML. All sorts of content types rely on XML, even some web pages. Working with web services and microservices often means working with XML. With this in mind, this example works with XML data found in the XMLData.xml file. In this case, the file is simple and uses only a couple of levels, but XML is hierarchical and can nest many levels deep.

XML is a hierarchical format that can become quite complex.
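
The contents of XMLData.xml are not reproduced in the text, so the following is only a sketch of what such a file might look like. It assumes that each <Record> node holds <Number>, <String>, and <Boolean> child nodes (hypothetical tag names chosen to match the column names used later) and takes its values from the printed result at the end of this article. It uses the standard library ElementTree module, rather than lxml, to count the records and the nesting depth.

```python
import xml.etree.ElementTree as ET

# Assumed layout of XMLData.xml; the child tag names are hypothetical.
SAMPLE_XML = """<MyDataset>
  <Record>
    <Number>1</Number>
    <String>First</String>
    <Boolean>True</Boolean>
  </Record>
  <Record>
    <Number>2</Number>
    <String>Second</String>
    <Boolean>False</Boolean>
  </Record>
  <Record>
    <Number>3</Number>
    <String>Third</String>
    <Boolean>True</Boolean>
  </Record>
  <Record>
    <Number>4</Number>
    <String>Fourth</String>
    <Boolean>False</Boolean>
  </Record>
</MyDataset>"""


def depth(node):
    """Return the number of nesting levels at and below node."""
    return 1 + max((depth(child) for child in node), default=0)


root = ET.fromstring(SAMPLE_XML)
print(root.tag, len(root), depth(root))  # prints: MyDataset 4 3
```

Saving the string above as XMLData.xml would give a file shaped like the one this example expects, under the stated assumptions.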

The technique for working with XML, even simple XML, can be a bit harder than anything else you’ve worked with so far. Here’s the code for this example:

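from lxml import objectify
import pandas as pd

# Parse the file and get the root node (<MyDataset>).
xml = objectify.parse('XMLData.xml')
root = xml.getroot()

# Start with an empty DataFrame that has the right column names.
df = pd.DataFrame(columns=['Number', 'String', 'Boolean'])

for i in range(0, 4):
    # obj holds the child nodes of the current <Record> node.
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['Number', 'String', 'Boolean'],
                   [obj[0].text, obj[1].text, obj[2].text]))
    row_s = pd.Series(row)
    row_s.name = i
    # DataFrame.append() was removed in pandas 2.0, so add the row by label.
    df.loc[row_s.name] = row_s

print(df)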

The example begins by importing libraries and parsing the data file using the objectify.parse() method. Every XML document must contain a root node, which is <MyDataset> in this case. The root node encapsulates the rest of the content, and every node under it is a child. To do anything practical with the document, you must obtain access to the root node using the getroot() method.
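
To see the relationship between the root node and its children in isolation, here is a minimal sketch that uses the standard library ElementTree module instead of lxml (its parse() and getroot() calls work in a similar way). The inline XML string is a shortened, hypothetical stand-in for XMLData.xml.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the XMLData.xml file.
xml_text = (
    "<MyDataset>"
    "<Record><Number>1</Number></Record>"
    "<Record><Number>2</Number></Record>"
    "</MyDataset>"
)

# parse() accepts a file name or a file-like object.
tree = ET.parse(io.StringIO(xml_text))
root = tree.getroot()

# Iterating over an ElementTree node yields its direct children.
child_tags = [child.tag for child in root]
print(root.tag)    # prints: MyDataset
print(child_tags)  # prints: ['Record', 'Record']
```

Note that lxml.objectify nodes iterate differently from ElementTree nodes, which is one reason the main example asks for the children explicitly with getchildren().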

The next step is to create an empty DataFrame object that contains the correct column names for each record entry: Number, String, and Boolean. As with all other pandas data handling, XML data handling relies on a DataFrame. The for loop fills the DataFrame with the four records from the XML file (each in a <Record> node).

The process looks complex but follows a logical order. The obj variable contains all the children for one <Record> node. These children are loaded into a dictionary object in which the keys are Number, String, and Boolean to match the DataFrame columns.
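
The dictionary step can be tried on its own. This short sketch pairs the column names with the text of one record by using zip(); the values list is a hypothetical stand-in for the text of the three child nodes held in obj.

```python
columns = ['Number', 'String', 'Boolean']

# Hypothetical text values from the children of one <Record> node.
values = ['1', 'First', 'True']

# zip() pairs each column name with the value in the same position.
row = dict(zip(columns, values))
print(row)  # prints: {'Number': '1', 'String': 'First', 'Boolean': 'True'}
```

Every value stays a string at this point, because the text of an XML node is always read as text.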

There is now a dictionary object that contains the row data. The code next creates an actual row for the DataFrame by converting the dictionary into a Series. It sets the name of the row to the current for loop index, which becomes the row label, and then adds the row to the DataFrame. To see that everything worked as expected, the code prints the result, which looks like this:

  Number  String Boolean
0      1   First    True
1      2  Second   False
2      3   Third    True
3      4  Fourth   False
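
The row-building step described earlier can also be tried without any XML at all. The following sketch assumes a recent pandas release and uses two hypothetical records in place of the parsed file. It shows how the name assigned to each Series ends up as the row label in the DataFrame.

```python
import pandas as pd

columns = ['Number', 'String', 'Boolean']
df = pd.DataFrame(columns=columns)

# Hypothetical records standing in for the parsed <Record> nodes.
records = [['1', 'First', 'True'], ['2', 'Second', 'False']]

for i, values in enumerate(records):
    row_s = pd.Series(dict(zip(columns, values)))
    row_s.name = i                # the name becomes the row label
    df.loc[row_s.name] = row_s    # add the row under that label

print(df)
```

For larger files, collecting the dictionaries in a list and passing the list to pd.DataFrame() once is usually faster than adding rows one at a time.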