Working with Graph Data in Python for Data Science - dummies

Working with Graph Data in Python for Data Science

By John Paul Mueller, Luca Massaron

Most data scientists must work with graph data at some point. Python gives you that functionality. Imagine data points that are connected to other data points, such as how one web page is connected to another web page through hyperlinks. Each of these data points is a node. The nodes connect to each other using links.

Not every node links to every other node, so the node connections become important. By analyzing the nodes and their links, you can perform all sorts of interesting tasks in data science, such as defining the best way to get from work to your home using streets and highways.

Understanding the adjacency matrix

An adjacency matrix represents the connections between nodes of a graph. When there is a connection between one node and another, the matrix indicates it as a value greater than 0. The precise representation of connections in the matrix depends on whether the graph is directed (where the direction of the connection matters) or undirected.

A problem with many online examples is that the authors keep them simple for explanation purposes. However, real-world graphs are often immense and defy easy analysis simply through visualization. Just think about the number of nodes that even a small city would have when considering street intersections. Many other graphs are far larger, and simply looking at them will never reveal any interesting patterns. Data scientists call the problem in presenting any complex graph using an adjacency matrix a hairball.

One key to analyzing adjacency matrices is to sort them in specific ways. For example, you might choose to sort the data according to properties other than the actual connections. A graph of street connections might include the date the street was last paved with the data, making it possible for you to look for patterns that direct someone based on the streets that are in the best repair. In short, making the graph data useful becomes a matter of manipulating the organization of that data in specific ways.

Using NetworkX basics

Working with graphs could become difficult if you had to write all the code from scratch. Fortunately, the NetworkX package for Python makes it easy to create, manipulate, and study the structure, dynamics, and functions of complex networks (or graphs). You can use the package to work with digraphs and multigraphs as well.

The main emphasis of NetworkX is to avoid the whole issue of hairballs. The use of simple calls hides much of the complexity of working with graphs and adjacency matrices from view. The following example shows how to create a basic adjacency matrix from one of the NetworkX-supplied graphs:

import networkx as nx
G = nx.cycle_graph(10)
A = nx.adjacency_matrix(G)

The example begins by importing the required package. It then creates a graph using the cycle_graph() template. The graph contains ten nodes. Calling adjacency_matrix() creates the adjacency matrix from the graph. The final step is to print the output as a matrix, as shown here:

[[0 1 0 0 0 0 0 0 0 1]
 [1 0 1 0 0 0 0 0 0 0]
 [0 1 0 1 0 0 0 0 0 0]
 [0 0 1 0 1 0 0 0 0 0]
 [0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 1 0 1 0 0 0]
 [0 0 0 0 0 1 0 1 0 0]
 [0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 0 0 0 1 0 1]
 [1 0 0 0 0 0 0 0 1 0]]

You don’t have to build your own graph from scratch for testing purposes. The NetworkX site documents a number of standard graph types that you can use, all of which are available within IPython.

It’s interesting to see how the graph looks after you generate it. The following code displays the graph for you.

Plotting the original graph.
Plotting the original graph.
import matplotlib.pyplot as plt

The plot shows that you can add an edge between nodes 1 and 5. Here’s the code needed to perform this task using the add_edge() function.

Plotting the graph addition.
Plotting the graph addition.