
Graph Processing In Hadoop

By Dirk deRoos

One of the more exciting emerging NoSQL technologies involves the storage and processing of graph data. You might think that this statement is old news, because computer scientists have been developing graph analysis techniques for decades. That may well be true, but what's new is that by using Hadoop, you can do graph analysis on a massive scale.

What is graph data?

A graph in data terms is simply a representation of individual entities and their relationships. A graph's entities are known as nodes (or vertices), and the relationships between entities in a graph are known as edges (or connections). Representing data sets in a graph, as opposed to traditional rows and columns, makes it much easier to process your data in ways that make the relationships between objects crystal-clear. Typical graph calculations include finding the shortest path between two nodes, or counting how many nodes have connections of a certain type to a specific node.
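
To make this concrete, here's a minimal sketch in plain Python (invented names, no Hadoop involved) of a graph stored as an adjacency list, with a breadth-first search that finds the shortest path between two nodes:

```python
from collections import deque

# A tiny social graph as an adjacency list: each node maps to its neighbors.
graph = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice", "Dave"],
    "Dave": ["Bob", "Carol", "Eve"],
    "Eve": ["Dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the fewest-hops path from start to goal."""
    seen = {start}
    queue = deque([[start]])  # each queue entry is a partial path
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path exists

path = shortest_path(graph, "Alice", "Eve")
```

Because breadth-first search explores the graph one hop at a time, the first path that reaches the goal is guaranteed to be a shortest one.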

Applications for graph analysis

The best-known application of graph processing is Google's PageRank algorithm, which analyzes the linking relationships between all known web pages. Google represents the web as a giant graph, where the web pages are nodes, and the links from one page to another are edges. (Google shared the wealth by publishing a paper describing its graph analysis system, named Pregel, back in 2010.) One of the graph calculations Google was interested in was the number of inbound connections for each web page.
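
As a rough illustration of the idea (a toy three-page link graph in plain Python, nothing like Google's actual code or scale), here's a PageRank-style power iteration: each page repeatedly splits its current rank among the pages it links to, so pages with many important inbound links accumulate higher scores:

```python
# Toy link graph: each page maps to the pages it links to (names invented).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
nodes = list(links)
damping = 0.85  # standard PageRank damping factor
rank = {n: 1.0 / len(nodes) for n in nodes}  # start with uniform ranks

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
    for src, targets in links.items():
        share = damping * rank[src] / len(targets)  # split rank among targets
        for dst in targets:
            new_rank[dst] += share
    rank = new_rank

# "C" has inbound links from both "A" and "B", so it ends up ranked highest.
best = max(rank, key=rank.get)
```

The ranks always sum to 1, so each score can be read as the probability that a random surfer ends up on that page.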

Facebook made a significant splash in 2013 when it announced that it was using Apache Giraph (based on the Pregel paper), a graph processing engine designed to process graphs stored in HDFS. Facebook demonstrated Giraph's power with a graph representing all of its users (over 1 billion) and their friendships (billions!), which add up to more than 1 trillion edges. This scale is staggering: if you're Facebook and you need to make calculations such as friend recommendations, what better tool to use than a graph processing engine? It's no surprise that distributed graph databases also lie at the core of other notable social media sites, including Twitter, LinkedIn, OkCupid, and Pinterest.

A graph processing engine can easily answer many practical questions for social media sites. Two examples: the degrees of separation that LinkedIn shows between you and another user is a shortest-path calculation (what's the closest connection between two nodes?), and the way OkCupid surfaces users with common interests is a collaborative filtering calculation (what are the most common connections to a specific set of nodes?).
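
A toy version of that second calculation, in plain Python with invented users and interests rather than a graph engine, might look like this: rank other users by how many connections they share with you.

```python
# Each user maps to the set of interests they are connected to (all invented).
follows = {
    "you": {"a", "b", "c"},
    "x": {"a", "b"},   # shares two interests with "you"
    "y": {"c"},        # shares one
    "z": {"d"},        # shares none
}

def most_similar(graph, user):
    """Order other users by the number of connections they share with `user`."""
    mine = graph[user]
    scored = [(len(mine & conns), other)
              for other, conns in graph.items() if other != user]
    # Highest overlap first; drop users with nothing in common.
    return [other for count, other in sorted(scored, reverse=True) if count > 0]

matches = most_similar(follows, "you")
```

Real recommendation systems weight and normalize these overlaps in far more sophisticated ways, but the core question is the same set-intersection over the graph.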

Graph analysis in Hadoop

As of Spring 2014, graph analysis on Hadoop remains in its early stages, but with the advent of YARN in Hadoop 2, graph analysis and other specialized processing techniques will become increasingly popular on Hadoop. Many of the social media sites mentioned in this article use their own proprietary graph databases and processing engines, but Facebook is a prominent user of Giraph. Because of Facebook's (implied) seal of approval, Giraph has become a popular choice for graph analysis on Hadoop, but it has its limitations: it's solely a processing engine, not a database, because it loads data as a graph into the cluster's memory, and it's optimized for batch-oriented queries.
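
Giraph follows Pregel's "think like a vertex" model: computation proceeds in supersteps, with each vertex processing its incoming messages, updating its own state, and sending messages to its neighbors until no messages remain. Here's a minimal single-machine sketch of that model in plain Python (no Giraph API involved, graph values invented) using the classic example of propagating the maximum value through a graph:

```python
# Toy vertex-centric computation in the Pregel style.
edges = {1: [2], 2: [1, 3], 3: [2]}   # undirected triangle-ish graph
value = {1: 3, 2: 6, 3: 1}            # each vertex starts with its own value

# Superstep 0: every vertex announces its value to its neighbors.
inbox = {v: [] for v in edges}
for v, neighbors in edges.items():
    for n in neighbors:
        inbox[n].append(value[v])

# Run supersteps until no vertex has messages left to process.
while any(inbox.values()):
    next_inbox = {v: [] for v in edges}
    for v, msgs in inbox.items():
        if msgs and max(msgs) > value[v]:
            value[v] = max(msgs)          # state changed...
            for n in edges[v]:
                next_inbox[n].append(value[v])  # ...so tell the neighbors
    inbox = next_inbox

# Every vertex now holds the global maximum value.
```

In real Giraph, each vertex's update logic runs in a compute method distributed across the cluster, and vertices that receive no messages "vote to halt," which is what the empty-inbox loop condition mimics here.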

Another graph processing solution comes from Aurelius, a company that has released a set of open source graph analysis tools for Hadoop. At the core of its offerings are Titan, a graph database that uses HBase as its persistence layer and is optimized for interactive queries, and Faunus, a graph processing engine that stores a snapshot of a Titan graph in HDFS and runs MapReduce jobs against it. For both the interactive (Titan) and batch (Faunus) tools, Aurelius provides a common graph-traversal language named Gremlin.

Finally, the Apache Spark project has a graph processing offshoot named GraphX, which enables you to both generate and process graph data, all within the Spark framework.