Log Data Analysis with Hadoop
Log analysis is a common use case for an inaugural Hadoop project. Indeed, the earliest uses of Hadoop were for the large-scale analysis of clickstream logs — logs that record data about the web pages that people visit and in which order they visit them.
All the log data generated by your IT infrastructure is often referred to as data exhaust. A log is a by-product of a functioning server, much like smoke coming from a working engine’s exhaust pipe. Data exhaust has the connotation of pollution or waste, and many enterprises undoubtedly approach this kind of data with that thought in mind.
Log data often grows quickly, and because of the high volumes produced, it can be tedious to analyze. And the potential value of this data is often unclear. So the temptation in IT departments is to store this log data for as little time as reasonably possible. (After all, it costs money to retain data, and if there’s no perceived business value, why store it?)
But Hadoop changes the math: Storing data in Hadoop is comparatively inexpensive, and Hadoop was originally developed especially for the large-scale batch processing of log data.
The log data analysis use case is a useful place to start your Hadoop journey because the chances are good that the data you work with is being deleted, or dropped to the floor. Some companies that consistently record a terabyte (TB) or more of customer web activity per week discard the data with no analysis (which makes you wonder why they bothered to collect it).
The data in this use case is also likely easy to get, and working with it generally doesn’t raise the same issues you’d encounter if you started your Hadoop journey with other (governed) data.
When industry analysts discuss the rapidly increasing volumes of data that exist (4.1 exabytes as of 2014 — more than 4 million 1TB hard drives), log data accounts for much of this growth. And no wonder: Almost every aspect of life now results in the generation of data. A smartphone can generate hundreds of log entries per day for an active user, tracking not only voice, text, and data transfer but also geolocation data.
Most households now have smart meters that log their electricity use. Newer cars have thousands of sensors that record aspects of their condition and use. Every click and mouse movement you make while browsing the Internet causes a cascade of log entries to be generated.
Every time you buy something — even without using a credit card or debit card — systems record the activity in databases and in logs. Some of the more common sources of log data are IT servers, web clickstreams, sensors, and transaction systems.
Every industry (as well as every log type just described) has huge potential for valuable analysis, especially when you can zero in on a specific kind of activity and then correlate your findings with another data set to provide context.
As an example, consider this typical web-based browsing and buying experience:
1. You surf the site, looking for items to buy.
2. You click to read descriptions of a product that catches your eye.
3. Eventually, you add an item to your shopping cart and proceed to the checkout (the buying action).
4. After seeing the cost of shipping, however, you decide that the item isn’t worth the price and you close the browser window.

Every click you’ve made — and then stopped making — has the potential to offer valuable insight to the company behind this e-commerce site.
In this example, assume that this business collects clickstream data (data about every mouse click and page view that a visitor makes) with the aim of understanding how to better serve its customers. One common challenge for e-commerce businesses is recognizing the key factors behind abandoned shopping carts. When you perform deeper analysis on the clickstream data and examine user behavior on the site, patterns are bound to emerge.
Does your company know the answer to the seemingly simple question, Are certain products abandoned more than others? Or to the question, How much revenue could be recaptured if you decreased cart abandonment by 10 percent? Reports that answer questions like these are the kind you can show to your business leaders to seek their investment in your Hadoop cause.
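To make the second question concrete, here is a back-of-envelope sketch in Python. Every figure in it is an illustrative assumption, not data from this use case; plug in your own numbers from the clickstream analysis.

```python
# Back-of-envelope estimate of revenue recaptured by cutting cart
# abandonment by 10 percent. All inputs below are assumed examples.
abandoned_carts_per_month = 40_000   # assumption: monthly abandoned carts
average_cart_value = 85.0            # assumption: average cart value, dollars
recapture_rate = 0.10                # the 10 percent reduction in question

recaptured_revenue = (abandoned_carts_per_month
                      * average_cart_value
                      * recapture_rate)
print(f"${recaptured_revenue:,.0f} per month")  # prints: $340,000 per month
```

Even with rough inputs, a calculation like this turns the abandonment question into a dollar figure that business leaders can weigh against the cost of the Hadoop project.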
To generate the data behind such reports, you isolate the web browsing sessions of individual users (a process known as sessionization), identify the contents of their shopping carts, and then establish the state of the transaction at the end of the session — all by examining the clickstream data.
Following is an example of how to assemble users’ web browsing sessions by grouping all clicks and URL addresses by IP address.
In a Hadoop context, you’re always working with keys and values — each phase of MapReduce inputs and outputs data in sets of keys and values. The key is the IP address, and the value consists of the timestamp and the URL. During the map phase, user sessions are assembled in parallel for all file blocks of the clickstream data set that’s stored in your Hadoop cluster.
The map phase returns these elements:
The final page that’s visited
A list of items in the shopping cart
The state of the transaction for each user session (indexed by the IP address key)
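As a rough sketch of the map side, the following Python function parses one clickstream line and emits the IP-keyed record described above, in the tab-separated style used by Hadoop Streaming jobs. The log format, field order, and function names are assumptions for illustration, not a definitive implementation:

```python
import sys

def map_click(line):
    """Parse one clickstream log line into (IP, (timestamp, URL)).

    Assumes a hypothetical space-delimited log format:
        <ip> <timestamp> <url> [other fields...]
    Returns None for malformed lines so they can be skipped.
    """
    fields = line.strip().split()
    if len(fields) < 3:
        return None
    ip, timestamp, url = fields[0], fields[1], fields[2]
    return ip, (timestamp, url)

def run_mapper(stream, out=sys.stdout):
    """Emit tab-separated key/value pairs, one per click.

    The MapReduce framework then groups all values sharing an IP key,
    which is what lets sessions be assembled per user.
    """
    for line in stream:
        kv = map_click(line)
        if kv:
            ip, (timestamp, url) = kv
            out.write(f"{ip}\t{timestamp}\t{url}\n")
```

In an actual Hadoop Streaming job, you would call something like `run_mapper(sys.stdin)` so the framework can feed the mapper its assigned file blocks.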
The reducer picks up these records and performs aggregations to total the number and value of abandoned carts per month, and to tally the most common final pages that users viewed before ending their sessions.
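A reducer along these lines could be sketched as follows; the record layout and state labels are assumptions for illustration, standing in for whatever the map phase of your job actually emits:

```python
from collections import Counter

def reduce_sessions(records):
    """Aggregate per-session records into cart-abandonment totals.

    Each record is a hypothetical tuple assembled by the map phase:
        (ip, final_page, cart_value, state)
    where state is 'purchased' or 'abandoned' (assumed labels).
    Returns the abandoned-cart count, their total value, and the
    final pages viewed before abandonment, most common first.
    """
    abandoned_count = 0
    abandoned_value = 0.0
    final_pages = Counter()
    for _ip, final_page, cart_value, state in records:
        if state == "abandoned":
            abandoned_count += 1
            abandoned_value += cart_value
            final_pages[final_page] += 1
    return abandoned_count, abandoned_value, final_pages.most_common()

# Example input: three user sessions, two of them abandoned.
sample = [
    ("10.0.0.1", "/checkout/shipping", 120.0, "abandoned"),
    ("10.0.0.2", "/order/confirmed", 45.0, "purchased"),
    ("10.0.0.3", "/checkout/shipping", 80.0, "abandoned"),
]
count, value, pages = reduce_sessions(sample)
```

Run over a month of sessions, output like this is exactly what feeds the abandonment reports described earlier: how many carts were abandoned, how much revenue they represented, and which pages users saw last.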