The Hadoop-Based Landing Zone
When you try to puzzle out what an analytics environment might look like in the future, you stumble across the pattern of the Hadoop-based landing zone time and time again. In fact, it’s no longer even a futures-oriented discussion: the landing zone has become the way that forward-looking companies now try to reduce IT costs and provide a platform for innovative data analysis.
So what exactly is the landing zone? At the most basic level, the landing zone is merely the central place where data will land in your enterprise — weekly extractions of data from operational databases, for example, or from systems generating log files. Hadoop is a useful repository in which to land data, for these reasons:
It can handle all kinds of data.
It’s easily scalable.
Once you land data in Hadoop, you have the flexibility to query, analyze, or process the data in a variety of ways.
This diagram only shows part of the story and is by no means complete. After all, you need to know how the data moves from the landing zone to the data warehouse, and so on.
The starting point for the discussion on modernizing a data warehouse has to be how organizations use data warehouses and the challenges IT departments face with them.
In the 1980s, once organizations became good at storing their operational information in relational databases (sales transactions, for example, or supply chain statuses), business leaders began to want reports generated from this relational data. The earliest relational stores were operational databases and were designed for Online Transaction Processing (OLTP), so that records could be inserted, updated, or deleted as quickly as possible.
This is an impractical architecture for large-scale reporting and analysis, so Relational Online Analytical Processing (ROLAP) databases were developed to meet this need. This led to the evolution of a whole new kind of RDBMS: the data warehouse, a separate entity that lives alongside an organization’s operational data stores.
This comes down to using purpose-built tools for greater efficiency: you have operational data stores, which are designed to efficiently process transactions, and data warehouses, which are designed to support repeated analysis and reporting.
Data warehouses are under increasing stress, though, for the following reasons:
Increased demand to keep longer periods of data online.
Increased demand for processing resources to transform data for use in other warehouses and data marts.
Increased demand for innovative analytics, which requires analysts to pose questions on the warehouse data, on top of the regular reporting that’s already being done. This can incur significant additional processing.
In the figure, you can see the data warehouse presented as the primary resource for the various kinds of analysis listed on the far right side of the figure. Here you also see the concept of a landing zone represented, where Hadoop will store data from a variety of incoming data sources.
To enable a Hadoop landing zone, you’ll need to ensure you can write data from the various data sources to HDFS. For relational databases, a good solution is Apache Sqoop.
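As a rough sketch of what landing one relational table might look like, the following Python snippet assembles a Sqoop import command line. The JDBC URL, table name, and HDFS target directory are hypothetical placeholders, not values from this chapter.

```python
# Sketch: the shape of a Sqoop import that lands one relational table in HDFS.
# The connection string, table, and target directory below are hypothetical.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",        # hypothetical source database
    "--table", "transactions",                       # hypothetical table to extract
    "--target-dir", "/landing/sales/transactions",   # HDFS landing directory
    "--num-mappers", "4",                            # parallel map tasks for the extract
]
print(" ".join(sqoop_import))
```

Run on a Hadoop edge node, a command like this writes the table’s rows as files under the target directory; a weekly extraction would typically wrap it in a scheduler and vary the target path by date.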
But landing the data is only the beginning.
When you’re moving data from many sources into your landing zone, one issue that you’ll inevitably run into is data quality. It’s common for companies to have many operational databases where key details differ: a customer might be known as “D. deRoos” in one database and “Dirk deRoos” in another, for example.
Another quality problem lies in systems that rely heavily on manual data entry, whether by customers or by staff. Here it’s not uncommon to find first and last names switched around, or other errors in the data fields.
Data quality issues are a big deal for data warehouse environments, and that’s why a lot of effort goes into cleansing and validation steps as data from other systems is processed and loaded into the warehouse. It all comes down to trust: if the data you’re asking questions against is dirty, you can’t trust the answers in your reports.
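As a toy illustration of the kind of matching check a cleansing step might apply, the following sketch (using only the Python standard library) scores two customer-name strings for similarity. The comparison is done twice, once on the cleaned strings and once with the name tokens sorted, so that swapped first and last names still match; the function name and threshold are illustrative choices, not part of any particular cleansing product.

```python
import re
from difflib import SequenceMatcher

def _clean(name):
    """Lowercase a name and strip punctuation, keeping only word tokens."""
    return re.findall(r"[a-z]+", name.lower())

def name_similarity(a, b):
    """Score two customer-name strings on a 0..1 scale.

    Compares the cleaned strings directly, and again with tokens sorted
    so that swapped first/last names ("deRoos, Dirk") still score highly.
    """
    ta, tb = _clean(a), _clean(b)
    plain = SequenceMatcher(None, " ".join(ta), " ".join(tb)).ratio()
    token_sorted = SequenceMatcher(
        None, " ".join(sorted(ta)), " ".join(sorted(tb))
    ).ratio()
    return max(plain, token_sorted)

# "D. deRoos" and "Dirk deRoos" score well above a 0.8 cutoff,
# while an unrelated name scores far below it.
print(name_similarity("D. deRoos", "Dirk deRoos"))
print(name_similarity("Dirk deRoos", "deRoos, Dirk"))
print(name_similarity("Dirk deRoos", "Alice Smith"))
```

Real cleansing pipelines go much further (standardized abbreviations, address verification, master data management), but even a simple score like this shows why the work has to happen before the data is trusted for reporting.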
So while there’s huge potential in having access to many different data sets from different sources in your Hadoop landing zone, you have to factor in data quality and how much you can trust the data.