Data Warehouse Modernization with Hadoop
Data warehouses are under stress as they try to cope with increasing demands on their finite resources. Hadoop can provide significant relief.
The rapid rise in the amount of data generated in the world has also affected data warehouses: the volumes of data they manage are increasing, partly because more structured data (the kind that is strongly typed and slotted into rows and columns) is being generated, and partly because regulatory requirements often oblige you to maintain queryable access to historical data.
In addition, the processing power in data warehouses is often used to perform transformations of the relational data as it either enters the warehouse itself or is loaded into a child data mart (a separate subset of the data warehouse) for a specific analytics application.
Finally, analysts increasingly need to issue new queries against the structured data stored in warehouses, and these ad hoc queries can consume significant processing resources. Sometimes a one-time report suffices; at other times, exploratory analysis is needed to find questions that haven’t been asked yet and that may yield significant business value.
The bottom line is that data warehouses are often being used for purposes beyond their original design.
The figure shows, at a high architectural level, how Hadoop can live alongside data warehouses and fulfill some of the purposes that they aren’t designed for.
Hadoop is a warehouse helper, not a warehouse replacement. Hadoop can modernize a data warehousing ecosystem in four ways; here they are in summary:
Provide a landing zone for all data.
Persist the data to provide a queryable archive of cold data.
Leverage Hadoop’s large-scale batch processing efficiencies to preprocess and transform data for the warehouse.
Enable an environment for ad hoc data discovery.
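To make the third item more concrete, here is a minimal sketch of the kind of preprocessing a batch job might perform before data reaches the warehouse: raw text lines from a landing zone are parsed into typed values, malformed lines are dropped, and the remainder is aggregated into rows ready for loading. It is written in plain Python rather than against any Hadoop API, and the record layout (order date, region, amount), the sample data, and the function names are all assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records as they might sit in a landing zone:
# untyped, comma-separated text (order_date, region, amount).
RAW_LINES = [
    "2012-01-15,east,100.50",
    "2012-01-15,west,20.00",
    "2012-01-16,east,5.25",
    "not-a-valid-line",
    "2012-01-16,east,4.75",
]


def map_record(line):
    """Map step: parse one raw line into a typed (key, value) pair.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.split(",")
    if len(parts) != 3:
        return None
    raw_date, region, raw_amount = parts
    try:
        order_date = date.fromisoformat(raw_date)
        amount = float(raw_amount)
    except ValueError:
        return None
    return (order_date, region), amount


def reduce_totals(pairs):
    """Reduce step: sum the amounts for each (order_date, region) key."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


def transform(lines):
    """Run map, filter, and reduce, returning sorted rows ready for loading."""
    pairs = [pair for pair in map(map_record, lines) if pair is not None]
    totals = reduce_totals(pairs)
    return sorted(
        (order_date.isoformat(), region, round(total, 2))
        for (order_date, region), total in totals.items()
    )


rows = transform(RAW_LINES)
for row in rows:
    print(row)
```

In a real deployment the map and reduce steps would run in parallel across the cluster over far larger inputs; the point of the sketch is only that the parsing, cleansing, and aggregation work happens outside the warehouse, which then receives compact, strongly typed rows.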