Data Extraction, Movement, and Loading for Data Warehouse Deluxe

By Thomas C. Hammergren

Data warehouse deluxe implementations are big — and getting bigger all the time. Implementations that use hundreds of gigabytes (a gigabyte equals 1 billion bytes) and even terabytes (1 trillion bytes) are increasingly common. To manage this volume of data and user access, you need a very robust server and database.

Prepare for the challenge! With a data warehouse lite, you can usually handle source-to-warehouse movement of data in a straightforward, low-tech manner — but with the data warehouse deluxe, you’re now entering the Difficulty Zone, where many data warehousing projects meet their Waterloo.

You’re likely to experience difficulty in this domain for several reasons:

  • You’re dealing with many different data sources, some of which might contain overlapping data. For example, suppliers’ information might come from two different purchasing systems, and some of your suppliers have entries in both systems.

    You’ll probably run into different sets of identifiers that you have to converge (for example, six alphanumeric characters that are identified as the SUPPLIER_ID in one of the systems and a unique integer known as SUP_NUM in the other).

  • If your data warehouse is large (measuring more than about 250 gigabytes), you’re likely to have difficulty fitting your extraction, movement, and loading processes into your batch windows. Batch windows, the time frames in which updates are made to the warehouse, become harder to schedule as the number of data sources you have to handle grows.

  • The chances of having a messed-up extraction, movement, transformation, and loading process are exponentially related to the number of data elements to be loaded into the data warehouse.

    If you could assign some difficulty factor (an integer, for example) to the process of getting data into the warehouse, the following relationship would hold: You have n data elements that you want to include in the data warehouse, with a difficulty factor of x. If you instead have 2n data elements, your difficulty factor isn’t 2x; rather, it’s x squared.

    To make this difficulty factor easier to understand, assign some numbers to n and x. Say that your data warehouse has 100 elements (n) and the difficulty factor (x) is 5. If you double the number of elements (n = 200), your difficulty factor is 25 (5 squared), not 10 (5 x 2).

  • The process of dealing with so many data sources, all headed toward one place (your data warehouse deluxe), has all the elements of too many cooks in the kitchen, or whatever that saying is.

    To make the extraction, movement, transformation, and loading process go smoothly, you probably have to deal with many different application owners, official keepers of the database, and other people from a variety of different organizations, all of whom have to cooperate like they’re part of a professional symphony orchestra.

    The reality, though, is that they perform more like a group of kindergarten students who each pick a musical instrument from the toy bin and are told, “Now play something!” Although the process isn’t necessarily doomed to failure, expect a number of iterations until you can get the data warehouse deluxe loaded just right.
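The identifier-convergence problem from the first bullet can be sketched in code. The Python below is illustrative only: the field names SUPPLIER_ID and SUP_NUM come from the example in the text, but the crosswalk table and the converge_suppliers function are hypothetical stand-ins for whatever matching logic your own purchasing systems require.

```python
# Hypothetical sketch: converging supplier records from two purchasing
# systems that use different identifiers (a six-character alphanumeric
# SUPPLIER_ID in one system, an integer SUP_NUM in the other).

def converge_suppliers(system_a, system_b, crosswalk):
    """Merge supplier records from two sources into one warehouse list.

    system_a  : list of dicts keyed by 'SUPPLIER_ID' (e.g. 'AB12CD')
    system_b  : list of dicts keyed by 'SUP_NUM' (e.g. 4711)
    crosswalk : dict mapping SUPPLIER_ID -> SUP_NUM for suppliers
                known to exist in both systems
    """
    merged = {}
    # Load system B first, keyed by a surrogate warehouse key.
    for rec in system_b:
        merged[('B', rec['SUP_NUM'])] = dict(rec)
    # Fold in system A, matching overlapping suppliers via the crosswalk.
    for rec in system_a:
        sup_num = crosswalk.get(rec['SUPPLIER_ID'])
        if sup_num is not None and ('B', sup_num) in merged:
            merged[('B', sup_num)].update(rec)  # same supplier: merge fields
        else:
            merged[('A', rec['SUPPLIER_ID'])] = dict(rec)
    return list(merged.values())

# Example: one supplier appears in both systems under different IDs.
a = [{'SUPPLIER_ID': 'AB12CD', 'name': 'Acme Corp'}]
b = [{'SUP_NUM': 4711, 'city': 'Toledo'},
     {'SUP_NUM': 4712, 'city': 'Dayton'}]
xwalk = {'AB12CD': 4711}
rows = converge_suppliers(a, b, xwalk)
# The overlapping supplier collapses into a single merged record.
```

In practice the crosswalk itself is the hard part — somebody has to research which suppliers in one system match which in the other — but once it exists, the merge is mechanical.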
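The difficulty-factor arithmetic above amounts to an exponential rule: doubling the element count squares the factor. One way to express that rule (this formula is an illustration of the book’s rule of thumb, not a standard metric) is difficulty = x raised to the power n divided by the baseline element count:

```python
# Illustrative only: the "doubling the elements squares the difficulty"
# rule from the text, expressed as difficulty(n) = x ** (n / n0).

def difficulty(n_elements, base_elements=100, base_factor=5):
    """Difficulty factor, given that base_elements elements rate base_factor."""
    return base_factor ** (n_elements / base_elements)

print(difficulty(100))  # 5.0  (the starting point from the text)
print(difficulty(200))  # 25.0 (doubling squares it; it doesn't double to 10)
```

Note how quickly this grows: at 400 elements the factor is already 625, which is the real point of the warning.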

A data warehouse deluxe can have three tiers (like a data warehouse lite), except with more data sources and perhaps more than one type of user tool accessing the warehouse. But the architecture for a data warehouse deluxe probably looks more like what’s shown in this figure, with many different collection points for data.


In addition to any other “way stations” your particular environment requires, you might have these elements:

  • Data mart: Receives subsets of information from the data warehouse deluxe and serves as the primary access point for users.

  • Interim transformation station: An area in which sets of data extracted from some of the sources undergo some type of transformation process before moving down the pipeline toward the warehouse’s database.

  • Quality assurance station: An area in which groups of data undergo intensive quality assurance checks before you let them move into the data warehouse.
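As a rough sketch of how these way stations chain together, the following Python models extracted records passing through an interim transformation station and a quality assurance station before loading. Every function and field name here is hypothetical; a real environment would use ETL tooling rather than hand-rolled functions, but the staging idea is the same.

```python
# Hypothetical sketch of the "way station" pipeline: extracted records
# pass through an interim transformation station and a quality assurance
# station before being loaded into the warehouse.

def transform_station(records):
    """Interim transformation: normalize field formats."""
    return [{**r, 'name': r['name'].strip().upper()} for r in records]

def qa_station(records):
    """Quality assurance: hold back records missing required fields."""
    passed = [r for r in records
              if r.get('name') and r.get('amount') is not None]
    rejected = [r for r in records if r not in passed]
    return passed, rejected

def load_warehouse(records, warehouse):
    """Final load step: append QA-approved records."""
    warehouse.extend(records)

warehouse = []
extracted = [{'name': ' acme ', 'amount': 100},
             {'name': '', 'amount': 50}]  # second record will fail QA
good, bad = qa_station(transform_station(extracted))
load_warehouse(good, warehouse)
# warehouse now holds one cleaned record; one was held back for review
```

The design point is that each station has one job, so a failed QA check stops bad data before it reaches the warehouse — and before it gets copied onward into any data marts.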