Data Warehousing and the Infrastructure Challenge

By Thomas C. Hammergren

By its nature, a data warehouse is composed primarily, or exclusively, of data that comes from elsewhere (other application databases) and is converted into a data asset. As a result, it can’t stand alone as an independent entity within your organization.

The phenomenal growth of distributed computing (the Internet and intranets, as well as the warehousing of internal and external data) has resulted in a fundamental shift in the way applications are constructed. In the old days of mainframes and minicomputers, a single physical system largely contained the entire infrastructure (operating systems, databases and file systems, and communications and transaction managers).

With distributed computing now the dominant model (even mainframes and minicomputers are usually part of a larger distributed environment), the infrastructure is spread over many different platforms across your enterprise and possibly outside of your enterprise.

When you develop any application or system, whether a data warehouse or a more traditional transaction-processing application, you have significant dependencies on pieces of the overall environment over which you have no direct control. Here are some examples specific to data warehousing:

  • You design a data warehouse that, based on business requirements and applications’ data availability policies, must have approximately 25 gigabytes of new and updated data extracted from various sources each evening and sent over the network to the hardware platform on which the data warehouse is running.

    Your corporate networking infrastructure, however, is undersized. After additional analysis, you find that the network can’t come close to supporting the throughput necessary to move the data into your warehouse in the available time window.

  • During the data warehousing project’s scope phase, you determine that a push strategy to update the data warehouse is the most appropriate model to follow. To implement a push strategy, though, you must modify each source application to include code that detects when that application must push (send) data to the data warehouse.

    The legacy applications that provide data to the warehouse are, unfortunately, so difficult to understand that a policy of making no changes unless absolutely necessary is in effect for each application.

  • You decide to pursue a relational OLAP (or ROLAP) solution and run a series of benchmarks against three relational DBMS (RDBMS) products to see which one best supports informational and decision-support processing (rather than transaction processing).

    The product that performed most poorly in your benchmarks is, unfortunately, also your corporate standard, and any relational database installed anywhere in your company must be of this variety, no matter how you plan to use it.
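The network example in the first bullet comes down to simple arithmetic that you can check before construction begins. The following is a minimal sketch, not a sizing tool: the 25-gigabyte nightly volume comes from the example above, while the 4-hour load window, the link speeds, and the 60 percent usable-bandwidth factor are hypothetical values chosen only for illustration.

```python
def required_throughput_mbps(data_gb: float, window_hours: float) -> float:
    """Sustained megabits per second needed to move data_gb within window_hours.

    Uses decimal units (1 GB = 10**9 bytes), which is an assumption.
    """
    bits = data_gb * 8 * 10**9
    seconds = window_hours * 3600
    return bits / seconds / 10**6


def fits_window(data_gb: float, window_hours: float,
                link_mbps: float, efficiency: float = 0.6) -> bool:
    """Return True if the link can carry the nightly load in the window.

    efficiency is a hypothetical factor for protocol overhead and
    competing traffic; measure your own network rather than trusting it.
    """
    usable_mbps = link_mbps * efficiency
    return required_throughput_mbps(data_gb, window_hours) <= usable_mbps


if __name__ == "__main__":
    nightly_gb = 25      # from the example above
    window_hours = 4     # hypothetical load window
    needed = required_throughput_mbps(nightly_gb, window_hours)
    print(f"Sustained throughput needed: {needed:.1f} Mbps")
    for link in (10, 100):  # hypothetical link speeds in Mbps
        verdict = "fits" if fits_window(nightly_gb, window_hours, link) else "does not fit"
        print(f"{link} Mbps link: {verdict}")
```

Running the same check with your own measured link speed and real load window shows early whether the network, rather than the warehouse design, is the limiting constraint.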

Thinking conceptually (not worrying about implementation details) in the early stages of a data warehousing project, or any other application development effort, is not only acceptable but also good systems development practice.

At some point, however, you must consider hardware, software, costs, budget, and other types of real-world constraints. Before you begin construction, be sure to consider everything that can affect your designs and plans for your data warehouse.

A data warehousing project is very similar to building a house. You follow a process whereby you determine your needs, and then the architect draws up blueprints. The blueprints highlight the materials that you need to support your requirements, ensuring that the finished product fulfills the vision established in the beginning.