Middleware Services: Data Selection and Extractions - dummies

Middleware Services: Data Selection and Extractions

By Thomas C. Hammergren

The primary purpose of the data-selection and –extraction service is to select from (find in) a data source the data that you want to move into the data warehouse and then extract (pull out) that data into a form that can be readied for quality assurance services.

You can use one of two different types of selection and extraction services for your data warehousing environment:

  • Get ’em all and sort ’em out later: Find and extract all the data elements in a source that you want to load into your data warehouse, regardless of whether a specific element has been previously extracted.

  • Change-oriented: Find and extract only the data elements that have been either newly added to the data source or updated since the last extraction.

The first type of service requires less complex logic in order to perform the extraction. But you have to deal with larger volumes (sometimes, much larger volumes) of data than with the second type, the change-oriented service.

The change-oriented method of selection and extraction is fairly straightforward when your source is a relational database that has a time stamp you can use to detect when a row of data was added or last updated.

You can compare a row of data against the date and time of the last extraction process to determine whether data needs to be selected and extracted. But when the data is stored in a file that doesn’t have a time stamp (a VSAM file, for example), this process can be significantly more difficult.

You might also face a challenge when source data has been deleted from either a file or a database. If the business rules for your data warehousing environment call for the deletion of corresponding data from the warehouse, you must have a way to detect deletions that were made since the last extraction process to ensure that appropriate deletions are made in your warehouse.

The result of the selection and extraction is, well, an extract of data that’s ready to undergo additional processing: checking out the data quality.