
Getting the right perspective on data quality can be very challenging in the world of big data. With the majority of big data sources, you need to assume that you are working with data that is not clean. In fact, the overwhelming abundance of seemingly random and disconnected data in social media streams is part of what makes those sources so useful to businesses.

You start by searching petabytes of data without knowing what you might find once you begin looking for patterns. You need to accept the fact that a lot of noise will exist in the data; only by searching and pattern matching will you be able to find some sparks of truth in the midst of some very dirty data.

Of course, some big data sources, such as data from RFID tags or sensors, have better-established rules than social media data. Sensor data should be reasonably clean, although you can expect to find some errors. When you analyze massive amounts of data, it is always your responsibility to plan for the quality level of that data. You should follow a two-phase approach to data quality:

Phase 1: Look for patterns in big data without concern for data quality.
Phase 2: After you locate your patterns and establish results that are important to the business, apply the same data quality standards that you apply to your traditional data sources. You want to avoid collecting and managing big data that is not important to the business and that could corrupt other data elements in Hadoop or other big data platforms. (A sketch of this two-phase flow follows the list.)
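Here's a minimal Python sketch of that two-phase flow. The social media posts, the product name being matched, and the validity rule are all hypothetical, and a real pipeline would run at far larger scale:

```python
import re

# Phase 1: scan raw, uncleaned records for a pattern of interest,
# with no concern yet for data quality. The posts and the product
# name "AcmePhone" are hypothetical.
raw_posts = [
    {"user": "a1", "text": "love my new AcmePhone!!"},
    {"user": None, "text": "acmephone battery died :("},
    {"user": "b2", "text": "asdf 123 qqq"},  # noise
    {"user": "c3", "text": "AcmePhone??? maybe"},
]

pattern = re.compile(r"acmephone", re.IGNORECASE)
hits = [p for p in raw_posts if pattern.search(p["text"])]

# Phase 2: the pattern matters to the business, so apply the same
# kind of quality standard you would use on traditional data before
# the matches flow into Hadoop or another platform.
def is_valid(post):
    return post["user"] is not None and post["text"].strip() != ""

clean_hits = [p for p in hits if is_valid(p)]
print(len(hits), "raw matches,", len(clean_hits), "kept after quality checks")
```

The point of the split is that the noisy records are cheap to scan but expensive to keep; quality rules are applied only to the results worth keeping.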

As you begin to incorporate the outcomes of your big data analysis into your business process, recognize that high-quality data is essential for a company to make sound business decisions. This is true for big data as well as traditional data.

The quality of data refers to characteristics of the data, including consistency, accuracy, reliability, completeness, timeliness, reasonableness, and validity. Data quality software ensures that data elements are represented the same way across different data stores or systems, which increases the consistency of the data.

For example, one data store may use two lines for a customer's address and another data store may use one line. This difference in representation can result in inaccurate information about customers, such as one customer being identified as two different customers.
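As a rough illustration, assuming the addresses arrive as Python dicts with the hypothetical field names shown here, a normalization step like the following is what keeps the two representations from looking like two different customers:

```python
# Two hypothetical stores holding the same customer's address.
store_a = {"id": 17, "addr_line1": "12 Main St", "addr_line2": "Apt 4"}
store_b = {"id": 17, "address": "12 Main St Apt 4"}

def normalize_address(record):
    """Collapse either representation into one canonical string."""
    if "address" in record:
        addr = record["address"]
    else:
        addr = " ".join(
            part for part in (record.get("addr_line1"), record.get("addr_line2"))
            if part
        )
    # Squeeze whitespace and case so formatting differences disappear.
    return " ".join(addr.lower().split())

# After normalization, both records describe the same customer.
print(normalize_address(store_a) == normalize_address(store_b))  # True
```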

A corporation might use dozens of variations of its company name when it buys products. Data quality software can be used to identify all the variations of the company name in your different data stores and ensure that you know everything that this customer purchases from your business.

This process is called providing a single view of the customer or product. Data quality software matches data across different systems and cleans up or removes redundant data. The data quality process provides the business with information that is easier to use, interpret, and understand.
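Here is a toy version of that matching step, using Python's standard difflib as a crude stand-in for the far more sophisticated matching inside commercial data quality software; the names and the 0.6 threshold are illustrative:

```python
from difflib import SequenceMatcher

# Variations of one corporate customer's name across purchase records,
# plus an unrelated company that should not match.
names = ["Acme Corp.", "ACME Corporation", "Acme Corp", "Apex Ltd"]

def similar(a, b, threshold=0.6):
    # difflib's ratio is a simple similarity score; real data quality
    # software uses much richer matching rules. Threshold is illustrative.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

canonical = "Acme Corp."
matches = [n for n in names if similar(n, canonical)]
print(matches)  # the three Acme variants, but not 'Apex Ltd'
```

Once the variants are linked to one canonical name, every purchase that customer makes rolls up under a single view.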

Data profiling tools are used in the data quality process to help you understand the content, structure, and condition of your data. They collect information on the characteristics of the data in a database or other data store to begin the process of turning the data into a more trusted form. The tools analyze the data to identify errors and inconsistencies.

They can make adjustments for these problems and correct errors. The tools check for acceptable values, patterns, and ranges, and they help identify overlapping data. The data profiling process, for example, checks whether the data is expected to be alphabetic or numeric. The tools also check for dependencies and examine how the data relates to data from other databases.
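Here's a small sketch of those profiling checks in Python; the column values, the ZIP code pattern, and the age range are all assumed for illustration:

```python
import re

# A toy profile of two hypothetical columns, mirroring the kinds of
# checks a profiling tool runs at much larger scale.
zip_codes = ["02139", "10001", "ABCDE", "1000", "94105"]
ages = ["34", "29", "unknown", "342"]

zip_pattern = re.compile(r"^\d{5}$")  # expected pattern: five digits

profile = {
    "zip_bad_pattern": [z for z in zip_codes if not zip_pattern.match(z)],
    "age_non_numeric": [a for a in ages if not a.isdigit()],  # alphabetic vs. numeric
    "age_out_of_range": [a for a in ages if a.isdigit() and not 0 < int(a) < 120],
}
print(profile)
# {'zip_bad_pattern': ['ABCDE', '1000'],
#  'age_non_numeric': ['unknown'],
#  'age_out_of_range': ['342']}
```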

Data profiling tools for big data serve a similar function to data profiling tools for traditional data. Data profiling tools for Hadoop provide you with important information about the data in Hadoop clusters. These tools can be used to look for matches and remove duplications, so that your big data is consistent. Hadoop languages such as HiveQL and Pig Latin can be used for the transformation process.
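For example, a deduplication pass might look like this minimal PySpark sketch (PySpark is one common Python route into the same Hadoop/Spark ecosystem; the column names and data here are hypothetical):

```python
from pyspark.sql import SparkSession

# A minimal sketch, assuming PySpark is available on the cluster;
# the rows and column names are made up for illustration.
spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

rows = [("C001", "Acme Corp."), ("C001", "Acme Corp."), ("C002", "Apex Ltd")]
df = spark.createDataFrame(rows, ["customer_id", "customer_name"])

# Drop exact duplicates on the matching key, the same kind of
# transformation you might express in HiveQL or Pig Latin.
deduped = df.dropDuplicates(["customer_id"])
deduped.show()
```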
