The Need for Reliable Sources in Data Science Applications

Data Science Essentials For Dummies

Data science seems like a terribly precise field, but the outcomes are only as reliable as your data. The word reliable seems so easy to define when it comes to data sources, yet so hard to implement. A data source is reliable when the results it produces are both expected and consistent. A reliable data source produces mundane data that contains no surprises; no one is shocked in the least by the outcome.

On the other hand, depending on your perspective, it could actually be a good thing that most people aren’t yawning and then falling asleep when reviewing data. That’s because the surprises make the data source worth analyzing and reviewing.


Consequently, data has an aspect of duality. You want reliable, mundane, fully anticipated data that simply confirms what you already know, but the unexpected is what makes collecting the data useful in the first place.

You can also define reliability by the number of failure points contained in any measured resource. If two data sources are otherwise equally reliable, the one with more failure points is the less reliable of the two.

Given that general data analysis, AI, machine learning, and deep learning all require huge amounts of information, the methodology used automatically reduces the reliability of such data because you have more failure points to consider. Consequently, you must have data from highly reliable sources of the correct type.
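One rough way to see why more failure points lower reliability is to treat every failure point as an independent step that works with some fixed probability. That model is an assumption made for illustration, not something the text prescribes, and the function name and the 0.999 figure are hypothetical. Under that assumption, the chance that everything works is the per-point reliability raised to the number of failure points:

```python
def combined_reliability(per_point_reliability: float, failure_points: int) -> float:
    """Chance that every failure point works.

    Assumes (for illustration only) that failure points are independent
    and that each one works with the same probability.
    """
    if not 0.0 <= per_point_reliability <= 1.0:
        raise ValueError("per_point_reliability must be between 0 and 1")
    if failure_points < 0:
        raise ValueError("failure_points must not be negative")
    return per_point_reliability ** failure_points


# Hypothetical example: each failure point works 99.9 percent of the time.
for count in (1, 10, 100, 1000):
    print(count, round(combined_reliability(0.999, count), 4))
```

Even with very dependable individual steps, the combined figure drops steadily as the number of failure points grows, which is why the quality of each source matters so much at large scale.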

Scientists had been fighting against impressive amounts of data for years before anyone coined the term big data. At that point, the Internet didn’t produce the vast sums of data that it does today. Remember that big data is not simply a fad created by software and hardware vendors but has a basis in many of the following fields:

Astronomy: Consider the data received from spacecraft on a mission (such as Voyager or Galileo) and all the data received from radio telescopes, which are specialized antennas used to receive radio waves from astronomical bodies.

A common example is the Search for Extraterrestrial Intelligence (SETI) project, which looks for extraterrestrial signals by observing radio frequencies arriving from space. The amount of data received and the computer power used to analyze a portion of the sky for a single hour are impressive. If aliens are out there, they are very hard to spot. (The movie Contact explores what could happen should humans actually intercept a signal.)

Meteorology: Think about trying to predict weather for the near term given the large number of required measures, such as temperature, atmospheric pressure, humidity, winds, and precipitation at different times, locations, and altitudes. Weather forecasting is really one of the first problems in big data, and quite a relevant one. According to Weather Analytics, a company that provides climate data, more than 33 percent of the Worldwide Gross Domestic Product (GDP) is determined by how weather conditions affect agriculture, fishing, tourism, and transportation, just to name a few.

As far back as the 1950s, the first supercomputers were used to crunch as much data as possible because, in meteorology, the more data, the more accurate the forecast. That’s the reason everyone is amassing more storage and processing capacity; the Korea Meteorological Administration, for example, has done so for weather forecasting and studying climate change.

Physics: Consider the large amounts of data produced by experiments using particle accelerators in an attempt to determine the structure of matter, space, and time. For example, the Large Hadron Collider, the largest particle accelerator ever created, produces 15PB (petabytes) of data every year as a result of particle collisions.

Genomics: Sequencing a single DNA strand, which means determining the precise order of the many combinations of the four bases (adenine, guanine, cytosine, and thymine) that constitute the structure of the associated molecule, requires quite a lot of data.

For instance, a single chromosome, a structure containing the DNA in the cell, may require from 50MB to 300MB. A human being normally has 46 chromosomes, and the DNA data for just one person consumes an entire DVD. Just imagine the massive storage required to document the DNA data of a large number of people or to sequence other life forms on earth.

Oceanography: Consider the data gathered from the many sensors placed in the oceans to measure statistics, such as temperature and currents, using hydrophones and other instruments. This data even includes sounds for acoustic monitoring for scientific purposes (discovering characteristics about fish, whales, and plankton) and military defense purposes (finding sneaky submarines from other countries).

This old surveillance problem is becoming more complex and more digital.

Satellites: Recording images from the entire globe and sending them back to Earth to monitor the planet’s surface and atmosphere isn’t a new business (TIROS 1, the first satellite to send back images and data, dates back to 1960). Over the years, however, the world has launched more than 1,400 active satellites that provide Earth observation.

The amount of data arriving on Earth is astonishing and serves both military (surveillance) and civilian purposes, such as tracking economic development, monitoring agriculture, and monitoring changes and risks. A single European Space Agency satellite, Sentinel 1A, generates 5PB of data during two years of operation.
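To put a few of the figures quoted in the list above into everyday terms, the short sketch below converts them into daily rates and a per-person total. It assumes decimal units (1PB equals 1,000TB and 1GB equals 1,000MB) and 365-day years; the function and variable names are made up for illustration.

```python
TB_PER_PB = 1000      # assumption: decimal units
MB_PER_GB = 1000      # assumption: decimal units
DAYS_PER_YEAR = 365   # assumption: leap years ignored


def tb_per_day(petabytes: float, years: float) -> float:
    """Average terabytes per day for a total volume spread over some years."""
    return petabytes * TB_PER_PB / (years * DAYS_PER_YEAR)


# Figures taken from the list above.
lhc_rate = tb_per_day(15, 1)       # Large Hadron Collider: 15PB every year
sentinel_rate = tb_per_day(5, 2)   # Sentinel 1A: 5PB during two years

# 46 chromosomes at 50MB to 300MB each.
CHROMOSOMES = 46
genome_low_gb = CHROMOSOMES * 50 / MB_PER_GB
genome_high_gb = CHROMOSOMES * 300 / MB_PER_GB

print(f"Large Hadron Collider: about {lhc_rate:.1f}TB per day")
print(f"Sentinel 1A: about {sentinel_rate:.1f}TB per day")
print(f"One person's DNA: roughly {genome_low_gb:.1f}GB to {genome_high_gb:.1f}GB")
```

These are averages only; actual data rates vary with operating schedules and with how the data is stored.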

All these data sources have one thing in common: Someone collects and stores the data as static information (once collected, the data doesn’t change). This means that if errors are found, correcting them is possible, and each correction increases the overall reliability of the data.

The key takeaway here is that you likely deal with immense amounts of data from various sources that could have any number of errors. Finding these errors in such huge quantities is nearly impossible. Using the most reliable sources that you can will increase the overall quality of the original data, reducing the effect of individual data failure points. In other words, sources that provide consistent data are more valuable than sources that don’t.
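As a minimal sketch of what comparing sources by consistency might look like, the example below ranks sources by how much their repeated readings vary around the average (the coefficient of variation). The sensor names and readings are invented for illustration, and this measure is just one possible choice rather than a method given in the text.

```python
from statistics import mean, pstdev


def consistency_score(readings: list[float]) -> float:
    """Coefficient of variation: a lower score means more consistent readings."""
    average = mean(readings)
    if average == 0:
        raise ValueError("score is undefined when the average reading is zero")
    return pstdev(readings) / abs(average)


# Invented readings of the same quantity from two hypothetical sources.
readings_by_source = {
    "sensor_a": [20.1, 20.0, 19.9, 20.0],
    "sensor_b": [20.1, 23.5, 17.2, 19.4],
}

# Most consistent source first.
ranked = sorted(
    readings_by_source,
    key=lambda name: consistency_score(readings_by_source[name]),
)
print(ranked)
```

Keep in mind that a consistent source can still be consistently wrong, so a check like this complements validation against known values rather than replacing it.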

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.