The 4 V’s of Big Data
The general consensus of the day is that there are specific attributes that define big data. In most big data circles, these are called the four V’s: volume, variety, velocity, and veracity. (You might consider a fifth V, value.)
The main characteristic that makes data “big” is its sheer volume. It makes little sense to define big data by a fixed minimum size, because the total amount of information grows exponentially every year. In 2010, Thomson Reuters estimated in its annual report that the world was “awash with over 800 exabytes of data and growing.”
For that same year, EMC, a hardware company that makes data storage devices, thought it was closer to 900 exabytes and would grow by 50 percent every year. No one really knows how much new data is being generated, but the amount of information being collected is huge.
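To get a feel for what 50 percent annual growth implies, here is a minimal sketch that compounds the EMC estimate forward. The 900 exabyte starting point and the 50 percent rate are the figures quoted above; the function name and the five-year horizon are illustrative assumptions, and the output is a projection, not a measurement.

```python
def project_volume(start_exabytes, annual_growth, years):
    """Return projected volumes for year 0 through `years`, compounding annually."""
    return [start_exabytes * (1 + annual_growth) ** year for year in range(years + 1)]


# Assumed inputs: 900 exabytes in 2010, growing 50 percent per year.
projection = project_volume(900, 0.50, 5)

for offset, volume in enumerate(projection):
    print(f"{2010 + offset}: {volume:,.0f} exabytes")
```

At that rate the total more than doubles every two years, which is why a fixed size threshold for "big" goes stale quickly.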
Variety is one of the most interesting developments in technology as more and more information is digitized. Traditional data types (structured data) include things on a bank statement like date, amount, and time. These are things that fit neatly in a relational database.
Structured data is augmented by unstructured data, a category that includes Twitter feeds, audio files, MRI images, web pages, and web logs: anything that can be captured and stored but doesn’t have a meta model that neatly defines it. (A meta model is a set of rules that frames a concept or idea; it defines a class of information and how to express it.)
Unstructured data is a fundamental concept in big data. The best way to understand unstructured data is by comparing it to structured data. Think of structured data as data that is well defined by a set of rules. For example, money will always be numeric and have at least two decimal places; names are expressed as text; and dates follow a specific pattern.
With unstructured data, on the other hand, there are no rules. A picture, a voice recording, a tweet — they all can be different but express ideas and thoughts based on human understanding. One of the goals of big data is to use technology to take this unstructured data and make sense of it.
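The contrast between the two kinds of data can be sketched in a few lines. The check below encodes the three rules mentioned for structured data (a numeric amount with at least two decimal places, a text name, a date in a fixed pattern). The field names, the regular expressions, and the YYYY-MM-DD date format are all assumptions chosen for illustration; a real schema would define its own.

```python
import re
from datetime import datetime

# Hypothetical rules for one bank-statement row; real schemas vary.
AMOUNT_RULE = re.compile(r"-?\d+\.\d{2,}")        # numeric, at least two decimal places
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z .'-]*")  # plain text name
DATE_FORMAT = "%Y-%m-%d"                          # assumed date pattern


def is_structured(record):
    """Return True only if the record is a dict that satisfies all three rules."""
    if not isinstance(record, dict):
        return False
    try:
        datetime.strptime(record["date"], DATE_FORMAT)
    except (KeyError, TypeError, ValueError):
        return False
    amount = record.get("amount")
    name = record.get("name")
    return (
        isinstance(amount, str)
        and AMOUNT_RULE.fullmatch(amount) is not None
        and isinstance(name, str)
        and NAME_RULE.fullmatch(name) is not None
    )


statement_row = {"date": "2014-03-01", "name": "Jane Doe", "amount": "125.40"}
tweet = "Just landed in Denver, snow everywhere!"

print(is_structured(statement_row))  # the row follows every rule
print(is_structured(tweet))          # free text has no fields to check
```

The tweet is not "wrong" data; it simply has no predefined fields for the rules to apply to, which is what makes it unstructured.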
The definition of big data depends on whether the data can be ingested, processed, and examined within a time frame that meets a particular business’s requirements. For one company or system, big data may be 50 TB; for another, it may be 10 PB.
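A rough back-of-the-envelope check makes this relative definition concrete. The sketch below assumes a hypothetical pipeline that sustains 1 GB per second and a hypothetical 24-hour processing window; neither figure comes from the text, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
TB = 10**12  # bytes in a decimal terabyte
PB = 10**15  # bytes in a decimal petabyte


def hours_to_process(volume_bytes, throughput_bytes_per_sec):
    """Time in hours to push a volume through a pipeline at a steady rate."""
    return volume_bytes / throughput_bytes_per_sec / 3600


def meets_deadline(volume_bytes, throughput_bytes_per_sec, deadline_hours):
    """True if the whole volume can be processed within the deadline."""
    return hours_to_process(volume_bytes, throughput_bytes_per_sec) <= deadline_hours


# Hypothetical system: 1 GB/s sustained throughput, 24-hour business window.
throughput = 10**9
window_hours = 24

for label, volume in [("50 TB", 50 * TB), ("10 PB", 10 * PB)]:
    hours = hours_to_process(volume, throughput)
    verdict = "fits" if meets_deadline(volume, throughput, window_hours) else "does not fit"
    print(f"{label}: {hours:,.1f} hours, {verdict} in a {window_hours}-hour window")
```

Under these assumptions the same system handles 50 TB comfortably but cannot get through 10 PB in time, so only the second workload would count as "big" for it.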
Veracity refers to the trustworthiness of the data. Can the manager rely on the fact that the data is representative? Every good manager knows that there are inherent discrepancies in all the data collected.
Velocity is the rate at which incoming data arrives and needs to be processed. Think about how many SMS messages, Facebook status updates, or credit card swipes travel over a particular carrier’s network every minute of every day, and you’ll have a good appreciation of velocity. A streaming service like Amazon Kinesis, part of Amazon Web Services, is an example of a tool built to handle the velocity of data.
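Velocity is usually expressed as events per unit of time. As a minimal sketch, the snippet below buckets a handful of made-up arrival times into per-minute counts and finds the busiest minute. The timestamps and the function name are hypothetical, and this is plain Python rather than any streaming service's API.

```python
from collections import Counter


def events_per_minute(timestamps):
    """Count events per minute, given arrival times in seconds from a common start."""
    return Counter(int(ts // 60) for ts in timestamps)


# Made-up arrival times (seconds) for messages hitting a system.
arrival_seconds = [1, 15, 59, 60, 61, 125, 130, 131, 179]

counts = events_per_minute(arrival_seconds)
peak_minute, peak_count = max(counts.items(), key=lambda item: item[1])

for minute in sorted(counts):
    print(f"minute {minute}: {counts[minute]} events")
print(f"peak: minute {peak_minute} with {peak_count} events")
```

A real system would compute the same kind of count continuously over an unbounded stream, and the peak rate, not the average, typically determines how much capacity is needed.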
It may seem painfully obvious to some, but a real objective is critical to this mashup of the four V’s. Will the insights you gather from analysis create a new product line, a cross-sell opportunity, or a cost-cutting measure? Or will your data analysis lead to the discovery of a critical causal effect that results in a cure for a disease?
The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, you’re just performing some technological task for technology’s sake.