Defining Big Data by Its Four Vs
Four characteristics (the four Vs) define big data: volume, velocity, variety, and value. Big data is data that exceeds the processing capacity of conventional database systems because it’s too big, it moves too fast, or it doesn’t fit the structural requirements of traditional database architectures. Whether data volumes rank in the terabyte or petabyte scales, data engineering solutions must be designed to meet requirements for the data’s intended destination and use.
When you’re talking about regular data, you’re likely to hear the words kilobyte or gigabyte used (10^3 and 10^9 bytes, respectively). In contrast, when you’re talking about big data, words like terabyte and petabyte are thrown around (10^12 and 10^15 bytes, respectively). A byte is an 8-bit unit of data.
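To make these magnitudes concrete, here is a minimal sketch of the unit arithmetic. The names UNITS, to_bytes, and to_bits are hypothetical, chosen only for illustration, and the sketch assumes the decimal (power-of-ten) definitions given above.

```python
# Decimal byte units mentioned in the text, expressed as powers of ten.
UNITS = {
    "kilobyte": 10**3,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}

BITS_PER_BYTE = 8  # a byte is an 8-bit unit of data


def to_bytes(amount, unit):
    """Convert an amount expressed in the given unit to bytes."""
    return amount * UNITS[unit]


def to_bits(amount, unit):
    """Convert an amount expressed in the given unit to bits."""
    return to_bytes(amount, unit) * BITS_PER_BYTE


if __name__ == "__main__":
    # One petabyte expressed in terabytes.
    print(to_bytes(1, "petabyte") // to_bytes(1, "terabyte"))
```

The point of the comparison is that one petabyte holds a thousand terabytes, and one terabyte holds a thousand gigabytes.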
Since the four Vs of big data are continually growing, newer, more innovative data technologies must continuously be developed to manage big data problems.
Whenever you’re in doubt, use the 4V criteria to determine whether you have a big-data or regular-data problem on your hands.
Grappling with data volume
The lower limit of big data volume ranges from a few terabytes up to tens of petabytes per year. The volume numbers by which a big data set is defined have no upper limit. In fact, the volumes of most big data sets are increasing exponentially on a yearly basis.
Handling data velocity
Automated machinery and sensors are generating high-velocity data on a continual basis. In engineering terms, data velocity is data volume per unit time. Big data velocities range anywhere between 30 kilobytes (KB) per second up to even 30 gigabytes (GB) per second. High-velocity, real-time data streams present an obstacle to timely decision making. The capabilities of data-handling and data-processing technologies often limit data velocities.
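Because velocity is volume per unit time, a sustained velocity converts directly into a volume over any period. The sketch below illustrates that arithmetic. The function names are hypothetical, and it assumes decimal units (1 KB = 10^3 bytes, 1 GB = 10^9 bytes).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def velocity_bytes_per_second(volume_bytes, seconds):
    """Data velocity: volume divided by the time over which it arrived."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return volume_bytes / seconds


def daily_volume_bytes(velocity):
    """Volume accumulated in one day at a constant velocity (bytes/second)."""
    return velocity * SECONDS_PER_DAY


if __name__ == "__main__":
    low = 30 * 10**3   # 30 KB per second
    high = 30 * 10**9  # 30 GB per second
    print(daily_volume_bytes(low) / 10**9, "GB per day")
    print(daily_volume_bytes(high) / 10**15, "PB per day")
```

At the low end of the range, a stream accumulates about 2.6 GB per day. At the high end, it accumulates about 2.6 PB per day, which shows how quickly velocity turns into a volume problem.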
Dealing with data variety
Big data makes everything more complicated by mixing unstructured and semi-structured data in with structured datasets. This is called high-variety data. High-variety data sources can be derived from data streams that are generated from social networks or from automated machinery.
Structured data is data that can be stored, processed, and manipulated in a traditional relational database management system. This data can be generated by humans or machines, and is derived from all sorts of sources, from clickstreams and web-based forms to point-of-sale transactions and sensors.
Unstructured data has no predefined structure. It’s commonly generated from human activities and doesn’t fit into a structured database format. Such data could be derived from blog posts, emails, and Word documents.
Semi-structured data is data that doesn’t fit into a structured database system, but is nonetheless structured by tags that are useful for creating a form of order and hierarchy in the data. Semi-structured data is commonly found in database and file systems. It can be stored as log files, XML files, or JSON data files.
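As an illustration of how tags give semi-structured data its order and hierarchy, the sketch below parses two JSON records that do not share the same fields. The records and the field_names helper are invented for this example and are not taken from any real system.

```python
import json

# Two hypothetical log records: both are tagged, but their fields differ,
# so they would not fit one fixed relational table without extra work.
raw_lines = [
    '{"user": "a", "event": "click", "page": "/home"}',
    '{"user": "b", "event": "purchase", "order": {"total": 19.99, "items": 2}}',
]

records = [json.loads(line) for line in raw_lines]


def field_names(record, prefix=""):
    """List the tag paths in a record, descending into nested objects."""
    names = []
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            names.extend(field_names(value, path + "."))
        else:
            names.append(path)
    return names


if __name__ == "__main__":
    for record in records:
        print(field_names(record))
```

The second record carries a nested "order" object that the first lacks. The tags still make every value addressable (for example, order.total), even though no fixed schema applies to both records.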
Creating data value
In its raw form, most big data is low value. In other words, the value-to-data quantity ratio is low in raw big data. Big data is composed of huge numbers of very small transactions that come in a variety of formats. These incremental components of big data produce true value only after they’re rolled up and analyzed. Data engineers have the job of rolling it up, and data scientists have the job of analyzing it.
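A roll-up can be as simple as summing many small transactions by a shared key. The sketch below shows one way to do that in plain Python. The transactions, the field names, and the roll_up function are hypothetical, and a real pipeline would typically run this kind of aggregation in a distributed processing system rather than in memory.

```python
from collections import defaultdict

# Hypothetical small transactions; each one carries little value on its own.
transactions = [
    {"store": "north", "amount": 4.50},
    {"store": "south", "amount": 2.25},
    {"store": "north", "amount": 3.00},
]


def roll_up(rows, key, measure):
    """Sum the measure field across rows, grouped by the key field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    print(roll_up(transactions, "store", "amount"))
```

The per-store totals are the kind of aggregated view that is ready for analysis, while the individual rows are not.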