Structured Data in a Big Data Environment
The term structured data generally refers to data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings. By a commonly cited estimate, this kind of data accounts for about 20 percent of the data that is out there. Structured data is the data you’re probably used to dealing with. It’s usually stored in a database.
Sources of structured big data
Although this might seem like business as usual, structured data is taking on a new role in the world of big data. As technology evolves, it provides newer sources of structured data, often produced in real time and in large volumes. The sources of data are divided into two categories:
Computer- or machine-generated: Machine-generated data generally refers to data that is created by a machine without human intervention.
Human-generated: This is data that humans, in interaction with computers, supply.
Some experts argue that a third category exists that is a hybrid between machine and human. Here though, we’re concerned with the first two categories.
Machine-generated structured data can include the following:
Sensor data: Examples include radio frequency ID tags, smart meters, medical devices, and Global Positioning System data. Companies are interested in this for supply chain management and inventory control.
Web log data: When servers, applications, networks, and so on operate, they capture all kinds of data about their activity. This can amount to huge volumes of data that can be useful, for example, to deal with service-level agreements or to predict security breaches.
Point-of-sale data: When the cashier scans the bar code of a product that you are purchasing, a record of all the data associated with that product is generated.
Financial data: Lots of financial systems are now programmatic; they are operated based on predefined rules that automate processes. Stock-trading data is a good example of this. It contains structured data such as the company symbol and dollar value. Some of this data is machine generated, and some is human generated.
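To make the web log item above concrete, here is a minimal Python sketch of how one line of machine-generated log text maps onto fields with a defined type and format. It assumes the server writes the Common Log Format, which the text does not specify; the sample line and the names LOG_PATTERN and parse_log_line are illustrative only.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident authuser [timestamp] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert text fields into the types a database column would expect.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Made-up sample line using a documentation-reserved IP address.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)
print(record)
```

Each parsed record has the same named fields with the same types, which is what lets this kind of data be loaded into a table and queried.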
Examples of structured human-generated data might include the following:
Input data: This is any piece of data that a human might input into a computer, such as name, age, income, non-free-form survey responses, and so on. This data can be useful to understand basic customer behavior.
Click-stream data: Data is generated every time you click a link on a website. This data can be analyzed to determine customer behavior and buying patterns.
Gaming-related data: Every move you make in a game can be recorded. This can be useful in understanding how end users move through a gaming portfolio.
When one user’s data is taken together with that of millions of other users submitting the same kind of information, the size is astronomical. Additionally, much of this data has a real-time component that can be useful for understanding patterns that have the potential of predicting outcomes.
The bottom line is that this kind of information can be powerful and can be utilized for many purposes.
The role of relational databases in big data
Data persistence refers to how a database retains versions of itself when modified. The great granddaddy of persistent data stores is the relational database management system. In its infancy, the computing industry used what are now considered primitive techniques for data persistence.
The relational model was invented by Edgar Codd, an IBM scientist, in the 1970s and was used by IBM, Oracle, Microsoft, and others. It is still in wide use today and plays an important role in the evolution of big data. Understanding the relational database is important because other types of databases are also used with big data, and the relational model is the baseline against which they are compared.
In a relational model, the data is stored in tables. The database contains a schema, that is, a structural representation of what is in the database. For example, in a relational database, the schema defines the tables, the fields in the tables, and the relationships between the tables.
The data is stored in columns, one for each specific attribute, and in rows, one for each record. Consider an example with two tables: the first stores product information; the second stores demographic information. Each has various attributes. New data can be added to each table, and existing data can be read, updated, and deleted. This is often accomplished in a relational model using Structured Query Language (SQL).
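The ideas of a schema, rows and columns, and the basic operations on a table can be sketched with Python's built-in sqlite3 module standing in for a full relational database management system. The table names demographic and product, the PurchaseID column, and the sample values are assumptions made for illustration; only CustomerID, State, Gender, and Product come from the text.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema: two tables, their columns, and the relationship between them.
conn.executescript("""
CREATE TABLE demographic (
    CustomerID INTEGER PRIMARY KEY,
    State      TEXT,
    Gender     TEXT
);
CREATE TABLE product (
    PurchaseID INTEGER PRIMARY KEY,   -- hypothetical column
    CustomerID INTEGER REFERENCES demographic(CustomerID),
    Product    TEXT
);
""")

# Add new rows (one row per record, one column per attribute).
conn.execute("INSERT INTO demographic VALUES (?, ?, ?)", (1, "CA", "F"))
conn.execute("INSERT INTO product (CustomerID, Product) VALUES (?, ?)", (1, "XXYY"))

# Read a row back.
row = conn.execute(
    "SELECT State, Gender FROM demographic WHERE CustomerID = ?", (1,)
).fetchone()

# Update an attribute of an existing row.
conn.execute("UPDATE demographic SET State = ? WHERE CustomerID = ?", ("NY", 1))
updated_state = conn.execute(
    "SELECT State FROM demographic WHERE CustomerID = ?", (1,)
).fetchone()[0]

# Delete rows.
conn.execute("DELETE FROM product WHERE CustomerID = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]

print(row, updated_state, remaining)
```

The SQL statements would look much the same in other relational databases, although data types and some syntax details differ between products.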
Another aspect of the relational model using SQL is that tables can be queried together using a common key. In this example, the common key in the tables is CustomerID.
You can submit a query, for example, to determine the gender of customers who purchased a specific product. It might look something like this:
SELECT d.CustomerID, d.State, d.Gender, p.Product FROM "demographic table" AS d JOIN "product table" AS p ON d.CustomerID = p.CustomerID WHERE p.Product = 'XXYY';
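A query of this kind can be tried against a small in-memory database. The sketch below uses Python's built-in sqlite3 module and writes the join on CustomerID out explicitly, so that each customer is matched only with that customer's own purchases. The three sample customers and their purchases are made up for illustration, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two small tables that share the CustomerID key (sample data is invented).
conn.executescript('''
CREATE TABLE "demographic table" (CustomerID INTEGER PRIMARY KEY, State TEXT, Gender TEXT);
CREATE TABLE "product table" (CustomerID INTEGER, Product TEXT);
INSERT INTO "demographic table" VALUES (1, 'CA', 'F'), (2, 'TX', 'M'), (3, 'NY', 'F');
INSERT INTO "product table" VALUES (1, 'XXYY'), (2, 'ZZ11'), (3, 'XXYY');
''')

# Join the tables on the common key, then filter by product.
query = '''
SELECT d.CustomerID, d.State, d.Gender, p.Product
FROM "demographic table" AS d
JOIN "product table" AS p ON d.CustomerID = p.CustomerID
WHERE p.Product = ?
ORDER BY d.CustomerID
'''
rows = conn.execute(query, ("XXYY",)).fetchall()

for customer_id, state, gender, product in rows:
    print(customer_id, state, gender, product)
```

Without the condition that ties the two CustomerID columns together, the database would pair every demographic row with every product row, and the result would not describe who bought what.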