Basics of Structured and Unstructured Data in Predictive Analysis - dummies

By Anasse Bari, Mohamed Chaouchi, Tommy Jung

Data contained in databases, documents, e-mails, and other data files for predictive analysis can be categorized as either structured or unstructured. Structured data is well organized, follows a consistent order, is relatively easy to search and query, and can be readily accessed and understood by a person or a computer program.

A classic example of structured data is an Excel spreadsheet with labeled columns. Such structured data is consistent; column headers — usually brief, accurate descriptions of the content in each column — tell you exactly what kind of content to expect.

Structured data is usually stored under a well-defined schema, as in a relational database. It’s usually tabular, with columns and rows that clearly define its attributes.
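As a minimal sketch of what "easy to search and query" can look like in practice, the Python snippet below loads a small table into an in-memory SQLite database and filters it with a single query. The table name, column names, and rows are invented for illustration.

```python
import sqlite3

# Hypothetical customer table; names and values are illustrative only.
rows = [
    ("Alice", 34, 120.50),
    ("Bob", 45, 80.00),
    ("Carla", 29, 210.75),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, total_spent REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# Because every row follows the same schema, one query answers the question.
big_spenders = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM customers WHERE total_spent > ? ORDER BY name", (100,)
    )
]
conn.close()
print(big_spenders)
```

The labeled columns do the work here: the query can refer to `total_spent` directly, with no need to interpret the content first.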

Unstructured data, on the other hand, tends to be free-form, non-tabular, dispersed, and not easily retrievable; such data requires deliberate intervention to make sense of it. Miscellaneous e-mails, documents, web pages, and files (whether text, audio, and/or video) in scattered locations are examples of unstructured data.

It’s hard to categorize the content of unstructured data. It tends to be mostly text, it’s usually created in a hodgepodge of free-form styles, and finding any attributes you can use to describe or group it is no small task.

The content of unstructured data is hard to work with or make sense of programmatically. Computer programs cannot readily analyze or generate reports on such data in its raw form, because it lacks a consistent structure, has no dominant underlying characteristic, and its individual items share little common ground.

In general, there’s more unstructured data than structured data in the world. Unstructured data requires more work to make it useful, so it demands more attention and tends to consume more time.

Don’t underestimate the importance of structured data and the power it brings to your analysis. It’s far more efficient to analyze structured data than to analyze unstructured data. Unstructured data can also be costly to preprocess for analysis as you’re building a predictive analytics project. The selection of relevant data, its cleansing, and subsequent transformations can be lengthy and tedious.

The newly organized data that results from those preprocessing steps can then be used in a predictive analytics model. The wholesale transformation of unstructured data, however, may have to wait until you have your predictive analytics model up and running.

Data mining and text analytics are two approaches to structuring text documents, linking their contents, grouping and summarizing their data, and uncovering patterns in that data. Both disciplines provide a rich framework of algorithms and techniques to mine the text scattered across a sea of documents.
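One simple way to see how text can be given structure is to count the terms in each document. The Python sketch below is only an illustration of that idea, not a full text analytics pipeline; the sample documents and the small stop-word list are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical free-form documents; the text is illustrative only.
documents = [
    "The battery died after two days. Battery life is terrible.",
    "Great screen, but the battery drains fast.",
]

# A tiny stop-word list chosen by hand for this sketch.
STOP_WORDS = {"the", "is", "but", "after", "a", "two"}

def term_counts(text):
    """Lowercase the text, split it into words, and count non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

# One Counter per document: a simple structured view of the text.
table = [term_counts(doc) for doc in documents]

# Summing the per-document counts summarizes the whole collection.
totals = sum(table, Counter())
print(totals.most_common(1))
```

Each document becomes a row of term counts, which can then be grouped, compared, or summarized like any other structured data.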

It’s also worth noting that search engine platforms provide readily available tools for indexing data and making it searchable.
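A common idea behind indexing text for search is the inverted index, which maps each word to the documents that contain it. The Python sketch below is a toy version under simplified assumptions (whitespace tokenization, exact word matches, invented documents); real search platforms add far more, such as ranking and language-aware tokenization.

```python
from collections import defaultdict

# Hypothetical documents keyed by ID; contents are illustrative only.
docs = {
    1: "quarterly sales report for the northeast region",
    2: "customer complaint about late delivery",
    3: "sales forecast and delivery schedule",
}

def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *terms):
    """Return the sorted IDs of documents containing every search term."""
    if not terms:
        return []
    hits = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*hits))

index = build_index(docs)
print(search(index, "sales"))
print(search(index, "sales", "delivery"))
```

Once the index is built, a lookup no longer has to scan every document, which is what makes scattered text searchable.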

Let’s compare structured and unstructured data.

| Characteristics | Structured               | Unstructured                        |
|-----------------|--------------------------|-------------------------------------|
| Association     | Organized                | Scattered and dispersed             |
| Appearance      | Formally defined         | Free-form                           |
| Accessibility   | Easy to access and query | Hard to access and query            |
| Availability    | Percentagewise lower     | Percentagewise higher               |
| Analysis        | Efficient to analyze     | Additional preprocessing is needed  |

Unstructured data does not completely lack structure — you just have to ferret it out. Even the text inside digital files still has some structure associated with it, often showing up in the metadata — for example, document titles, dates the files were last modified, and their authors’ names.

The same thing applies to e-mails: The contents may be unstructured, but structured data is associated with them, such as the date and time they were sent, the names of their senders and recipients, and whether they contain attachments.
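To make the e-mail case concrete, the sketch below uses Python's standard `email` package to pull the structured fields out of a message while leaving the body as free-form text. The message itself, including the names and addresses, is invented for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message; addresses and content are illustrative only.
raw = """\
From: Ann Lee <ann@example.com>
To: Raj Patel <raj@example.com>
Subject: Notes from Tuesday
Date: Tue, 04 Mar 2014 09:15:00 -0500

Hi Raj, here are my rough notes. Let me know what you think!
"""

msg = message_from_string(raw)

# Headers give structured fields; the body stays free-form text.
record = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "sent_at": parsedate_to_datetime(msg["Date"]),
    "has_attachment": any(
        part.get_content_disposition() == "attachment" for part in msg.walk()
    ),
    "body": msg.get_payload().strip(),
}
print(record["sender"], record["sent_at"].isoformat(), record["has_attachment"])
```

Everything in `record` except `body` could go straight into a table column; the body would still need text processing before a model could use it.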

The dividing line between the two data types isn’t always clear. In general, you can find some attributes of unstructured data that can be considered structured data. Whether that structure reflects the content of the data, or is useful in data analysis, is often unclear.

For that matter, structured data can hold unstructured data within it. In a web form, for example, users may be asked to give feedback on a product by choosing an answer from multiple choices, but they may also be presented with a comment box where they can provide additional feedback.

The answers from multiple choices are structured; the comment field is unstructured because of its free-form nature. Such cases are best understood as a mix of structured and unstructured data. Most data is a composite of both.

For a successful predictive analytics project, both your structured and unstructured data must be combined in a logical format that can be analyzed.
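As a rough sketch of what combining the two might look like, the Python example below takes feedback-form responses (multiple-choice answers plus a free-form comment) and flattens each into one row of numeric features. The field names, the sample responses, and the tiny positive and negative word lists are all assumptions made for illustration; a real project would use a proper text analytics step in place of the word lists.

```python
# Hypothetical survey responses; field names and values are illustrative only.
responses = [
    {"rating": 5, "would_recommend": True, "comment": "Love it, works great!"},
    {"rating": 2, "would_recommend": False, "comment": "Broke after a week. Poor support."},
    {"rating": 4, "would_recommend": True, "comment": ""},
]

# Tiny hand-picked word lists, an assumption made only for this sketch.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"broke", "poor", "bad"}

def to_features(response):
    """Turn one response into a flat row of numeric features."""
    words = [w.strip(".,!?").lower() for w in response["comment"].split()]
    return {
        # Structured answers pass through almost unchanged.
        "rating": response["rating"],
        "would_recommend": int(response["would_recommend"]),
        # The free-form comment is reduced to simple derived numbers.
        "comment_length": len(words),
        "positive_words": sum(w in POSITIVE for w in words),
        "negative_words": sum(w in NEGATIVE for w in words),
    }

feature_rows = [to_features(r) for r in responses]
for row in feature_rows:
    print(row)
```

Every row now has the same columns, so the structured answers and the signals derived from the comment can be analyzed side by side.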