Unstructured Data in a Big Data Environment
Unstructured data is data that does not follow a specified format. If 20 percent of the data available to enterprises is structured, the other 80 percent is unstructured, so most of the data you will encounter is unstructured. Until recently, however, the technology didn’t support doing much with it other than storing it or analyzing it manually.
Sources of unstructured big data
Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data. Just as with structured data, unstructured data is either machine generated or human generated.
Here are some examples of machine-generated unstructured data:
Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.
Scientific data: This includes seismic imagery, atmospheric data, and high energy physics.
Photographs and video: This includes security, surveillance, and traffic video.
Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.
The following list shows a few examples of human-generated unstructured data:
Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages and location information.
Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
And the list goes on.
Some people believe that the term unstructured data is misleading because each document may contain its own specific structure or formatting based on the software that created it. However, what is internal to the document is truly unstructured.
By far, unstructured data is the largest piece of the data equation, and the use cases for unstructured data are rapidly expanding. On the text side alone, text analytics can analyze unstructured text, extract relevant data, and transform that data into structured information that can be used in various ways.
For example, a popular big data use case is social media analytics for use with high-volume customer conversations. In addition, unstructured data from call center notes, e-mails, written comments in a survey, and other documents is analyzed to understand customer behavior. This can be combined with social media from tens of millions of sources to understand the customer experience.
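As a rough illustration of turning unstructured text into structured information, the following Python sketch pulls a few fields out of a free-text call center note. The field names, the regular expressions, and the keyword lists are assumptions made for this example and do not come from any particular text analytics product; real systems typically rely on much richer linguistic models.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical keyword lists, chosen only for this illustration.
NEGATIVE_WORDS = {"broken", "late", "refund", "angry", "cancel"}
POSITIVE_WORDS = {"great", "thanks", "love", "fast", "helpful"}

# Simplified patterns: good enough for a demo, not for production use.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ORDER_PATTERN = re.compile(r"\border\s*#?\s*(\d+)", re.IGNORECASE)


@dataclass
class NoteRecord:
    """Structured view of one free-text note."""
    emails: List[str]
    order_ids: List[str]
    sentiment: str


def extract_record(note: str) -> NoteRecord:
    """Extract e-mail addresses, order numbers, and a crude sentiment label."""
    words = set(re.findall(r"[a-z']+", note.lower()))
    # Score = positive keyword hits minus negative keyword hits.
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        sentiment = "positive"
    elif score < 0:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return NoteRecord(
        emails=EMAIL_PATTERN.findall(note),
        order_ids=ORDER_PATTERN.findall(note),
        sentiment=sentiment,
    )


if __name__ == "__main__":
    sample = ("Customer jane.doe@example.com called about order #48213. "
              "Package arrived late and broken; she wants a refund.")
    print(extract_record(sample))
```

Once notes are reduced to records like this, they can be loaded into a table and queried alongside other structured customer data.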
The role of a CMS in big data management
Organizations store some unstructured data in databases. However, they also utilize enterprise content management systems (CMSs) that can manage the complete life cycle of content. This can include web content, document content, and other forms of media.
According to the Association for Information and Image Management (AIIM), a nonprofit organization that provides education, research, and best practices, Enterprise Content Management (ECM) comprises the strategies, methods, and tools used to capture, manage, store, preserve, and deliver content and documents related to organizational processes. ECM technologies include document management, records management, imaging, workflow management, web content management, and collaboration.
A whole industry has grown up around managing content, and many content management vendors are scaling out their solutions to handle large volumes of unstructured data. However, new technologies are also evolving to help support unstructured data and the analysis of unstructured data. Some of these support both structured and unstructured data. Some support real-time streams. These include technologies like Hadoop, MapReduce, and streaming.
Content management systems are no longer stand-alone solutions. Rather, they are likely to be part of an overall data management solution. For example, your organization may monitor Twitter feeds that can then programmatically trigger a CMS search.
The person who posted the tweet then gets an answer back that offers a location where he or she can find the product in question. The greatest benefit comes when this type of interaction happens in real time. It also illustrates the value of combining real-time unstructured data (the tweet), structured data (customer data about the person who tweeted), and semi-structured data (the actual content in the CMS).
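The flow just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the tweet arrives as a plain object rather than through any real Twitter API, and the customer table, the CMS documents, the store addresses, and the function names (search_cms, handle_tweet) are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    handle: str
    text: str  # unstructured input


# Hypothetical structured customer data, keyed by social media handle.
CUSTOMERS = {"@sam_r": {"name": "Sam", "city": "Austin"}}

# Hypothetical semi-structured CMS content: product pages with keyword
# tags and a per-city store address.
CMS_DOCUMENTS = [
    {"product": "trail shoes",
     "keywords": {"trail", "shoes", "running"},
     "stores": {"Austin": "410 Congress Ave", "Denver": "1600 Blake St"}},
    {"product": "rain jacket",
     "keywords": {"rain", "jacket", "waterproof"},
     "stores": {"Denver": "1600 Blake St"}},
]


def search_cms(text: str) -> Optional[dict]:
    """Return the document sharing the most keywords with the text, if any."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(CMS_DOCUMENTS, key=lambda doc: len(doc["keywords"] & words))
    return best if best["keywords"] & words else None


def handle_tweet(tweet: Tweet) -> Optional[str]:
    """Combine the tweet, customer data, and CMS content into a reply."""
    doc = search_cms(tweet.text)
    customer = CUSTOMERS.get(tweet.handle)
    if doc is None or customer is None:
        return None  # nothing relevant found, or unknown customer
    address = doc["stores"].get(customer["city"])
    if address is None:
        return None  # no store in the customer's city
    return (f"{tweet.handle} Hi {customer['name']}, you can find "
            f"{doc['product']} at {address}.")


if __name__ == "__main__":
    print(handle_tweet(Tweet("@sam_r", "Where can I buy trail running shoes?")))
```

In a real deployment, each lookup would be a call to a separate system (a stream processor, a customer database, and a CMS search service), which is why the pieces are kept as separate functions here.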
The reality is that you will probably use a hybrid approach to solve your big data problems. For example, it doesn’t make sense to move all your news content into Hadoop on your premises just because Hadoop is supposed to help manage unstructured data.