Text Analytics for Unstructured Big Data
Numerous methods exist for analyzing the unstructured data in a big data initiative. Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways.
The analysis and extraction processes draw on techniques that originated in Natural Language Processing (NLP), computational linguistics, knowledge discovery, data mining, information retrieval, statistics, and other computer science disciplines.
Sometimes an example can help to explain a complex topic. Suppose that you work for the marketing department in a wireless phone company. You’ve just launched two new calling plans — Plan A and Plan B — and you are not getting the uptake you wanted on Plan A. The unstructured text from the call center notes might give you some insight as to why this happened.
The key words and phrases in those notes provide the information you might need to understand why Plan A isn’t gaining rapid adoption. For example, the entity Plan A appears throughout the call center notes, indicating that the reports concern that plan.
The terms roll-over minutes, 4GB data, data plan, and expensive are evidence that an issue exists with roll-over minutes, the data plan, and the price. Words like ridiculous and stupid provide insight into the caller sentiment, which in this case is negative.
The text analytics process uses various techniques, such as analyzing sentence structure, to examine the unstructured text, extract information from it, and transform that information into structured data. The structured data extracted from the unstructured text is illustrated in Table 13-1.
|Customer||Entity||Issue||Sentiment|
|Cust XYZ||Plan A||Roll-over minutes||Neutral|
|Cust ABC||Plan A||Roll-over minutes||Negative|
|XXXX||Plan A||Data plan||Neutral|
|Cust XYT||Plan A||Data plan||Negative|
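The step from free-text notes to rows like those in Table 13-1 can be sketched in a few lines of Python. This is only a minimal illustration, assuming simple keyword matching rather than the sentence-structure analysis a real text analytics tool would use. The sample notes are invented for this sketch, and the names `PLANS`, `ISSUE_TERMS`, `NEGATIVE_WORDS`, and `extract_rows` are hypothetical. The term lists come from the example above.

```python
import re

# Entities and terms taken from the calling-plan example.
PLANS = ["Plan A", "Plan B"]
ISSUE_TERMS = {
    "roll-over minutes": "Roll-over minutes",
    "4gb data": "Data plan",
    "data plan": "Data plan",
    "expensive": "Price",
}
NEGATIVE_WORDS = {"ridiculous", "stupid"}


def extract_rows(customer, note):
    """Turn one free-text note into (customer, entity, issue, sentiment) rows.

    Naive substring matching: a real system would parse sentence structure.
    """
    text = note.lower()
    plans = [plan for plan in PLANS if plan.lower() in text]

    issues = []
    for term, issue in ISSUE_TERMS.items():
        if term in text and issue not in issues:
            issues.append(issue)

    # Sentiment is Negative if any known negative word appears, else Neutral.
    words = set(re.findall(r"[a-z]+", text))
    sentiment = "Negative" if words & NEGATIVE_WORDS else "Neutral"

    return [(customer, plan, issue, sentiment) for plan in plans for issue in issues]


# Invented notes, for illustration only.
notes = {
    "Cust ABC": "Caller says the roll-over minutes on Plan A are ridiculous.",
    "Cust XYZ": "Caller asked how roll-over minutes work on Plan A.",
}

rows = [row for customer, note in notes.items() for row in extract_rows(customer, note)]
for row in rows:
    print(row)
```

Each note yields one row per plan and issue it mentions, which is the shape of the structured table above.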
You may look at Table 13-1 and say, “But I could have figured that out by looking at the call center records.” However, these rows are just a small subset of the information being recorded by thousands of call center agents. No individual agent can sense a broad trend in the problems with each plan the company offers.
Agents have neither the time nor the mandate to share this information with all the other call center agents, who may be getting similar numbers of calls about Plan A. However, after this information is aggregated and processed using text analytics algorithms, a trend may emerge from this unstructured data. That’s what makes text analytics so powerful.
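As a rough illustration of how aggregation can surface a trend, the sketch below counts the share of negative mentions for each plan and issue. It uses the four rows from Table 13-1 as input. The function name `negative_share` is hypothetical, and a real deployment would run over thousands of extracted rows rather than four.

```python
from collections import Counter

# The structured rows from Table 13-1: (customer, entity, issue, sentiment).
rows = [
    ("Cust XYZ", "Plan A", "Roll-over minutes", "Neutral"),
    ("Cust ABC", "Plan A", "Roll-over minutes", "Negative"),
    ("XXXX", "Plan A", "Data plan", "Neutral"),
    ("Cust XYT", "Plan A", "Data plan", "Negative"),
]


def negative_share(rows):
    """Return the fraction of negative mentions per (entity, issue) pair."""
    total = Counter()
    negative = Counter()
    for _customer, entity, issue, sentiment in rows:
        total[(entity, issue)] += 1
        if sentiment == "Negative":
            negative[(entity, issue)] += 1
    return {key: negative[key] / total[key] for key in total}


for (entity, issue), share in negative_share(rows).items():
    print(f"{entity} / {issue}: {share:.0%} negative")
```

No single agent sees these totals, but once the rows from every agent are combined, a plan and issue pair with an unusually high negative share stands out.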
Search is about retrieving a document based on what end users already know they are looking for. Text analytics is about discovering information. While text analytics differs from search, it can augment search techniques. For example, text analytics combined with search can be used to provide better categorization or classification of documents and to produce abstracts or summaries of documents.
Four technologies are worth comparing: query, data mining, search, and text analytics. Query and search, which appear in the left column of the comparison table, are both about retrieval. For example, an end user could query a database to find out how many customers stopped using the company’s services in the past month.
The query would return a single number. Only by asking more and different queries will the end user get the information required to determine why customers are leaving. Likewise, keyword search allows the end user to find the documents that contain the names of a company’s competitors. The search would return a group of documents. Only by reading the documents would the end user come up with any relevant answers.
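The two retrieval cases just described can be sketched as follows. This is an illustrative sketch only: the table layout, the customer records, the documents, and the competitor name "Acme Wireless" are all invented, and `count_churned` and `keyword_search` are hypothetical helper names. It uses Python's built-in `sqlite3` module with an in-memory database.

```python
import sqlite3


def count_churned(conn, month):
    """Query: return a single number, the customers who left in a given month."""
    cursor = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE churn_month = ?", (month,)
    )
    return cursor.fetchone()[0]


def keyword_search(documents, keywords):
    """Search: return the IDs of documents that contain any of the keywords."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        doc_id
        for doc_id, text in documents.items()
        if any(keyword in text.lower() for keyword in lowered)
    ]


# Invented structured data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, churn_month TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Cust ABC", "2024-05"), ("Cust XYT", "2024-05"), ("Cust XYZ", None)],
)

# Invented unstructured documents, for illustration only.
documents = {
    "note-1": "Caller is switching to Acme Wireless next month.",
    "note-2": "Caller asked about roll-over minutes.",
    "note-3": "Caller compared Plan A pricing with Acme Wireless.",
}

print(count_churned(conn, "2024-05"))                 # a single number
print(keyword_search(documents, ["Acme Wireless"]))   # a list of document IDs
```

In both cases the result is raw material: the number does not say why customers left, and the matching documents still have to be read by a person.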
|Data type||Retrieval||Insight|
|Structured||Query: Returns data||Data mining: Insight from structured data|
|Unstructured||Search: Returns documents||Text analytics: Insight from text|
The technologies on the left return pieces of information and require human interaction to synthesize and analyze that information. The technologies on the right — data mining and text analytics — deliver insight much more quickly. Hopefully, the value of text analytics to your organization is becoming clear.