Analysis and Extraction Techniques for Big Data
In general, text analytics solutions for big data use a combination of statistical and Natural Language Processing (NLP) techniques to extract information from unstructured data. NLP is a broad and complex field that has developed over several decades.
A primary goal of NLP is to derive meaning from text. Natural Language Processing generally makes use of linguistic concepts such as grammatical structures and parts of speech. Often, the idea behind this type of analytics is to determine who did what to whom, when, where, how, and why.
NLP performs analysis on text at different levels:
Lexical/morphological analysis examines the characteristics of an individual word, including prefixes, suffixes, roots, and parts of speech (noun, verb, adjective, and so on). This information contributes to understanding what the word means in the context of the text provided. Lexical analysis depends on a dictionary, thesaurus, or any list of words that provides information about those words.
Syntactic analysis uses grammatical structure to dissect the text and put individual words into context. Here you are widening your gaze from a single word to the phrase or the full sentence. This step might diagram the relationship between words (the grammar) or look for sequences of words that form correct sentences or for sequences of numbers that represent dates or monetary values.
Semantic analysis determines the possible meanings of a sentence. This can include examining word order and sentence structure and disambiguating words by relating the syntax found in the phrases, sentences, and paragraphs.
Discourse-level analysis attempts to determine the meaning of text beyond the sentence level.
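The first two levels above can be illustrated with a minimal Python sketch. It is not a real NLP pipeline: the tiny LEXICON, the PREFIXES and SUFFIXES lists, and the function names analyze_word and find_patterns are all assumptions made for illustration. The first function splits a word into prefix, root, and suffix by looking the root up in a word list, and the second looks for character sequences that represent dates or monetary values.

```python
import re

# Hypothetical mini-lexicon (root word -> part of speech). A real system
# would rely on a full dictionary or thesaurus instead.
LEXICON = {"happy": "adjective", "connect": "verb", "pay": "verb", "service": "noun"}
PREFIXES = ("un", "dis", "re")
SUFFIXES = ("ed", "ing", "ment", "s")

def analyze_word(word):
    """Split a word into prefix, root, and suffix using the lexicon."""
    word = word.lower()
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        for suffix in ("",) + SUFFIXES:
            if suffix and not word.endswith(suffix):
                continue
            root = word[len(prefix):len(word) - len(suffix)]
            if root in LEXICON:
                return {"prefix": prefix, "root": root,
                        "suffix": suffix, "root_pos": LEXICON[root]}
    return None  # the word cannot be explained by this lexicon

# Simplified patterns: English month names and dollar amounts only.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

def find_patterns(text):
    """Return the date and money sequences found in the text."""
    return {"dates": DATE_RE.findall(text), "money": MONEY_RE.findall(text)}
```

For example, analyze_word("disconnected") yields the prefix "dis", the root "connect", and the suffix "ed", while find_patterns picks the date and the dollar amount out of a sentence such as "The invoice dated March 3, 2007 was for $1,250.00." Semantic and discourse-level analysis need far more than lookups and patterns, so they are left out of this sketch.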
Understanding the information extracted from big data
These NLP techniques, combined with other statistical or linguistic techniques that automate the tagging and markup of text documents, can extract the following kinds of information:
Terms: Another name for keywords.
Entities: Often called named entities, these are specific examples of abstractions. Examples are names of persons, names of companies, geographical locations, contact information, dates, times, currencies, titles and positions, and so on. For example, text analytic software can extract the entity Jane Doe as a person referred to in the text being analyzed. The entity March 3, 2007 can be extracted as a date, and so on.
Facts: Also called relationships, facts indicate the who/what/where relationships between two entities. "John Smith is the CEO of Company Y" and "Aspirin reduces fever" are examples of facts.
Events: While some experts use the terms fact, relationship, and event interchangeably, others distinguish between events and facts, stating that events usually contain a time dimension and often cause facts to change. Examples include a change in management within a company or the status of a sales process.
Concepts: These are sets of words and phrases that indicate a particular idea or topic with which the user is concerned. For example, the concept "unhappy customer" may include the words "angry," "disappointed," and "confused" and the phrases "disconnect service," "didn’t call back," and "waste of money," among many others. Thus the concept "unhappy customer" can be extracted without the words "unhappy" or "customer" appearing in the text.
Sentiments: Sentiment analysis is used to identify viewpoints or emotions in the underlying text. Some techniques do this by classifying text as, for example, subjective (opinion) or objective (fact), using machine-learning or NLP techniques. Sentiment analysis has become very popular in voice-of-the-customer applications.
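As a rough illustration of concept extraction, the Python sketch below uses the "unhappy customer" example from the list above. The CONCEPTS table and the extract_concepts function are hypothetical names, and the matching is deliberately naive (lowercased word lookup and plain substring search for phrases). Production text analytics tools typically add stemming, statistical models, or both.

```python
import re

# Cue words and phrases for each concept, taken from the example above.
CONCEPTS = {
    "unhappy customer": {
        "words": {"angry", "disappointed", "confused"},
        "phrases": ["disconnect service", "didn't call back", "waste of money"],
    }
}

def extract_concepts(text):
    """Return each concept found in the text with the cues that triggered it."""
    # Lowercase and normalize curly apostrophes so "didn’t" matches "didn't".
    normalized = text.lower().replace("\u2019", "'")
    tokens = set(re.findall(r"[a-z']+", normalized))
    found = {}
    for concept, cues in CONCEPTS.items():
        hits = sorted(cues["words"] & tokens)
        hits += [phrase for phrase in cues["phrases"] if phrase in normalized]
        if hits:
            found[concept] = hits
    return found
```

Calling extract_concepts on a comment such as "I was so disappointed. They didn’t call back, what a waste of money." reports the concept "unhappy customer" even though neither "unhappy" nor "customer" appears in the comment.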
Big data taxonomies
Taxonomies are often critical to text analytics. A taxonomy is a method for organizing information into hierarchical relationships. It is sometimes referred to as a way of organizing categories. Because a taxonomy defines the relationships between the terms a company uses, it makes it easier to find and then analyze text.
For example, a telecommunications service provider offers both wired and wireless service. Within the wireless service, the company may support cellular phones and Internet access. The company may then have two or more ways of categorizing cellular phone service, such as plans and phone types. The taxonomy could reach all the way down to the parts of a phone itself.
Taxonomies can also use synonyms and alternate expressions, recognizing that cellphone, cellular phone, and mobile phone all refer to the same thing. These taxonomies can be quite complex and can take a long time to develop.
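One simple way to picture such a taxonomy is as a nested dictionary plus a synonym table, as in the Python sketch below. The structure mirrors the telecommunications example above. The root node "service", the TAXONOMY and SYNONYMS names, and the find_path function are assumptions made for illustration, not a standard representation.

```python
# Each key is a category; its value holds the subcategories beneath it.
TAXONOMY = {
    "service": {
        "wired": {},
        "wireless": {
            "cellular phone": {"plans": {}, "phone types": {}},
            "internet access": {},
        },
    }
}

# Alternate expressions mapped to the preferred taxonomy term.
SYNONYMS = {"cellphone": "cellular phone", "mobile phone": "cellular phone"}

def find_path(term, tree=TAXONOMY, path=()):
    """Return the path from the root to a term as a tuple, or None."""
    term = SYNONYMS.get(term.lower(), term.lower())
    for node, children in tree.items():
        current = path + (node,)
        if node == term:
            return current
        found = find_path(term, children, current)
        if found:
            return found
    return None
```

With this layout, find_path("mobile phone") and find_path("cellphone") both resolve to the same place in the hierarchy, under "service" and then "wireless", so text that uses any of the three expressions can be grouped and analyzed together.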