Social Sentiment Analysis with Hadoop - dummies

By Dirk deRoos

Social sentiment analysis is easily the most overhyped of the Hadoop use cases, which should be no surprise, given how constantly connected and expressive today’s population is. This use case leverages content from forums, blogs, and other social media resources to develop a sense of what people are doing (for example, life events) and how they’re reacting to the world around them (sentiment).

Because text-based data doesn’t naturally fit into a relational database, Hadoop is a practical place to explore and run analytics on this data.

Language is difficult to interpret, even for human beings at times — especially if you’re reading text written by people in a social group that’s different from your own. This group of people may be speaking your language, but their expressions and style are completely foreign, so you have no idea whether they’re talking about a good experience or a bad one.

For example, if you hear the word bomb in reference to a movie, it might mean that the movie was bad (or good, if you’re part of the youth movement that interprets “It’s da bomb” as a compliment); of course, if you’re in the airline security business, the word bomb has quite a different meaning. The point is that language is used in many variable ways and is constantly evolving.

When you analyze sentiment on social media, you can choose from multiple approaches. The basic method programmatically parses the text, extracts strings, and applies rules. In simple situations, this approach is reasonable. But as requirements evolve and rules become more complex, hand-coded text extraction quickly becomes hard to maintain and even harder to optimize for performance.

Grammar- and rules-based approaches to text processing are computationally expensive, which is an important consideration in large-scale extraction in Hadoop. The more involved the rules (and involved rules are inevitable for complex purposes such as sentiment extraction), the more processing is needed.
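To make the rules-based approach concrete, here is a minimal sketch in Python. The word lists and the single negation rule are assumptions chosen for illustration only; a real rule set would be far larger, which is exactly where the maintenance and processing costs come from.

```python
import re

# Hypothetical word lists, chosen only for illustration.
POSITIVE = {"great", "love", "awesome", "good"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}
NEGATIONS = {"not", "never", "no"}


def rule_based_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' using hand-written rules."""
    # Parse the text into lowercase word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Rule: a negation word directly before a sentiment word flips it.
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Notice that a phrase such as "da bomb" comes back as neutral because no rule covers it. Every new expression means another rule to write, test, and run against every post.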

Alternatively, a statistics-based approach is becoming increasingly common for sentiment analysis. Rather than write complex rules by hand, you can use the classification-oriented machine-learning models in Apache Mahout. The catch here is that you’ll need to train your models with examples of positive and negative sentiment. The more training data you provide (for example, text from tweets along with your classification of each one), the more accurate your results tend to be.
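The idea behind a trained classifier can be sketched in a few lines of plain Python. The following is not Mahout code and doesn’t use Mahout’s API; it’s a tiny naive Bayes classifier (one common classification technique) written from scratch, and the function names and training sentences are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Build word and label counts from (text, label) training pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "word_counts": word_counts,
        "label_counts": label_counts,
        "vocabulary": vocabulary,
    }


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Start from the prior: how common this label is in the training set.
        score = math.log(doc_count / total_docs)
        for token in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words don't zero out a label.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training examples.
TRAINING = [
    ("I love this airline, great service", "positive"),
    ("what a great flight and friendly crew", "positive"),
    ("terrible delay, I hate waiting", "negative"),
    ("awful food and a terrible seat", "negative"),
]
```

You would call `train(TRAINING)` once and then pass the resulting model to `classify` for each new post. With only four examples the model knows very little, which illustrates why the amount of labeled training data matters so much.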

The use case for social sentiment analysis can be applied across a wide range of industries. For example, consider food safety: Trying to predict or identify the outbreak of foodborne illnesses as quickly as possible is extremely important to health officials.

The following figure shows a Hadoop-anchored application that ingests tweets using extractors based on the potential illness: FLU or FOOD POISONING.


Do you see the generated heat map that shows the geographical location of the tweets? One characteristic of data in a world of big data is that most of it is spatially enriched: It has locality information (and temporal attributes, too). In this case, each tweet’s location was inferred by looking up the location published in the poster’s Twitter profile.

As it turns out, lots of Twitter accounts have geographic locations as part of their public profiles (as well as disclaimers clearly stating that their thoughts are their own as opposed to speaking for their employers).
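A heat map like this one boils down to two steps: tag each tweet with an illness category, and then count the tagged tweets by location. Here is a rough sketch of those steps, assuming each tweet is a dictionary with a `text` field and an optional `profile_location` field. Those field names and the keyword lists are hypothetical, not the actual extractors used in the application or fields from Twitter’s API.

```python
import re
from collections import Counter

# Hypothetical keyword lists for the two extractors.
EXTRACTORS = {
    "FLU": ["flu", "fever", "chills"],
    "FOOD POISONING": ["food poisoning", "stomach cramps", "vomiting"],
}


def tag_illness(text):
    """Return the first illness label whose keyword appears as a whole word."""
    lowered = text.lower()
    for label, keywords in EXTRACTORS.items():
        for keyword in keywords:
            # Word boundaries keep "flu" from matching inside "fluent".
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return label
    return None


def tally_by_location(tweets):
    """Count tagged tweets per (illness label, profile location) pair."""
    counts = Counter()
    for tweet in tweets:
        label = tag_illness(tweet["text"])
        # Profiles with no published location can't be placed on the map.
        location = (tweet.get("profile_location") or "").strip()
        if label and location:
            counts[(label, location)] += 1
    return counts
```

The resulting counts are the raw material for the map. Turning a free-text location such as a city name into coordinates is a separate geocoding step that this sketch leaves out.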

How good a prediction engine can social media be for the outbreak of the flu or a food poisoning incident? Consider the anonymized sample data shown. You can see that social media signals trumped all other indicators for predicting a flu outbreak in a specific U.S. county during the late summer and into early fall.


This example shows another benefit that accrues from analyzing social media: It gives you an unprecedented opportunity to look at attribute information in posters’ profiles. Granted, what people say about themselves in their Twitter profiles is often incomplete (for example, the location code isn’t filled in) or not meaningful (the location code might say cloud nine).

But you can learn a lot about people over time, based on what they say. For example, a client may have tweeted (posted on Twitter) the announcement of the birth of her baby, shared an Instagram picture of her latest painting, or posted on Facebook that she can’t believe Walter White’s behavior in last night’s Breaking Bad finale.

In this example, your company can extract a life event that populates a family graph (a new child is a valuable update for a person-based Master Data Management profile), a hobby (painting), and an interest attribute (she loves the show Breaking Bad).

By analyzing social data in this way, you have the opportunity to flesh out personal attributes with information such as hobbies, birthdays, life events, geographical locations (country, state, and city, for example), employer, gender, marital status, and more.
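As a rough illustration of how posts can turn into profile attributes, the following sketch applies a handful of patterns to a person’s posts and collects what it finds. The attribute names, the patterns, and the `enrich_profile` function are all assumptions made up for this example; a production system would need much richer extraction than three regular expressions.

```python
import re

# Hypothetical rules: (profile attribute, pattern to look for, value to record).
ATTRIBUTE_RULES = [
    ("life_events", re.compile(r"\b(our|my) (new )?baby\b|\bgave birth\b"), "new child"),
    ("hobbies", re.compile(r"\bmy (latest|new) painting\b"), "painting"),
    ("interests", re.compile(r"\bbreaking bad\b"), "Breaking Bad"),
]


def enrich_profile(profile, posts):
    """Add attributes found in the posts to the profile dictionary."""
    for post in posts:
        lowered = post.lower()
        for attribute, pattern, value in ATTRIBUTE_RULES:
            if pattern.search(lowered):
                # Use a set so repeated mentions don't create duplicates.
                profile.setdefault(attribute, set()).add(value)
    return profile
```

Run over the three posts described earlier, a sketch like this would record a life event, a hobby, and an interest on the same profile, even though each came from a different social network.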

Assume for a minute that you’re the CIO of an airline. You can use the postings of happy or angry frequent travelers to not only ascertain sentiment but also round out customer profiles for your loyalty program using social media information.

Imagine how much better you could target potential customers with the information that was just shared — for example, an e-mail telling the client that Season 5 of Breaking Bad is now available on the plane’s media system or announcing that children under the age of two fly for free.

It’s also a good example of how systems of record (say, sales or subscription databases) can meet systems of engagement (say, support channels). Though the loyalty members’ redemption and travel history is in a relational database, the system of engagement can update those records (for example, by filling in a column with an attribute learned from social media).
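Here is a minimal sketch of that kind of update, using Python’s built-in `sqlite3` module as a stand-in for the airline’s relational database. The `loyalty_members` table, its columns, and the sample row are hypothetical and exist only to show the idea.

```python
import sqlite3


def create_demo_database():
    """Create an in-memory system of record with one hypothetical member."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE loyalty_members ("
        "member_id INTEGER PRIMARY KEY, "
        "name TEXT, "
        "miles_redeemed INTEGER, "
        "interests TEXT)"
    )
    connection.execute(
        "INSERT INTO loyalty_members (member_id, name, miles_redeemed) "
        "VALUES (?, ?, ?)",
        (1, "Sample Member", 25000),
    )
    connection.commit()
    return connection


def record_interest(connection, member_id, interest):
    """Write an interest learned from social media into the member's record."""
    # Using the connection as a context manager commits on success
    # and rolls back if the statement raises an error.
    with connection:
        cursor = connection.execute(
            "UPDATE loyalty_members SET interests = ? WHERE member_id = ?",
            (interest, member_id),
        )
    # rowcount tells the caller whether a matching member was found.
    return cursor.rowcount
```

Calling `record_interest(connection, 1, "Breaking Bad")` fills in the `interests` column for that member, leaving the redemption history untouched. The system of engagement adds to the record without replacing what the system of record already knows.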