Techniques Used in Coding Jobs to Analyze Big Data

By Nikhil Abraham

If you are hoping to get a job in coding, you might be asked to analyze big data. When people think of data analysis, many imagine complex mathematical models. Those equations certainly have their place, but there are other data analysis techniques. This article describes some of the techniques used to analyze large data sets.

Summarizing data trends and examining outliers

One simple data analysis technique is to graph the data, and see whether there are any extreme outliers or interesting trends. The challenge with this task is finding all the relevant data, and knowing enough about the underlying data set to spot anomalies.

For example, the New York City Department of Health assigns each of its 24,000 restaurants a letter grade based on health code compliance: A (13 or fewer points), B (14 to 27 points), or C (28 or more points, which can close a restaurant until the violations are corrected). Ben Wellington, a data analyst and blogger, charted all the restaurant letter grades and noticed that three times as many restaurants scored 13 points, the highest point total that still earns an A, as scored 14 points, a B grade.

In other words, health inspectors may be inflating grades for restaurants on the border between an A and a B. The finding generated newspaper coverage and a response from the New York City Department of Health.

Three times as many NYC restaurants scored 13 points as scored 14 points.
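
You can run this kind of sanity check in a few lines of Python. Here's a minimal sketch, assuming a hypothetical inspections.csv file with a numeric score column:

    # Plot a histogram of inspection scores to spot anomalies by eye.
    # "inspections.csv" and its "score" column are hypothetical.
    import pandas as pd
    import matplotlib.pyplot as plt

    scores = pd.read_csv("inspections.csv")["score"]

    # One bin per point value makes a cliff between 13 and 14 easy to see.
    scores.plot(kind="hist", bins=range(0, 60), edgecolor="black")
    plt.axvline(13.5, color="red", linestyle="--")  # the A/B cutoff
    plt.xlabel("Inspection score (points)")
    plt.ylabel("Number of restaurants")
    plt.show()

    # Compare the counts on either side of the A/B boundary directly.
    print(scores.value_counts().reindex([13, 14]))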

Segmenting and aggregating data

Another data analysis technique is to filter your data for certain criteria, and then aggregate the data to see whether there’s an interesting story.

Google, for example, created a flu map called Flu Trends by filtering all of its search queries for flu-related terms. It aggregated the queries by location and highlighted abnormal increases. Traditionally, the U.S. Centers for Disease Control and Prevention (CDC) monitors flu outbreaks through reports of physician visits. In 2009, Flu Trends tracked the flu outbreak in the U.S. in real time, two weeks before the official CDC reports.

Google Flu Trends predicted a flu outbreak before official reports did.
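
The filter-then-aggregate pattern is straightforward to express in Python. In this rough sketch, the query log file, its columns, and the list of flu terms are all invented for illustration:

    # Filter a (hypothetical) search-query log for flu-related terms,
    # then aggregate by region and week to flag abnormal spikes.
    import pandas as pd

    FLU_TERMS = ["flu", "influenza", "fever", "chills"]

    queries = pd.read_csv("query_log.csv", parse_dates=["timestamp"])

    # Step 1: filter -- keep only queries mentioning a flu-related term.
    mask = queries["query"].str.contains("|".join(FLU_TERMS), case=False)
    flu_queries = queries[mask]

    # Step 2: aggregate -- count flu queries per region per week.
    weekly = (flu_queries
              .groupby(["region", pd.Grouper(key="timestamp", freq="W")])
              .size()
              .rename("flu_query_count")
              .reset_index())

    # Step 3: flag weeks far above a region's typical query volume.
    stats = weekly.groupby("region")["flu_query_count"].agg(["mean", "std"])
    weekly = weekly.join(stats, on="region")
    spikes = weekly[weekly["flu_query_count"] > weekly["mean"] + 2 * weekly["std"]]
    print(spikes)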

Combining two or more data sets

The mashup of two different data sets can produce unexpected and interesting results. Whenever you combine data sets, the challenges are cleaning the data and deciding how to join it.

For example, more than half of New York City's drains were clogged, and the city wanted to find the restaurants that were illegally dumping grease into its sewers. Ordinarily, the city would be able to inspect only a fraction of its 20,000 restaurants. Instead, city data analysts mapped the locations of the clogged drains against the locations of restaurants that had no waste-hauling service.

Although mapping locations may sound simple, New York agencies report location in different ways, such as by GPS coordinates, block, or parcel, so the data had to be cleaned before the two sets could be joined. The resulting list of suspect restaurants was small enough for city inspectors to tackle, and the initiative resolved 95 percent of the illegal dumping.
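
A sketch of the combining step in Python follows. The file names and columns are hypothetical, and rounding coordinates onto a coarse grid stands in for the city's real (and messier) location-matching work:

    # Join clogged-drain locations with restaurants that lack grease hauling.
    # Files and columns are made up; "has_hauler" is assumed to be boolean.
    import pandas as pd

    drains = pd.read_csv("clogged_drains.csv")       # columns: lat, lon
    restaurants = pd.read_csv("restaurants.csv")     # columns: name, lat, lon, has_hauler

    def add_grid_cell(df):
        # Round coordinates to ~100 m cells so locations reported in
        # different ways can be compared on a common key.
        return df.assign(cell=list(zip(df["lat"].round(3), df["lon"].round(3))))

    suspects = add_grid_cell(restaurants[~restaurants["has_hauler"]])
    clogged = add_grid_cell(drains)

    # Restaurants with no waste service sitting near a clogged drain.
    shortlist = suspects.merge(clogged[["cell"]].drop_duplicates(), on="cell")
    print(shortlist[["name", "lat", "lon"]])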

Modeling

Much of the advanced big data work, and the work you'll likely do if you become a data analyst, involves some type of modeling. A model is a mathematical formula used to represent real-world data, and many different types of models exist.

Models typically either predict some future value or classify data into categories. For example, models can predict how the U.S. Supreme Court will rule on a particular case, or which movie you should watch next given the movies you've already seen. Models can also classify whether an email you just received is spam or legitimate, and find where the faces are in pictures of people.
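
As a taste of the classification side, here is a toy spam model in Python with scikit-learn; the handful of training messages is invented, and a real model would need far more data:

    # A toy text-classification model: label messages as spam or legitimate.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    messages = [
        "win a free prize now", "cheap meds online",    # spam examples
        "lunch tomorrow?", "meeting moved to 3pm",      # legitimate examples
    ]
    labels = ["spam", "spam", "legit", "legit"]

    # Turn each message into word counts, then fit a Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(messages, labels)

    print(model.predict(["claim your free prize", "see you at lunch"]))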

Kaggle.com hosts competitions in which anyone can practice data analysis skills on real data sets. Some entrants use extremely complex models and techniques, but the people who consistently win Kaggle competitions report that simple models usually do best.

Improving the models used to predict judicial opinions and classify email requires human intervention. Machine learning is the term for a set of models that learn and improve their performance automatically. There are two categories of learning, with a short code sketch of each after this list:

  • Supervised learning: Data with a known structure and relationship is examined.

    For example, the book Moneyball chronicles how Billy Beane, the general manager of the Oakland Athletics, used a player's on-base percentage and walks as predictors of how many runs the player would score in a game.

  • Unsupervised learning: Data without a known structure or relationship is analyzed to try to find some relationship.

    For example, suppose that you run a dating website and want to divide your users into three to six groups so that you can match people with similar interests. Before looking at people's profiles, you won't know how many groups you'll end up with or what they will be. After you start dividing your users, you might find that you have a group of people who work at startups, a group of middle-aged people interested in art and theater, and a group that likes running and skiing.
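
Here is a minimal sketch of both flavors in Python with scikit-learn. All the numbers are invented, and real work would use far more data and features:

    # Supervised learning: fit a line predicting runs from on-base
    # percentage, in the spirit of the Moneyball example (numbers invented).
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    obp = [[0.300], [0.320], [0.350], [0.380], [0.400]]  # known inputs
    runs = [60, 68, 80, 92, 100]                         # known outputs
    reg = LinearRegression().fit(obp, runs)
    print(reg.predict([[0.360]]))  # predicted runs for a .360 OBP hitter

    # Unsupervised learning: group dating-site users by (hypothetical)
    # interest scores, with no labels, and let k-means find the groups.
    users = [
        [9, 1, 2], [8, 2, 1],   # startup-minded
        [1, 9, 8], [2, 8, 9],   # art and theater
        [2, 1, 9], [1, 2, 8],   # running and skiing
    ]
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(users)
    print(km.labels_)  # which of the three groups each user landed in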