Data Discovery and Sandboxes in Hadoop - dummies

By Dirk deRoos

Data discovery is becoming an increasingly important activity for organizations that rely on their data to be a differentiator. Today, that describes most businesses, as the ability to see trends and extract meaning from available data sets applies to almost any industry.

This requires two critical components: analysts with the creativity to think of novel ways of analyzing data sets and asking new questions (these analysts are often called data scientists), and access for those analysts to as much data as possible.

Consider the traditional approach to analytics in today’s IT landscape: The business user community now typically determines the business questions to ask — they submit a request, and the IT team builds a system that answers specific questions. From a technical perspective, because this work has traditionally been done in a relational database, it has been the IT team’s responsibility to build schemas, remove data duplication, and so on.

The IT team invests a lot of time in making this data queryable and in quickly answering the preplanned questions that the business unit wants answered. This is why relational databases are typically described as schema-on-write: you have to do a lot of work to define the data’s structure before you can write to the database.

(In many cases, the amount of work is worth the investment; however, in a world of big data, the value and quality of many newer types of data you work with is unknown.)

This relational database approach is well suited to many common business processes, such as monitoring sales by geography, product, or channel; extracting insight from customer surveys; and running cost and profitability analyses. Basically, these are questions that are asked time and time again.

Data is typically highly structured and is most likely highly trusted in this environment; this activity is guided analytics.


As an analogy, it’s as though your 8-year-old child is taking a break for recess at school. For the most part, she can do whatever she wants within the school’s grounds — as long as she remains within the fenced perimeter; however, she can’t jump the fence to discover what’s on the outside. Specifically, your child can explore a known, safeguarded (within the schema) area and analyze whatever can be found within that area.

Now imagine that your analytics environment has a discovery zone. In this scenario, IT delivers data (it’s likely not to be fully trusted, and it’s likely “dirty”) on a flexible discovery platform for business users to ask virtually any question they want.

In the analogy, your child is allowed to climb the schoolyard fence (this area is schema-less), venture into the forest, and return with whatever items she discovers. (Of course, in the IT world, you don’t have to worry about business users getting lost or getting poison ivy.)
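The contrast between the fenced schoolyard (schema-on-write) and the schema-less discovery zone can be made concrete with a small sketch. The approach used on the discovery side is commonly called schema-on-read: raw records are stored as they arrive, and structure is applied only when a question is asked. The sketch below is illustrative only. The sample records, field names, and function names are all hypothetical, Python's built-in sqlite3 module stands in for a relational database, and a plain list of strings stands in for raw files stored in Hadoop.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a discovery zone: one JSON
# document per line, with inconsistent fields and one malformed record.
RAW_LINES = [
    '{"region": "east", "product": "widget", "amount": 120.0}',
    '{"region": "west", "amount": "75.5", "coupon": "SPRING"}',
    'not json at all',
    '{"region": "east", "product": "gadget", "amount": 30}',
]


def load_schema_on_write(lines):
    """Guided analytics: enforce a fixed schema before anything is stored."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "region TEXT NOT NULL, product TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = 0
    for line in lines:
        try:
            rec = json.loads(line)
            conn.execute(
                "INSERT INTO sales VALUES (?, ?, ?)",
                (rec["region"], rec["product"], float(rec["amount"])),
            )
        except (ValueError, KeyError, sqlite3.IntegrityError):
            # Records that do not fit the schema never reach the table.
            rejected += 1
    return conn, rejected


def total_by_region_on_read(lines):
    """Discovery: interpret raw lines at query time, skipping unusable ones."""
    totals = {}
    for line in lines:
        try:
            rec = json.loads(line)
            amount = float(rec["amount"])
        except (ValueError, KeyError, TypeError):
            continue
        region = rec.get("region", "unknown")
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def coupons_on_read(lines):
    """A question nobody planned for: which coupon codes show up at all?"""
    counts = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue
        coupon = rec.get("coupon")
        if coupon is not None:
            counts[coupon] = counts.get(coupon, 0) + 1
    return counts
```

In this sketch, the schema-on-write loader rejects the record that lacks a product, so the coupon field it carried is never stored, and a later question about coupons cannot be answered from the table. The schema-on-read functions work from the raw lines, so a new question only needs a new function, at the cost of dealing with dirty data every time a question is asked.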

If you think about it, data discovery mirrors in some respects the evolution of gold mining. During the gold rush years of old, gold strikes would spark resource investment because someone discovered gold — it was visible to the naked eye, it had clear value, and it therefore warranted the investment.

Fifty years ago, no one could afford to mine low-grade ore for gold because cost-effective or capable technology didn’t exist (equipment to move and handle vast amounts of ore wasn’t available) and rich-grade ore was still available (gold was easier to find than it is today). Quite simply, it wasn’t cost effective (or even possible) to work through the noise (low-grade ore) to find the signals (the gold).

With Hadoop, IT shops now have the capital equipment to process millions of tons of ore (data with a low value per byte) to find gold that’s nearly invisible to the naked eye (data with high value per byte). And that’s exactly what discovery is all about.

It’s about having a low-cost, flexible repository where next-to-zero investment is made to enrich the data until a discovery is made. After a discovery is made, it might make sense to ask for more resources (to mine the gold discovery) and formalize it into an analytics process that can be deployed in a data warehouse or specialized data mart.

When insights are made in the discovery zone, that’s likely a good time to engage the IT department and formalize a process, or have those folks lend assistance to more in-depth discovery. In fact, this new pattern could even move into the area of guided analytics.

The point is that IT provisioned the discovery zone for business users to ask and invent questions they haven’t thought about before. Because that zone resides in Hadoop, it’s agile and allows for users to venture into the wild blue yonder.

Notice that the figure has a sandbox zone. In some reference architectures, this zone is combined with the discovery zone. Keep these zones separate, because the sandbox is used by application developers and IT shops to do their own research, test applications, and, perhaps, formalize conclusions and findings in the discovery zone when IT assistance is required after a potential discovery is made.

The reference architecture is flexible and can easily be tweaked. Nothing is cast in stone: you can take what you need, leave what you don’t, and add your own nuances.

For instance, some organizations may choose to co-locate all zones in a single Hadoop cluster; some may choose to leverage a single cluster designed for multiple purposes; and others may physically separate the zones.