Data Lakes For Dummies
A data lake is an enterprise-scale home for analytical data from all corners of your company or governmental agency. No matter what your analytical data landscape looks like today, your organization will benefit from building a data lake.

conceptual graphic of a data lake © Stuart Miles /

Five phases to building a data lake

Your data lake journey begins with a thorough understanding of today’s analytics and data throughout your entire organization. From there, you’ll methodically progress from conceptual, high-level activities into implementation. Follow these five phases, whose first letters spell out A LAKE:

  1. ASSESS your current state, and score the results.
  2. Prepare a LOFTY VISION for what your data lake will bring you, both technology-wise and in terms of business value.
  3. Decide on your data lake ARCHITECTURE, starting at the conceptual level and then shifting into specific products and services.
  4. Begin with your KICKOFF ACTIVITIES that will deliver the first end-to-end data pipelines that culminate in high-value analytics.
  5. Progressively EXPAND your data lake through subsequent phases.

Three types of data for a data lake

If you’ve been working primarily with traditional data warehouses and data marts, you’re in for a treat. Not only will your data lake include the structured data that you’re used to working with, but you’ll also ingest, manage, and deliver:

  • Semi-structured data, such as tweets, blog posts, and email messages
  • Unstructured data, such as photos, videos, and audio files

Your next generation of analytics will be built from the fusion of these various types of data. Sometimes, the insights you need aren’t just in the numbers or the stats, but in what you can learn from these other forms of data.
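
To make the three types concrete, here is a minimal sketch that sorts incoming files into structured, semi-structured, and unstructured data by file extension. The extension table and the `classify` function are hypothetical and for illustration only; a real data lake would rely on richer metadata than a file name, and treating unknown extensions as unstructured is an assumption of this sketch.

```python
from enum import Enum
from pathlib import Path


class DataKind(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


# Hypothetical mapping, chosen to mirror the examples in the text.
KIND_BY_SUFFIX = {
    ".csv": DataKind.STRUCTURED,        # rows and columns, warehouse-style
    ".json": DataKind.SEMI_STRUCTURED,  # for example, exported tweets
    ".html": DataKind.SEMI_STRUCTURED,  # blog posts
    ".eml": DataKind.SEMI_STRUCTURED,   # email messages
    ".jpg": DataKind.UNSTRUCTURED,      # photos
    ".mp4": DataKind.UNSTRUCTURED,      # videos
    ".wav": DataKind.UNSTRUCTURED,      # audio files
}


def classify(filename: str) -> DataKind:
    """Return the data kind for a file name, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    # Assumption: anything unrecognized is treated as unstructured.
    return KIND_BY_SUFFIX.get(suffix, DataKind.UNSTRUCTURED)


incoming = ["orders.csv", "tweets.json", "support_call.wav"]
kinds = {name: classify(name).value for name in incoming}
```

Here `kinds` maps each file name to one of the three labels, which is the sort of tag you might attach to data as it arrives.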

Four zones inside a data lake

From the 30,000-foot view, your data lake appears to be a large store of all types of data. When you peel the lid back, though, your data lake should be well organized into the following zones:

  • The bronze zone, where you ingest your raw data into inexpensive storage that is infinitely expandable . . . or at least pretty close to infinitely expandable!
  • The silver zone, where you store your formerly raw data that is now cleansed and enriched
  • The gold zone, where you store curated packages of data that are prepared to support users and analytical needs all across your enterprise
  • The sandbox, where you can quickly place data from elsewhere in your data lake — or even new data coming in from the outside — for experimental or short-term analysis
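
The flow between the zones can be sketched in a few lines of code. This is a toy, in-memory model under stated assumptions: the `DataLake` class, its method names, and the sample sales records are all hypothetical, and the silver step shown here only cleanses data (it does not enrich it). A real data lake would sit on object storage and a processing engine instead of Python dictionaries.

```python
import copy
from dataclasses import dataclass, field

ZONES = ("bronze", "silver", "gold", "sandbox")


@dataclass
class DataLake:
    # Each zone maps a dataset name to its contents.
    zones: dict = field(default_factory=lambda: {z: {} for z in ZONES})

    def ingest(self, name, rows):
        """Land raw records in the bronze zone exactly as they arrived."""
        self.zones["bronze"][name] = list(rows)

    def refine(self, name):
        """Cleanse bronze records into silver: drop rows with no amount,
        tidy up region names, and convert amounts to numbers."""
        raw = self.zones["bronze"][name]
        self.zones["silver"][name] = [
            {"region": r["region"].strip().title(), "amount": float(r["amount"])}
            for r in raw
            if r.get("amount") not in (None, "")
        ]

    def curate(self, name):
        """Build a gold package from silver: total amount per region."""
        totals = {}
        for r in self.zones["silver"][name]:
            totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
        self.zones["gold"][name + "_by_region"] = totals

    def copy_to_sandbox(self, zone, name):
        """Copy a dataset from any zone into the sandbox for experiments."""
        self.zones["sandbox"][name] = copy.deepcopy(self.zones[zone][name])


lake = DataLake()
lake.ingest("sales", [
    {"region": " east ", "amount": "10.5"},
    {"region": "WEST", "amount": ""},       # missing amount, dropped in silver
    {"region": "East", "amount": "4.5"},
])
lake.refine("sales")
lake.curate("sales")
lake.copy_to_sandbox("silver", "sales")
print(lake.zones["gold"]["sales_by_region"])  # prints {'East': 15.0}
```

Notice that the bronze copy is never modified. Keeping the raw data intact means you can rebuild the silver and gold zones later if your cleansing rules change.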

Supporting an entire analytics continuum

Your data lake will support a broad range of analytics in a coordinated, well-architected manner. Prepare to make use of:

  • Descriptive analytics, which tell you what happened in the past or what’s happening right now
  • Diagnostic analytics, which dig into your descriptive analytics and help you understand why something happened or is happening
  • Predictive analytics, which tell you what’s likely to happen
  • Discovery analytics, in which you turn your analytical power loose on mountains of data with a mission to surface interesting and important patterns and other insights, without your having to ask specific questions
  • Prescriptive analytics, which take all your other categories of analytics to the last mile and guide you to decision-making, present you with alternatives for taking action, and make a recommendation for your “best” course of action
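
To see how several of these categories build on one another, here is a small worked sketch on made-up numbers. Everything in it is an assumption for illustration: the monthly sales figures, the function names, the straight-line forecast, and especially the "recommend the region with the highest forecast" rule, which stands in for far richer prescriptive logic. Discovery analytics is left out because it doesn't reduce to a few lines.

```python
from statistics import mean

# Hypothetical monthly unit sales per region (months 1 to 4).
sales = {
    "East": [60.0, 65.0, 70.0, 75.0],
    "West": [40.0, 40.0, 40.0, 40.0],
}
# Company-wide sales per month: add the regions together month by month.
totals = [sum(month) for month in zip(*sales.values())]


def describe(series):
    """Descriptive: what happened."""
    return {"total": sum(series), "average": mean(series)}


def diagnose(by_region):
    """Diagnostic: why it happened. Change from first to last month, per region."""
    return {region: s[-1] - s[0] for region, s in by_region.items()}


def predict_next(series):
    """Predictive: fit a least-squares straight line and extend it one period."""
    n = len(series)
    xs = range(1, n + 1)
    x_bar, y_bar = mean(xs), mean(series)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (n + 1 - x_bar)


def prescribe(by_region):
    """Prescriptive (toy rule): recommend the region with the highest forecast."""
    forecasts = {region: predict_next(s) for region, s in by_region.items()}
    return max(forecasts, key=forecasts.get), forecasts


summary = describe(totals)           # total 430.0, average 107.5
drivers = diagnose(sales)            # East grew by 15.0, West by 0.0
next_month = predict_next(totals)    # 120.0 on this data
recommended, forecasts = prescribe(sales)
```

Each step consumes the same underlying data, which is the point of serving the whole continuum from one data lake instead of from separate, disconnected systems.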

About This Article

About the book author:

Alan R. Simon, author of Data Warehousing For Dummies, is a manager at Deloitte Consulting. Alan has experienced every side of stock options in public and pre-IPO companies, large Fortune 500 corporations, and small consulting firms.