Big Data: Management

Developing Oozie Workflows in Hadoop

Oozie workflows are, at their core, directed graphs, where you can define actions (Hadoop applications) and data flow, but with no looping — meaning you can’t define a structure where you’d run a specific [more…]

The Reduce Phase of Hadoop’s MapReduce Application Flow

The Reduce phase processes the keys and their individual lists of values so that what’s normally returned to the client application is a set of key/value pairs. Here’s the blow-by-blow so far: A large [more…]
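
As a rough illustration of the idea in this teaser, the Reduce phase receives each key together with its list of values and returns key/value pairs. The sketch below is plain Python, not Hadoop's API. The function names (`group_by_key`, `reduce_phase`, `sum_reducer`) and the word-count data are invented for illustration only.

```python
from collections import defaultdict


def group_by_key(mapped_pairs):
    """Collect values per key and sort by key, standing in for shuffle and sort."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped, reducer):
    """Apply the reducer to each key and its list of values."""
    return [(key, reducer(key, values)) for key, values in grouped.items()]


def sum_reducer(key, values):
    """Hypothetical reducer: add up the values for one key."""
    return sum(values)


# Hypothetical map output: (word, 1) pairs from a word count.
mapped = [("hadoop", 1), ("pig", 1), ("hadoop", 1)]
result = reduce_phase(group_by_key(mapped), sum_reducer)
print(result)
```

Each key appears once in the result, paired with a single reduced value, which matches the "set of key/value pairs" the teaser describes.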

Local and Distributed Modes of Running Pig Scripts in Hadoop

Before you can run your first Pig script in Hadoop, you need to have a handle on how Pig programs can be packaged with the Pig server.

Pig has two modes for running scripts: [more…]

Scheduling and Coordinating Oozie Workflows in Hadoop

After you’ve created a set of workflows, you can use a series of Oozie coordinator jobs to schedule when they’re executed. You have two scheduling options for execution: a specific time and the availability [more…]

NoSQL Data Stores versus Hadoop

NoSQL data stores originally subscribed to the notion “Just Say No to SQL” (to paraphrase from an anti-drug advertising campaign in the 1980s), and they were a reaction to the perceived limitations of [more…]

ACID versus BASE Data Stores

One hallmark of relational database systems is something known as ACID compliance. As you might have guessed, ACID is an acronym — the individual letters, meant to describe a characteristic of individual [more…]

Structured Data Storage and Processing in Hadoop

When considering Hadoop’s capabilities for working with structured data (or working with data of any type, for that matter), remember Hadoop’s core characteristics: Hadoop is, first and foremost, a general-purpose [more…]

The Hadoop-Based Landing Zone

When you try to puzzle out what an analytics environment might look like in the future, you stumble across the pattern of the Hadoop-based landing zone time and time again. In fact, it’s no longer even [more…]

Hadoop as a Queryable Archive of Cold Warehouse Data

A multitude of studies show that most data in an enterprise data warehouse is rarely queried. Database vendors have responded to such observations by implementing their own methods for sorting out what [more…]

Hadoop as an Archival Data Destination

The low cost of storage for Hadoop, plus the ability to query Hadoop data with SQL, makes Hadoop the prime destination for archival data. This use case has a low impact on your organization because [more…]

Hadoop as a Data Preprocessing Engine

One of the earliest use cases for Hadoop in the enterprise was as a programmatic transformation engine used to preprocess data bound for a data warehouse. Essentially, this use case leverages the power [more…]

The Hybrid Data Preprocess Option in Hadoop

In addition to having to store larger volumes of cold data, one pressure you see in traditional data warehouses is that increasing amounts of processing resources are being used for transformation [more…]

Data Transformation in Hadoop

The idea of Hadoop-inspired ETL engines has gained a lot of traction in recent years. After all, Hadoop is a flexible data storage and processing platform that can support huge amounts of data and operations [more…]

Data Discovery and Sandboxes in Hadoop

Data discovery is becoming an increasingly important activity for organizations that rely on their data to be a differentiator. Today, that describes most businesses, as the ability to see trends and extract [more…]

The Attributes of HBase

HBase (Hadoop Database) is a Java implementation of Google’s BigTable. Google defines BigTable as a “sparse, distributed, persistent multidimensional sorted map.” It’s quite a concise definition, but you’ll [more…]

Row Keys in the HBase Data Model

HBase data stores consist of one or more tables, which are indexed by row keys. Data is stored in rows with columns, and rows can have multiple versions. By default, data versioning for rows is implemented [more…]

Column Families in the HBase Data Model

In the HBase data model, columns are grouped into column families, which must be defined up front during table creation. Column families are stored together on disk, which is why HBase is referred to as [more…]

Column Qualifiers in the HBase Data Model

In the HBase data model, column qualifiers are specific names assigned to your data values in order to make sure you’re able to accurately identify them. Unlike column families, column qualifiers can be [more…]

Data Versions in the HBase Data Model

You can see a number between the column qualifier and value (‘FN’: 1383859182496: ‘John’, for example). That number is the version number for each value in the table. Values stored in HBase are timestamped [more…]

Key Value Pairs in the HBase Data Model

The logical HBase data model is simple yet elegant, and it provides a natural data storage mechanism for all kinds of data — especially unstructured big data sets. All the parts of the data model converge [more…]
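
The pieces named in the entries above (row keys, column families, column qualifiers, and timestamped versions) can be pictured as nested maps that end in a value. The sketch below is a toy in plain Python, not the HBase client API. The `ToyTable` class, the `CustomerName` family, the row key `00001`, and the second timestamp are all hypothetical. Only the `FN` qualifier, the timestamp 1383859182496, and the value `John` come from the listing.

```python
class ToyTable:
    """Toy model: row key -> (family, qualifier) -> timestamp -> value."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers need no prior definition; they are created on write.
        cell = self.rows.setdefault(row_key, {}).setdefault((family, qualifier), {})
        cell[timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.rows[row_key][(family, qualifier)]
        if timestamp is None:
            timestamp = max(versions)  # newest version by default
        return versions[timestamp]

    def row_keys(self):
        # Rows are kept in sorted row-key order.
        return sorted(self.rows)


table = ToyTable(families=["CustomerName"])
table.put("00001", "CustomerName", "FN", "John", 1383859182496)
table.put("00001", "CustomerName", "FN", "Jonathan", 1383859183000)  # hypothetical later write

latest = table.get("00001", "CustomerName", "FN")
original = table.get("00001", "CustomerName", "FN", timestamp=1383859182496)
print(latest, original)
```

Reading a cell therefore takes the full coordinate (row key, column family, column qualifier, and optionally a timestamp), which is the sense in which the data model converges on key/value pairs.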

RegionServers in HBase

RegionServers are the software processes (often called daemons) you activate to store and retrieve data in HBase (Hadoop Database). In production environments, each RegionServer is deployed on its own [more…]

Regions in HBase

RegionServers are one thing, but you also have to take a look at how individual regions work. In HBase, a table is both spread across a number of RegionServers and made up of individual regions [more…]

Compactions in HBase

Compaction, the process by which HBase cleans up after itself, comes in two flavors: major and minor. Major compactions can be a big deal, but first you need to understand minor compactions. [more…]

The HBase MasterServer

Starting a discussion of HBase (Hadoop Database) architecture by describing RegionServers instead of the MasterServer may surprise you. The term RegionServer [more…]

Zookeeper and HBase Reliability

Zookeeper is a distributed cluster of servers that collectively provides reliable coordination and synchronization services for clustered applications. Admittedly, the name “Zookeeper” may seem at first [more…]
