Big Data: Management

Image Classification with Hadoop

Image classification requires a significant amount of data-processing resources, which has limited the scale of deployments. Image classification is a hot topic in the Hadoop world because no [more…]

How to Choose a Hadoop Distribution

Commercial Hadoop distributions offer various combinations of open source components from the Apache Software Foundation and elsewhere — the idea is that the various components have been integrated into [more…]

How to Choose a Hadoop Cluster Architecture

Hadoop is designed to be deployed on a large cluster of networked computers, featuring master nodes (which host the services that control Hadoop’s storage and processing) and slave nodes [more…]

Apache Bigtop and Hadoop

To help you get started with Hadoop, here are instructions on how to quickly download and set up Hadoop on your own laptop computer. Your cluster will be running in pseudo-distributed mode on a virtual [more…]

Set Up the Hadoop Environment with Apache Bigtop

If you’re comfortable working with VMs and Linux, feel free to install Bigtop on a different VM than what is recommended. If you’re really bold and have the hardware, go ahead and try installing Bigtop [more…]

Your First Hadoop Program: Hello Hadoop!

After the Hadoop cluster is installed and running, you can run your first Hadoop program. This application is very simple, and calculates the total miles flown for all flights in one year. The year [more…]

Data Blocks in the Hadoop Distributed File System (HDFS)

When you store a file in HDFS, the system breaks it down into a set of individual blocks and stores these blocks in various slave nodes in the Hadoop cluster. This is an entirely normal thing to do, as [more…]
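As a toy illustration (plain Python, not Hadoop code), here is how a file's size translates into a set of fixed-size blocks; the 128MB block size is an assumption taken from the example used later in this listing:

```python
# Toy sketch (plain Python, not Hadoop code): how HDFS carves a file
# into fixed-size blocks before scattering them across slave nodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, in bytes (assumed default)

def split_into_blocks(file_size):
    """Return the sizes of the blocks for a file of file_size bytes."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(BLOCK_SIZE, remaining))
        remaining -= BLOCK_SIZE
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
```

Note that only the last block is smaller than the block size; every other block is full, which is part of what makes block placement and scheduling in Hadoop so uniform.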

Replicating Data Blocks in the Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is designed to store data on inexpensive (and therefore less reliable) hardware. Inexpensive has an attractive ring to it, but it does raise concerns about the reliability [more…]
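A quick back-of-the-envelope sketch of what replication costs: HDFS's default replication factor is 3, so every block lives on three different slave nodes, and the raw disk space consumed is three times the logical data size. This is plain Python arithmetic, not HDFS code:

```python
# Toy sketch: the storage cost of HDFS's default replication factor of 3.
# Each block is stored on three different slave nodes, so raw disk usage
# is three times the logical file size.
REPLICATION_FACTOR = 3  # the HDFS default

def raw_storage_bytes(logical_bytes, replication=REPLICATION_FACTOR):
    """Raw cluster disk space consumed by logical_bytes of data."""
    return logical_bytes * replication

# A 1 TB data set occupies 3 TB of raw disk across the cluster.
raw = raw_storage_bytes(1 * 1024**4)
```

The trade works because cheap disks multiplied by three still cost less than the specialized, highly reliable storage that Hadoop was designed to avoid.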

Slave Node and Disk Failures in HDFS

Like death and taxes, disk failures (and, given enough time, even node or rack failures) are inevitable in the Hadoop Distributed File System (HDFS). In the example shown, even if one rack were to fail, the [more…]

Slave Nodes in the Hadoop Distributed File System (HDFS)

In a Hadoop cluster, each data node (also known as a slave node) runs a background process named DataNode. This background process (also known as a daemon [more…]

Keep Track of Data Blocks with NameNode in HDFS

The NameNode acts as the address book for Hadoop Distributed File System (HDFS) because it knows not only which blocks make up individual files but also where each of these blocks and their replicas are [more…]
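The "address book" idea can be sketched as two lookup tables: one from file paths to ordered block lists, and one from blocks to the slave nodes holding replicas. This is a toy Python model (the paths, block IDs, and node names are made up), not the NameNode's actual in-memory structures:

```python
# Toy model of the NameNode's two mappings: file -> ordered block IDs,
# and block ID -> slave nodes holding a replica. All names are invented.
namenode = {
    "files": {"/flights/2013.csv": ["blk_1", "blk_2"]},
    "blocks": {
        "blk_1": ["node3", "node7", "node9"],
        "blk_2": ["node1", "node4", "node8"],
    },
}

def locate(path):
    """Return (block_id, replica_nodes) pairs for a file, in order."""
    return [(blk, namenode["blocks"][blk])
            for blk in namenode["files"][path]]
```

A client reading a file asks the NameNode for exactly this kind of answer, then fetches the block data directly from the slave nodes; the NameNode itself never serves file contents.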

Checkpointing Updates in Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is a journaled file system, where new changes to files in HDFS are captured in an edit log that’s stored on the NameNode. Periodically, when the file [more…]
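The journaling-plus-checkpoint pattern can be shown in miniature: operations accumulate in an edit log, and a checkpoint replays that log against the last saved namespace image and then empties the log (in HDFS this merge is offloaded to a checkpointing node so the NameNode's log doesn't grow without bound). A minimal Python sketch, with an invented two-operation log:

```python
# Toy sketch of journaled checkpointing: changes accumulate in an edit
# log; a checkpoint replays the log against the last saved image and
# empties the log. Operations and paths here are invented.
fsimage = {"/a": "file"}                    # last checkpointed image
edit_log = [("create", "/b"), ("delete", "/a")]

def checkpoint(image, log):
    """Merge the edit log into the image; return (new_image, empty_log)."""
    image = dict(image)
    for op, path in log:
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image, []

fsimage, edit_log = checkpoint(fsimage, edit_log)
```

After the checkpoint, a NameNode restart only has to load the compact image plus whatever few edits arrived since, instead of replaying a huge log from scratch.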

Hadoop Distributed File System (HDFS) Federation

The solution to expanding Hadoop clusters indefinitely is to federate the NameNode. Before Hadoop 2 entered the scene, Hadoop clusters had to live with the fact that NameNode placed limits on the degree [more…]

Hadoop Distributed File System (HDFS) High Availability

In Hadoop’s infancy, a great deal of discussion centered on the NameNode as a single point of failure. Hadoop, overall, has always had a robust and failure-tolerant architecture [more…]

Compressing Data in Hadoop

The huge data volumes that are realities in a typical Hadoop deployment make compression a necessity. Data compression definitely saves you a great deal of storage space and is sure to speed up the movement [more…]
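A quick stand-in for the benefit described here: repetitive data (log files and the like) compresses very well, shrinking both what must be stored and what must move across the network. Python's built-in gzip stands in below for the codecs Hadoop actually uses; the sample record is invented:

```python
# Stand-in demonstration: compressing repetitive, log-like data saves
# most of the space. Python's stdlib gzip substitutes here for the
# compression codecs Hadoop deployments actually configure.
import gzip

record = b"2013-01-01,JFK,LAX,2475\n"   # an invented flight record
data = record * 10_000                   # a repetitive, log-like payload
compressed = gzip.compress(data)

# The compressed payload is a small fraction of the original size,
# and decompression recovers the data exactly.
ratio = len(compressed) / len(data)
```

One Hadoop-specific wrinkle worth remembering: not every codec produces splittable output, and a non-splittable compressed file can only be processed by a single mapper, so codec choice affects parallelism as well as space.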

Managing Files with the Hadoop File System Commands

HDFS is one of the two main components of the Hadoop framework; the other is the computational paradigm known as MapReduce. A distributed file system is a file system that manages storage across a networked [more…]

Log Data with Flume in HDFS

Some of the data that ends up in the Hadoop Distributed File System (HDFS) might land there via database load operations or other types of batch processes, but what if you want to capture the data that’s [more…]

The Importance of MapReduce in Hadoop

For most of Hadoop’s history, MapReduce has been the only game in town when it comes to data processing. The availability of MapReduce has been the reason for Hadoop’s success and at the same time a major [more…]

The MapReduce Application Flow in Hadoop

At its core, MapReduce is a programming model for processing data sets that are stored in a distributed manner across a Hadoop cluster’s slave nodes. The key concept here is [more…]

Input Splits in Hadoop’s MapReduce

The way HDFS has been set up, it breaks down very large files into large blocks (for example, measuring 128MB), and stores three copies of these blocks on different nodes in the cluster. HDFS has no awareness [more…]

The Map Phase of Hadoop’s MapReduce Application Flow

A MapReduce application processes the data in input splits on a record-by-record basis; each record is understood by MapReduce to be a key/value [more…]

The Shuffle Phase of Hadoop’s MapReduce Application Flow

After the Map phase and before the beginning of the Reduce phase is a handoff process, known as shuffle and sort. Here, data from the mapper tasks is prepared and moved to the nodes where the reducer tasks [more…]
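The whole map, shuffle-and-sort, reduce flow fits in a few lines of plain Python (the real implementation is Java and distributed across nodes; this is only a single-process sketch). A word count keeps the example small:

```python
# Toy end-to-end MapReduce flow in one process: map emits key/value
# pairs, shuffle-and-sort groups the pairs by key, reduce aggregates
# each group. Word count is the traditional minimal example.
from collections import defaultdict

def map_phase(lines):
    for line in lines:                 # each input record...
        for word in line.split():
            yield (word, 1)            # ...becomes key/value pairs

def shuffle_phase(pairs):
    groups = defaultdict(list)         # all values for a key end up
    for key, value in pairs:           # together, as if routed to the
        groups[key].append(value)      # same reducer task
    return sorted(groups.items())      # sorted by key, as the real
                                       # shuffle-and-sort does

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups}

counts = reduce_phase(shuffle_phase(map_phase(["a b a", "b a"])))
# counts == {"a": 3, "b": 2}
```

The shuffle is the only phase that moves data between nodes in a real cluster, which is why it is usually the most expensive part of a MapReduce job.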

How to Write MapReduce Applications

The MapReduce API is written in Java, so MapReduce applications are primarily Java-based. The following list specifies the components of a MapReduce application that you can develop: [more…]

Running Applications Before Hadoop 2

Because many existing Hadoop deployments still aren’t using Yet Another Resource Negotiator (YARN), take a quick look at how Hadoop managed its data processing before the days of Hadoop 2. Concentrate [more…]

Tracking JobTracker and TaskTracker in Hadoop 1

MapReduce processing in Hadoop 1 is handled by the JobTracker and TaskTracker daemons. The JobTracker maintains a view of all available processing resources in the Hadoop cluster and, as application requests [more…]

