Set Up the Hadoop Environment with Apache Bigtop

By Dirk deRoos

If you’re comfortable working with VMs and Linux, feel free to install Bigtop on a different VM than what is recommended. If you’re really bold and have the hardware, go ahead and try installing Bigtop on a cluster of machines in fully distributed mode!

Step 1: Downloading a VM

Hadoop runs on all popular Linux distributions, so you need a Linux VM. A free (and legal!) CentOS 6 image is available for download.

You will need a 64-bit operating system on your laptop in order to run this VM. Hadoop needs a 64-bit environment.

After you’ve downloaded the VM, extract it from the downloaded Zip file into the destination directory. Make sure you have around 50GB of free space available, because Hadoop and your sample data will need it.
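If your laptop runs Linux or macOS, you can check the free space from a terminal before extracting. The following is a minimal sketch under that assumption; has_free_space is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Succeeds when the filesystem holding DIR has at least MIN_GB gigabytes free.
# has_free_space is a hypothetical helper name used only for illustration.
has_free_space() {
    dir=$1
    min_gb=$2
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, has_free_space "$HOME" 50 && echo "enough room" prints the message only when at least 50GB is free on the filesystem that holds your home directory.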

If you don’t already have a VM player, you can download one for free.

After you have your VM player set up, open the player, go to File→Open, then go to the directory where you extracted your Linux VM. Look for the VM file there and select it. You’ll see information on how many processors and how much memory it will use. Find out how much memory your computer has, and allocate half of it for the VM to use. Hadoop needs lots of memory.

Once you’re ready, click the Play button, and your Linux instance will start up. You’ll see lots of messages fly by as Linux is booting and you’ll come to a login screen. The user name is already set to “Tom.” Specify the password as “tomtom” and log in.

Step 2: Downloading Bigtop

From within your Linux VM, right-click on the screen and select Open in Terminal from the contextual menu that appears. This opens a Linux terminal, where you can run commands. Click inside the terminal so you can see the cursor blinking and enter the following command: su -

You’ll be asked for your password, so type “tomtom” like you did earlier. This command switches the user to root, which is the master account for a Linux computer — you’ll need this in order to install Hadoop.
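If you want to confirm that the switch worked, you can check the numeric user ID, which is always 0 for root. This is a small sketch; is_root is a hypothetical helper name.

```shell
#!/bin/sh
# id -u prints the numeric ID of the current user; root is always 0.
# is_root is a hypothetical helper name used only for illustration.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

if is_root; then
    echo "running as root"
else
    echo "not root yet: run 'su -' first"
fi
```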

With your root access (don’t let the power get to your head), run the following command:

wget -O /etc/yum.repos.d/bigtop.repo

The wget command is essentially a web request: it fetches the file at the given URL and writes it to the path named after -O, which in this case is /etc/yum.repos.d/bigtop.repo.

Step 3: Installing Bigtop

The geniuses behind Linux have made life quite easy for people who need to install big software packages like Hadoop. What you downloaded in the last step wasn’t the entire Bigtop package and all its dependencies. It was just a repository file (with the .repo extension), which tells an installer program which software packages are needed for the Bigtop installation.

Like any big software product, Hadoop has lots of prerequisites, but you don’t need to worry. A well-designed file will point to any dependencies, and the installer is smart enough to see if they’re missing on your computer and then download and install them.
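To get a feel for what a repository file holds, it helps to know that yum repository files are plain text, with one [section] per repository and key=value settings such as baseurl underneath. The sketch below builds a throwaway sample file and prints the repository IDs it defines. The ID, name, and URL in the sample are placeholders, not the contents of the real bigtop.repo, and list_repo_ids is a hypothetical helper name.

```shell
#!/bin/sh
# Print the repository IDs (the [section] names) defined in a yum .repo file.
# list_repo_ids is a hypothetical helper name used only for illustration.
list_repo_ids() {
    sed -n 's/^\[\(.*\)\][[:space:]]*$/\1/p' "$1"
}

# Build a throwaway sample file. Every value below is a placeholder.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[example-repo]
name=Example repository
baseurl=http://example.com/repo/centos6/x86_64
enabled=1
gpgcheck=0
EOF

list_repo_ids "$sample"   # prints: example-repo
rm -f "$sample"
```

On the VM, running list_repo_ids /etc/yum.repos.d/bigtop.repo would list whatever repository IDs the downloaded file defines.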

The installer you’re using here is called yum, which you get to see in action now:

yum install hadoop* mahout* oozie* hbase* hive* hue* pig* zookeeper*

Notice that you’re picking and choosing the Hadoop components to install. There are a number of other components available in Bigtop, but these are the only ones you’ll be using here. Since the VM is a fresh Linux install, you’ll need many dependencies, so you’ll need to wait a bit.
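One detail worth knowing about the asterisks: the shell, not yum, gets the first look at them. If the current directory happens to contain a file whose name matches a pattern such as hadoop*, the shell replaces the pattern with that file name before yum ever sees it. The sketch below demonstrates the behavior in a scratch directory, assuming a standard sh or bash shell with default settings.

```shell
#!/bin/sh
# Work in an empty scratch directory so the result is predictable.
start_dir=$PWD
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# No matching file yet: the pattern reaches the command unchanged.
before=$(echo hadoop*)

# Create a file whose name matches the pattern.
touch hadoop-notes.txt

# Unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(echo hadoop*)
# Quoted: the pattern is passed through literally.
quoted=$(echo 'hadoop*')

echo "before:   $before"
echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up and return to where we started.
cd "$start_dir" || exit 1
rm -rf "$scratch"
```

In a fresh root login there is normally nothing to match, so the command as written works. If you want to rule out surprises, quote each pattern, as in yum install 'hadoop*' 'mahout*' and so on.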

The yum installer is quite verbose, so you can watch exactly what’s being downloaded and installed to pass the time. When the install process is done, you should see a message that says “Complete!”

Step 4: Starting Hadoop

Before you start running applications on Hadoop, there are a few basic configuration and setup tasks you need to do. Here they are in order:

  1. Download and install Java:

        yum install java-1.7.0-openjdk-devel.x86_64
  2. Format the NameNode:

        sudo /etc/init.d/hadoop-hdfs-namenode init
  3. Start the Hadoop services for your pseudodistributed cluster:

        for i in hadoop-hdfs-namenode hadoop-hdfs-datanode; do sudo service $i start; done
  4. Create a sub-directory structure in HDFS:

        sudo /usr/lib/hadoop/libexec/
  5. Start the YARN daemons:

        sudo service hadoop-yarn-resourcemanager start
        sudo service hadoop-yarn-nodemanager start

And with that, you’re done. Congratulations! You’ve installed a working Hadoop deployment!
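If you’d like to double-check that the four daemons from the steps above are actually up, you can ask each service for its status. The following is a sketch that assumes the Bigtop init scripts follow the usual convention of returning a nonzero exit code from status when a service is not running; check_hadoop_services is a hypothetical helper name.

```shell
#!/bin/sh
# The four services started in the steps above.
HADOOP_SERVICES="hadoop-hdfs-namenode hadoop-hdfs-datanode hadoop-yarn-resourcemanager hadoop-yarn-nodemanager"

# Report the status of each service and return nonzero if any is down.
# check_hadoop_services is a hypothetical helper name used only for illustration.
check_hadoop_services() {
    failed=0
    for svc in $HADOOP_SERVICES; do
        # Assumes "service NAME status" exits with 0 only when the service is running.
        if service "$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: NOT running"
            failed=1
        fi
    done
    return "$failed"
}
```

Run check_hadoop_services as root after the last step. If a service is reported as not running, starting it again with service and reading the messages it prints is a reasonable first move.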

Step 5: Downloading the sample data set

To download the sample data set, open the Firefox browser from within the VM, and go to the dataexpo page.

You won’t need the entire data set, so start with a single year, 1987. When you’re about to download, select the Open with Archive Manager option.

After your file has downloaded, extract it to a location where you’ll easily be able to find it: click the Extract button, and then select the Desktop directory.

Step 6: Copying the sample data set into HDFS

Remember that your Hadoop programs can only work with data after it’s stored in HDFS. So what you’re going to do now is copy the flight data file for 1987 into HDFS. From the directory that contains 1987.csv, enter the following command:

hdfs dfs -copyFromLocal 1987.csv /user/root
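A slightly more defensive version of the copy might look like the sketch below. It checks that the local file exists, makes sure the target directory is there, copies the file, and then lists the directory so you can see the result. It assumes the current user has write permission on the target HDFS directory; copy_to_hdfs is a hypothetical helper name.

```shell
#!/bin/sh
# Copy a local file into an HDFS directory and list the directory afterward.
# copy_to_hdfs is a hypothetical helper name used only for illustration.
copy_to_hdfs() {
    local_file=$1
    hdfs_dir=$2
    if [ ! -f "$local_file" ]; then
        echo "missing local file: $local_file" >&2
        return 1
    fi
    # -mkdir -p creates the directory if needed and succeeds if it already exists.
    hdfs dfs -mkdir -p "$hdfs_dir" &&
    hdfs dfs -copyFromLocal "$local_file" "$hdfs_dir" &&
    hdfs dfs -ls "$hdfs_dir"
}
```

Calling copy_to_hdfs 1987.csv /user/root does the same job as the single command above, with the listing at the end as confirmation. Note that -copyFromLocal reports an error if the file already exists in the target directory.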