Your First Hadoop Program: Hello Hadoop! - dummies

By Dirk deRoos

After the Hadoop cluster is installed and running, you can run your first Hadoop program. This application is very simple: it calculates the total miles flown by all flights in one year. The year is determined by the data file that the application reads.

To keep things a bit simpler here, you’ll run a Pig script to calculate the total miles flown. You will see the map and reduce phases fly by in the output.

Here is the code for this Pig script:

records = LOAD '2013_subset.csv' USING PigStorage(',') AS
    (…,
    Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,
    LateAircraftDelay);
milage_recs = GROUP records ALL;
tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
STORE tot_miles INTO '/user/root/totalmiles';
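The data flow of the script can be sketched in plain Python to show what each statement contributes. This is only an illustration, not how Pig executes the job: the three-row sample and the shortened three-field schema (Origin, Dest, Distance) are made-up assumptions, and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical three-row sample; the real 2013_subset.csv has many more fields.
SAMPLE_CSV = "JFK,LAX,2475\nORD,DFW,802\nSEA,SFO,679\n"
FIELDS = ["Origin", "Dest", "Distance"]  # assumed, shortened schema


def load(text, fields):
    """Mimic LOAD ... USING PigStorage(',') AS (...): one dict per input line."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(fields, row)) for row in reader]


def group_all(records):
    """Mimic GROUP records ALL: a single group that holds every record."""
    return [("all", records)]


def sum_distance(groups):
    """Mimic FOREACH ... GENERATE SUM(records.Distance): one total per group."""
    return [sum(int(rec["Distance"]) for rec in bag) for _, bag in groups]


records = load(SAMPLE_CSV, FIELDS)
mileage_groups = group_all(records)
tot_miles = sum_distance(mileage_groups)
print(tot_miles[0])  # prints 3956
```

Because GROUP ... ALL places every record in one group, the final relation holds exactly one row, which is why the job ends with a single number.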

Put this code in a file on your VM. Right-click the desktop of your VM, select Create Document from the contextual menu that appears, and name the document totalmiles.pig. Then open the document in an editor, paste in the code, and save the file.

From the command line, in the directory that contains the file, run the Pig script with the following command:

pig totalmiles.pig

You will see many lines of output, then a “Success!” message, followed by more statistics, and finally the command prompt. After the Pig job has completed, you can view its output:

hdfs dfs -cat /user/root/totalmiles/part-r-00000
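If you would rather pick up the result from a program than read it off the terminal, here is a minimal Python sketch. It assumes that the hdfs command is on the PATH and that the job wrote a single line holding one integer; the helper names cat_hdfs and parse_total are hypothetical.

```python
import subprocess


def cat_hdfs(path):
    """Return the text of an HDFS file by shelling out to `hdfs dfs -cat`."""
    result = subprocess.run(
        ["hdfs", "dfs", "-cat", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout


def parse_total(text):
    """Parse the job output, assumed to be one non-empty line with one integer."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 1:
        raise ValueError(f"expected exactly one value, got {len(lines)} lines")
    return int(lines[0])


# Example (requires a running cluster):
# total = parse_total(cat_hdfs("/user/root/totalmiles/part-r-00000"))
```

Keeping the parsing separate from the HDFS call means the parsing step can be checked without a cluster.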

Drumroll, please… And the answer is: 775009272

And with that, you’ve run your first Hadoop application!