How to Get Started with Apache Hive
There’s no better way to see what’s what than to install the Hive software and give it a test run. As with other technologies in the Hadoop ecosystem, it doesn’t take long to get started.
If you have the time and the network bandwidth, it’s always best to download an entire Apache Hadoop distribution with all the technologies integrated and ready to run.
If you take the full-distribution route, a popular approach for learning the ins and outs of Hive is to run your Hadoop distribution in a Linux virtual machine (VM) on a 64-bit-capable laptop with sufficient RAM. (Eight gigabytes or more of RAM tends to work well if Windows 7 is hosting your VM.)
You also need Java 6 or later and, of course, a supported operating system: Linux, Mac OS X, or Windows with Cygwin, which provides a Linux shell for Windows users.
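Before going further, it can help to confirm that the Java requirement is met. The following is a minimal sketch, not an official check: the helper name java_major is made up for illustration, and it assumes that java -version prints a quoted version string such as "1.6.0_45" or "11.0.2".

```shell
# Extract the major version from a Java version string.
# Handles the legacy "1.6.0_45" scheme and the newer "11.0.2" scheme.
# java_major is a hypothetical helper name, not part of Java or Hive.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;      # 1.6.0_45 -> 6.0_45
  esac
  echo "${v%%[._-]*}"        # keep only the leading number
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, for example: java version "1.6.0_45"
  ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  if [ "$(java_major "$ver")" -ge 6 ] 2>/dev/null; then
    echo "Java $ver looks new enough."
  else
    echo "Java $ver may be too old (or its version could not be parsed)."
  fi
else
  echo "No java found on PATH."
fi
```

The script only reports what it finds; it doesn't install or change anything.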
The setup steps run something like this:
1. Download the latest Hive release.
You also need the Hadoop and MapReduce subsystems, so be sure to complete Step 2.
2. Download Hadoop version 1.2.1.
3. Using the commands in the following listing, place the releases in separate directories, and then uncompress and untar them.
(Untar is one of those pesky Unix terms that simply means to expand an archived software package.)
    $ mkdir hadoop; cp hadoop-1.2.1.tar.gz hadoop; cd hadoop
    $ gunzip hadoop-1.2.1.tar.gz
    $ tar xvf *.tar
    $ cd ..
    $ mkdir hive; cp hive-0.11.0.tar.gz hive; cd hive
    $ gunzip hive-0.11.0.tar.gz
    $ tar xvf *.tar
4. Using the commands in the following listing, set up your Apache Hive environment variables, including HADOOP_HOME, JAVA_HOME, HIVE_HOME, and PATH, in your shell profile script.
    export HADOOP_HOME=/home/user/Hive/hadoop/hadoop-1.2.1
    export JAVA_HOME=/opt/jdk
    export HIVE_HOME=/home/user/Hive/hive-0.11.0
    export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:$JAVA_HOME/bin:$PATH
5. Create the Hive configuration file that you’ll use to define specific Hive configuration settings.
The Apache Hive distribution includes a template configuration file that provides all default settings for Hive. To customize Hive for your environment, all you need to do is copy the template file to a file named hive-site.xml and edit it.
Using your favorite editor, modify the hive-site.xml file so that it includes only the hive.metastore.warehouse.dir property for now. When you finish, the file looks like the XML in the following listing. Note that the comments were removed to shorten the listing:
    $ cd $HIVE_HOME/conf
    $ cp hive-default.xml.template hive-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <!-- Hive Execution Parameters -->
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/home/biadmin/Hive/warehouse</value>
        <description>location of default database for the warehouse</description>
      </property>
    </configuration>
Because you’re running Hive in stand-alone mode on a virtual machine rather than in a real-life Apache Hadoop cluster, configure the system to use local storage rather than the HDFS: simply set the hive.metastore.warehouse.dir parameter to a directory on the local file system. When you start a Hive client, the $HIVE_HOME environment variable tells the client to look for your configuration file (hive-site.xml) in the conf directory.
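With the steps above complete, a quick sanity check can catch a mistyped path before you start Hive. The following sketch assumes only what the steps describe: the three environment variables are set, each points to an existing directory, and hive-site.xml sits in $HIVE_HOME/conf. The function name check_hive_env is made up for illustration.

```shell
# Verify the environment variables and configuration file described in the steps.
# check_hive_env is a hypothetical helper name, not part of Hive or Hadoop.
check_hive_env() {
  status=0
  for name in HADOOP_HOME JAVA_HOME HIVE_HOME; do
    eval "dir=\${$name:-}"          # read the variable whose name is in $name
    if [ -z "$dir" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$dir" ]; then
      echo "$name points to a missing directory: $dir" >&2
      status=1
    fi
  done
  # The Hive client looks for its configuration under $HIVE_HOME/conf.
  if [ -n "${HIVE_HOME:-}" ] && [ ! -f "$HIVE_HOME/conf/hive-site.xml" ]; then
    echo "hive-site.xml not found in $HIVE_HOME/conf" >&2
    status=1
  fi
  return "$status"
}

check_hive_env || echo "Fix the settings reported above before starting Hive."
```

Run it in a new shell after editing your profile script, so that you test the values a fresh login actually sees.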
Both Hadoop and Hive support a local mode configuration. If you already have a Hadoop cluster configured and running, you need to set the hive.metastore.warehouse.dir configuration variable to the HDFS directory where you intend to store your Hive warehouse, set the mapred.job.tracker configuration variable to point to your Hadoop JobTracker, and (most likely) set up a distributed metastore.
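For the existing-cluster case, the two properties named above might be written out as follows. Treat this as a sketch under loud assumptions: the JobTracker host and port (jobtracker.example.com:8021) are placeholders, /user/hive/warehouse is simply a commonly used HDFS location, the function name write_cluster_site is made up for illustration, and the distributed metastore settings are left out entirely.

```shell
# Write a hive-site.xml sketch for an existing cluster to the file named in $1.
# JOBTRACKER and WAREHOUSE_DIR are hypothetical placeholders; set your own values.
JOBTRACKER="${JOBTRACKER:-jobtracker.example.com:8021}"
WAREHOUSE_DIR="${WAREHOUSE_DIR:-/user/hive/warehouse}"   # an HDFS path

write_cluster_site() {
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>${WAREHOUSE_DIR}</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>${JOBTRACKER}</value>
  </property>
</configuration>
EOF
}

# Write to a scratch file so that an existing hive-site.xml is not overwritten.
write_cluster_site "${TMPDIR:-/tmp}/hive-site.cluster.xml"
```

Review the generated file, and then copy it to $HIVE_HOME/conf/hive-site.xml when the values match your cluster.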
That’s all you need to do to get started with Apache Hive!
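To confirm that the installation responds, you can ask the Hive command-line client to run a single statement and exit by using its -e option. The following sketch wraps that call in a made-up helper, smoke_test_hive, which first checks that the hive command is reachable through the PATH you set earlier.

```shell
# Run one harmless statement through the Hive CLI as a smoke test.
# smoke_test_hive is a hypothetical helper name, not part of Hive.
smoke_test_hive() {
  if ! command -v hive >/dev/null 2>&1; then
    echo "hive not found on PATH; re-check HIVE_HOME and PATH" >&2
    return 1
  fi
  # -e runs the quoted statement and then exits the client.
  hive -e 'SHOW TABLES;'
}

smoke_test_hive || echo "Hive smoke test did not pass."
```

On a fresh installation, the statement should simply return an empty list of tables.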