Big Data Framework Technologies You Should Know to Get a Job in Big Data
Aside from storing information, there are several important frameworks for organizing, accessing, and analyzing big data. There are four important technologies that you need to be familiar with or skilled in, depending on the big data role you’re pursuing.
The Hadoop framework
The Hadoop framework is an Apache open-source project: not a standalone technology, but a collection of technologies. Hadoop has many implementations used by popular big data vendors like Amazon Web Services, Cloudera, Hortonworks, and MapR.
Hadoop allows for very high-speed processing of big data by using a MapReduce strategy. MapReduce is a programming model used to process large amounts of data across parallel clustered systems. It runs its workloads against files stored in a file system framework, such as the Hadoop Distributed File System (HDFS), or even against structured datasets. As you may have guessed from the name MapReduce, there are two steps in the process:
Mapping: A master node takes large jobs and maps them to smaller worker nodes to do the work. In some cases, a worker node may further split its workload across additional nodes. (A map step is like a WHERE clause in a SQL statement.)
Reducing: When the work is done by the worker nodes, the master node collects the “answers” and assembles the results. (A reduce step is like a GROUP BY clause in a SQL statement.)
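The two steps can be sketched in a few lines of plain Python. This is a toy, single-machine word count, not Hadoop itself: the document names and the in-memory "shuffle" grouping stand in for the input splits and distributed shuffle a real cluster would use.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (key, value) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def reduce_phase(key, values):
    """Reduce step: combine all the values that share a key into one answer."""
    return key, sum(values)

documents = ["big data big jobs", "big clusters"]

# The master "maps" each document; on a cluster each call would run on a worker node.
mapped = [pair for doc in documents for pair in map_phase(doc)]

# Shuffle: group the intermediate values by key before reducing.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

results = dict(reduce_phase(k, v) for k, v in groups.items())
print(results)  # {'big': 3, 'data': 1, 'jobs': 1, 'clusters': 1}
```

The shape is the important part: the map function sees one record at a time, and the reduce function sees all the values for one key, which is exactly what lets each step be farmed out to independent nodes.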
The power is in the parallelization (working multiple jobs at the same time) of the mapping step. You can sort through petabytes of data in hours instead of days, as would be the case for traditional database queries running SQL.
The objective of Hadoop is to take lots and lots of data and derive some set of answers, or results. This is done through a map/reduce process in parallel. The data is “mapped” according to some sorting algorithm and then “reduced” through an additional summary algorithm to derive a set of results. The magic is in the parallel part.
Many mapping jobs can be done at the same time across a network of computers, or nodes. The nodes are independent resources within a network of computer systems. By sharing the load, the job of sorting through massive amounts of data can be done quickly.
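Here is a minimal sketch of that load sharing, using Python threads as stand-ins for nodes. On a real cluster the "workers" would be separate machines, and the chunking would be done by the file system's block layout rather than list slicing.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """The work done by one node: count the words in its share of the data."""
    return sum(len(line.split()) for line in chunk)

lines = ["row %d with some data" % i for i in range(1000)]

# Split the big job into four smaller workloads, one per "node".
chunks = [lines[i::4] for i in range(4)]

# The mapping jobs run on all workers at once; a real cluster would
# distribute them across machines rather than threads in one process.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# The master assembles the partial answers into the final result.
total = sum(partial_counts)
print(total)  # 5000 words across all chunks
```

Because each worker touches only its own chunk, adding more nodes shrinks the wall-clock time without changing the logic.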
Pig and Pig Latin
Pig and its language, Pig Latin (you can’t accuse geeks of having no sense of humor), form a platform for analyzing large datasets. Pig was originally created at Yahoo! for access to Hadoop clusters and later moved to the Apache open-source community.
Pig Latin is the language used to access Pig’s runtime environment. It’s designed to make the work of creating MapReduce jobs easier: you don’t have to build your own map and reduce functions, but it is another language to learn.
The challenge for traditional database programmers who move to new technologies is that they have to learn new languages and paradigms, like Pig. They’ve been programming in SQL for years, and moving to more pure computer science models is a challenge. Enter the Hive.
Hive
Hive allows programmers comfortable with SQL to write Hive Query Language (HQL) queries against Hadoop clusters. By using a language very similar to SQL, Hive can translate SQL-style queries into Hadoop-speak, which makes Hadoop much more palatable to traditional RDBMS programmers.
Think of it as a translation engine. If a programmer doesn’t know how to program for Hadoop but knows how to use SQL to access data, Hive acts as that bridge, translating SQL-style queries into Hadoop jobs.
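To see why SQL fluency transfers so well, here is the kind of declarative query a Hive user would write, run here against Python's built-in sqlite3 purely for illustration. The table and column names are made up; the point is that in Hive the same GROUP BY query would be compiled into MapReduce jobs over files in HDFS instead of running against a local database.

```python
import sqlite3

# Hypothetical clickstream table, standing in for a Hive table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("home", 10), ("about", 3), ("home", 7)],
)

# The SQL programmer writes the familiar declarative query and lets the
# engine decide how to execute it -- which is exactly the bargain Hive offers.
rows = conn.execute(
    "SELECT page, SUM(visits) FROM clicks GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 17)]
```

Notice that the GROUP BY here plays the role of the reduce step described earlier: the engine groups rows by key and summarizes each group, just as a reducer would.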
Spark
Spark is an emerging platform that is also commonly run on top of HDFS. In addition to being able to leverage HDFS, Spark can read from HBase, Cassandra, and other inputs. Spark leverages grid computing for large-scale parallel processing and can hold working data in RAM, which provides ultra-fast access to data and compute resources for analysis.
Programmers can access Spark using Python, Scala, or Java. Spark can also be used in conjunction with libraries such as GraphX, for graph analytics, and MLlib, Spark’s machine learning library.
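A flavor of the chained, functional style Spark exposes can be sketched in plain Python without a Spark installation. The MiniRDD class below is a toy, in-memory stand-in invented for this illustration; real PySpark code would start from a SparkContext and call the same-named methods (map, filter, reduce) on a genuine RDD distributed across the cluster.

```python
from functools import reduce

class MiniRDD:
    """A toy, in-memory stand-in for a Spark RDD (illustration only)."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, yielding a new dataset.
        return MiniRDD(fn(x) for x in self.data)

    def filter(self, fn):
        # Keep only the elements for which fn is true.
        return MiniRDD(x for x in self.data if fn(x))

    def reduce(self, fn):
        # Combine all elements into a single result.
        return reduce(fn, self.data)

# Chained transformations, in the style PySpark exposes on real RDDs:
numbers = MiniRDD(range(10))
total = (numbers
         .filter(lambda n: n % 2 == 0)   # keep even numbers
         .map(lambda n: n * n)           # square them
         .reduce(lambda a, b: a + b))    # sum the squares
print(total)  # 0 + 4 + 16 + 36 + 64 = 120
```

In real Spark the intermediate datasets can be cached in RAM across the cluster, which is what makes iterative workloads like machine learning so much faster than rereading files from disk on every pass.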