Managing Big Data Technologies in a Hybrid Cloud

By Judith Hurwitz, Marcia Kaufman, Fern Halper, Daniel Kirsch

The term big data is used often in the world of hybrid cloud technology because of the ongoing need to process increasing amounts of data. The key fact about big data is that it exists at the tipping point of the workarounds that organizations have historically put in place to manage large volumes of complex data. Big data technologies allow organizations to analyze and use this data effectively.

Big data characteristics

Big data generally has three characteristics — volume, variety, and velocity:

  • Volume: Big data is big in volume. It generally refers to at least multiple terabytes of data. Many big data implementations are looking to analyze petabytes of information.

    Name      Value
    Byte      10^0 bytes
    Gigabyte  10^9 bytes
    Terabyte  10^12 bytes
    Petabyte  10^15 bytes
    Exabyte   10^18 bytes
  • Variety: Big data comes in different shapes and sizes. It includes these types of data:

    • Structured data is the typical kind of data that analysts are used to dealing with. It includes revenue and number of sales — the type of data you think about including in a database. Structured data is also being produced in new ways in products such as sensors and RFID tags.

    • Semistructured data has some structure to it but not in the way you think about tables in a database. It includes EDI formats and XML.

    • Unstructured data includes text, images, and audio, including any document, e-mail message, tweet, or blog internal to a company or on the Internet. Unstructured data is commonly estimated to account for about 80 percent of all data.

  • Velocity: This is the speed at which the data moves. Think about sensors capturing data every millisecond or data streams output from medical equipment. Big data often comes at you in a stream, so it has a real-time nature associated with it.

The cloud is an ideal place for big data because of its scalable storage, compute power, and elastic resources. The cloud model is large-scale distributed computing, and a number of frameworks and technologies have emerged to support this model, including the following:

  • Apache Hadoop: An open source distributed computing platform written in Java. It is a software library that enables distributed processing of large data sets across clusters of computers. At its core is a distributed file system, the Hadoop Distributed File System (HDFS), which spreads data across the pool of computers in the cluster. Hadoop was designed to deal with large amounts of complex data, which can be structured, unstructured, or semistructured. Hadoop can run across many servers that don’t share memory or disk.

  • MapReduce: A software framework introduced by Google to support distributed computing on large sets of data. It’s at the heart of what Hadoop is doing with big data and big data analytics. It’s designed to take advantage of cloud resources. This computing is done across numerous computers; the group of computers is called a cluster, and each computer in it is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all the intermediate values that share the same key.
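
The map and reduce functions described above can be sketched with a word count, a commonly used illustration. This is a minimal single-process sketch, not Hadoop's or Google's API: the names map_word_count, reduce_word_count, and run_mapreduce are hypothetical, and the grouping step stands in for the work a real framework distributes across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # group values by intermediate key
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Two tiny "documents" keyed by a made-up document ID.
documents = [("doc1", "big data is big"), ("doc2", "data moves fast")]
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
print(word_counts)
```

Because each call to the map function looks at only one input pair and each call to the reduce function looks at only one key, a framework is free to run many of these calls on different nodes at the same time.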

Big data databases

One important appeal of Hadoop is that it can handle different types of data. Parallel database management systems, by comparison, have been on the market for decades. They can support parallel execution because most tables are partitioned over the nodes in a cluster, and they can translate SQL commands into a plan that is divided across the nodes in the cluster. However, they mostly deal with structured data because it’s hard to fit unstructured, freeform data into the columns and rows in a relational model.
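
To make the partitioning idea concrete, the following sketch shows one way a query such as SELECT region, SUM(revenue) FROM sales GROUP BY region could be divided across nodes. It is an illustration under stated assumptions, not how any particular product works: the three-node cluster, the sales rows, and the function names are all hypothetical, and hash partitioning is only one of several partitioning schemes.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def node_for(key: str) -> int:
    """Pick a node for a row by hashing its key (CRC32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def partition(rows):
    """Spread the rows of one table over the nodes of the cluster."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row["order_id"])].append(row)
    return nodes


def local_sum_by_region(rows):
    """The part of the plan each node runs on its own partition."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["revenue"]
    return dict(totals)


def merge(partials):
    """The final step: combine the partial results from every node."""
    totals = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            totals[region] += subtotal
    return dict(totals)


# Made-up structured data: every row has the same columns.
sales = [
    {"order_id": "A1", "region": "east", "revenue": 100.0},
    {"order_id": "A2", "region": "west", "revenue": 25.0},
    {"order_id": "A3", "region": "east", "revenue": 50.0},
    {"order_id": "A4", "region": "west", "revenue": 50.0},
]

node_tables = partition(sales)
result = merge(local_sum_by_region(table) for table in node_tables)
print(result)
```

The sketch works only because every row has the same fixed columns, which is the limitation the paragraph above describes for unstructured data.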

Hadoop has helped start a movement that has been called NoSQL, meaning not only SQL. The term refers to a set of technologies that is different from relational database systems. One major difference is that they typically don’t use SQL as their primary query language. They are also designed for distributed data stores.

NoSQL doesn’t mean that people should not be using SQL. Rather, the idea is that, depending on what your problem is, relational databases and NoSQL databases can coexist in an organization. There are numerous examples of these kinds of databases, including the following:

  • Apache Cassandra: An open source distributed data management system originally developed by Facebook. It has no stringent structure requirements, so it can handle many different types of data. Experts claim it excels at high-volume, real-time transaction processing. Other open source NoSQL databases include MongoDB, Apache CouchDB, and Apache HBase.

  • Amazon SimpleDB: Amazon likens this database to a spreadsheet in that it has columns and rows with attributes and items stored in each. Unlike a spreadsheet, however, each cell can have multiple values, and each item can have its own set of associated attributes. Amazon then automatically indexes the data. Amazon has also announced Amazon DynamoDB as a way to bring big data NoSQL to the cloud.

  • Google Bigtable: This distributed storage system is sort of like one big table. Because tables can be large, they’re split at the row boundaries into tablets, each of which might be a few hundred megabytes or so. MapReduce is often used for generating and modifying data stored in Bigtable.
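
The spreadsheet-like model described for Amazon SimpleDB above, where a cell can hold several values and each item can have its own attributes, can be pictured with a small in-memory sketch. This is a toy model only and does not use or imitate the actual Amazon API: the ItemStore class, its put and find methods, and the sample items are all hypothetical.

```python
from collections import defaultdict


class ItemStore:
    """Toy in-memory model of items with multi-valued attributes."""

    def __init__(self):
        # item name -> attribute name -> set of values (a "cell" can hold several)
        self.items = defaultdict(lambda: defaultdict(set))
        # (attribute, value) -> item names, maintained automatically on every put
        self.index = defaultdict(set)

    def put(self, item_name, attribute, value):
        """Add one value to an item's attribute and update the index."""
        self.items[item_name][attribute].add(value)
        self.index[(attribute, value)].add(item_name)

    def find(self, attribute, value):
        """Look up the items holding a value by using the index only."""
        return sorted(self.index.get((attribute, value), set()))


store = ItemStore()
store.put("item1", "color", "red")
store.put("item1", "color", "blue")    # second value in the same cell
store.put("item1", "size", "M")
store.put("item2", "color", "red")
store.put("item2", "material", "wool")  # attribute that item1 doesn't have

print(store.find("color", "red"))
```

No schema is declared up front: an attribute exists for an item only once a value is stored for it, which is what lets each item carry its own set of attributes.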