Alternative Big Data Solutions - dummies

By Lillian Pierson

Looking past Hadoop, you can see alternative big data solutions on the horizon. These solutions make it possible to work with big data in real-time or to use alternative database technologies to handle and process it. Here, you’re introduced to the real-time processing frameworks, then the Massively Parallel Processing (MPP) platforms, and finally the NoSQL databases that allow you to work with big data outside of the Hadoop environment.

You should be aware of ACID compliance, short for Atomicity, Consistency, Isolation, and Durability compliance. ACID compliance is a standard that guarantees accurate and reliable database transactions.
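
To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names in it, and the transfer helper are hypothetical and exist only for illustration. The helper credits one account and then debits another inside a single transaction; if the debit would violate the balance rule, the credit is undone as well.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failed debit leaves something to roll back.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False

# Hypothetical table: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer leaves both balances exactly as they were after the first one, which is the behavior atomicity promises.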

In big data solutions, most database systems are not ACID compliant, but this does not necessarily pose a major problem. That’s because most big data systems use Decision Support Systems (DSS) that batch process data before that data is read out. DSS are information systems that are used for organizational decision support. Because a non-transactional DSS doesn’t process transactions, it has no real need for ACID compliance.

Real-time processing frameworks

Sometimes you need to query big data streams in real-time, and Hadoop’s batch processing can’t do that. In these cases, use a real-time processing framework instead. A real-time processing framework is, as its name implies, a framework that can process data in real-time (or near real-time) as that data streams into the system. Essentially, real-time processing frameworks are the antithesis of the batch processing frameworks you see deployed in Hadoop.
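
The difference between the two styles can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Storm, Spark, or any other framework is implemented; the sensor_feed values and both function names are made up for the example. The batch function needs the whole dataset before it can answer, whereas the streaming function emits an updated result as each reading arrives.

```python
from collections import deque

def batch_average(readings):
    """Batch style: wait for the full dataset, then compute one answer."""
    return sum(readings) / len(readings)

def streaming_averages(readings, window=3):
    """Stream style: yield a fresh average over the last `window` readings
    every time a new reading arrives."""
    recent = deque(maxlen=window)  # old readings fall off automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical sensor readings, listed in arrival order.
sensor_feed = [10, 20, 30, 40, 50]

batch_result = batch_average(sensor_feed)
stream_results = list(streaming_averages(iter(sensor_feed)))

print(batch_result)    # one answer, available only at the end
print(stream_results)  # one answer per incoming reading
```

Because the streaming version keeps only a small window in memory, it can run over a feed that never ends, which a batch job cannot do.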

Real-time processing frameworks can be classified into the following two categories:

  • Frameworks that lower the overhead of MapReduce tasks to increase the overall time efficiency of the system: Solutions in this category include Apache Storm and Apache Spark for near–real-time stream processing.

  • Frameworks that deploy innovative querying methods to facilitate real-time querying of big data: Some solutions in this category include Google’s Dremel, Apache Drill, Shark for Apache Hive, and Cloudera’s Impala.

Real-time stream processing frameworks are useful in many industries, from stock and financial market analysis to e-commerce optimization, and from real-time fraud detection to optimized order logistics. Regardless of your industry, if your business is affected by real-time data streams generated by humans, machines, or sensors, a real-time processing framework can help you act on that data as it arrives and generate value for your organization.

Massively Parallel Processing (MPP) platforms

Massively Parallel Processing (MPP) platforms can be used instead of MapReduce as an alternative approach for distributed data processing. If your goal is to deploy parallel processing on a traditional data warehouse, then an MPP may be the perfect solution.

To understand how MPP compares to a standard MapReduce parallel processing framework, consider the following: MPP runs parallel computing tasks on costly, custom hardware, whereas MapReduce runs them on cheap commodity servers. Consequently, MPP processing capabilities can be cost-prohibitive. That said, MPP is quicker and easier to use than standard MapReduce jobs. That’s because MPP can be queried using Structured Query Language (SQL), whereas native MapReduce jobs are written in the more complicated Java programming language.
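
To get a feel for the ease-of-use gap, compare the same aggregation written both ways. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module as a stand-in for an SQL-queryable platform, and plain Python functions to mimic the map, shuffle, and reduce phases rather than Hadoop's actual Java API. The sales data and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales records: (region, amount).
sales = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 75)]

# SQL style: declare the result you want in a single statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

# MapReduce style: spell out each phase of the computation yourself.
def map_phase(record):
    """Emit one (key, value) pair per input record."""
    region, amount = record
    yield region, amount

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in sales for pair in map_phase(record)]
mr_totals = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

print(sql_totals)
print(mr_totals)
```

Both approaches produce the same totals per region, but the SQL version states the question in one line, while the MapReduce version requires you to write and wire together each phase.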

Well-known MPP vendors and products include the old-school Teradata platform, plus newer solutions like EMC’s Greenplum DCA, HP’s Vertica, IBM’s Netezza, and Oracle’s Exadata.

Introducing NoSQL databases

Traditional relational database management systems (RDBMSs) aren’t equipped to handle big data demands. That’s because traditional relational databases are designed to handle only relational datasets: data that is stored in clean rows and columns and can therefore be queried via Structured Query Language (SQL).

RDBMSs are not capable of handling unstructured and semi-structured data. Moreover, they simply don’t have the processing and handling capabilities needed to meet big data volume and velocity requirements.

This is where NoSQL comes in. NoSQL databases, like MongoDB, are non-relational, distributed database systems that were designed to rise to the big data challenge. NoSQL databases step out past the traditional relational database architecture and offer a much more scalable, efficient solution.

NoSQL systems facilitate non-SQL data querying of non-relational or schema-free, semi-structured and unstructured data. In this way, NoSQL databases are able to handle the structured, semi-structured, and unstructured data sources that are common in big data systems.

NoSQL offers four categories of non-relational databases: graph databases, document databases, key-value stores, and column family stores. Because NoSQL offers native functionality for each of these separate types of data structures, it offers very efficient storage and retrieval for most types of non-relational data. This adaptability and efficiency make NoSQL an increasingly popular choice for handling big data and for overcoming the processing challenges that come along with it.
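
The four categories differ mainly in the shape the data takes. As a rough sketch, the same small set of made-up user records can be modeled with plain Python data structures in each of the four shapes. This shows only the data models; it is not the API of MongoDB or any other NoSQL product, and every name and value in it is hypothetical.

```python
import json

# Key-value store: an opaque value fetched by a single key.
key_value = {"user:1": '{"name": "Ada", "city": "London"}'}

# Document store: self-describing records whose fields may differ.
documents = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL"]},  # no "city" field
]

# Column family store: row key -> column family -> columns.
column_family = {
    "user:1": {"profile": {"name": "Ada", "city": "London"},
               "activity": {"logins": 12}},
    "user:2": {"profile": {"name": "Grace"}},
}

# Graph database: nodes plus explicit, labeled relationships between them.
nodes = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
edges = [(1, "FOLLOWS", 2)]

# Each shape favors a different kind of lookup.
name_from_kv = json.loads(key_value["user:1"])["name"]
londoners = [d["name"] for d in documents if d.get("city") == "London"]
profile_names = {key: row["profile"]["name"] for key, row in column_family.items()}
followed_by_ada = [
    nodes[dst]["name"] for src, rel, dst in edges if src == 1 and rel == "FOLLOWS"
]

print(name_from_kv, londoners, profile_names, followed_by_ada)
```

Notice that the second document simply omits the city field. That kind of record-by-record variation is what schema-free means in practice.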

There is some debate about the meaning of the name NoSQL. Some argue that NoSQL stands for Not Only SQL, while others argue that the name represents Non-SQL databases. The argument is rather complex, and there is no real cut-and-dried answer. To keep things simple, just think of NoSQL as a class of non-relational database management systems that do not fall within the spectrum of RDBMSs that are queried using SQL.