Layer 2 of the Big Data Stack: Operational Databases

By Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman

At the core of any big data environment, and layer 2 of the big data stack, are the database engines containing the collections of data elements relevant to your business. These engines need to be fast, scalable, and rock solid. They are not all created equal, and certain big data environments will fare better with one engine than another, or more likely with a mix of database engines.

For example, although it is possible to use relational database management systems (RDBMSs) for all your big data implementations, it is not practical to do so because of performance, scale, or even cost. A number of different database technologies are available, and you must take care to choose wisely.

No single right choice exists regarding database languages. Although SQL is the most prevalent database query language in use today, other languages may provide a more effective or efficient way of solving your big data challenges. It is useful to think of the engines and languages as tools in an “implementer’s toolbox.” Your job is to choose the right tool.

For example, if you use a relational model, you will probably use SQL to query it. However, you can also use alternative languages like Python or Java. It is very important to understand what types of data can be manipulated by the database and whether it supports true transactional behavior. Database designers describe this behavior with the acronym ACID, which stands for:

  • Atomicity: A transaction is “all or nothing” when it is atomic. If any part of the transaction or the underlying system fails, the entire transaction fails.

  • Consistency: Only transactions with valid data will be performed on the database. If the data is corrupt or improper, the transaction will not complete and the data will not be written to the database.

  • Isolation: Multiple, simultaneous transactions will not interfere with each other. Each valid transaction executes to completion as though it were running alone, so concurrent transactions leave the data in the same state as if they had been processed one after another.

  • Durability: After the data from the transaction is written to the database, it stays there “forever.”
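The atomicity and consistency properties above can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and a hypothetical accounts table; the table, the names, and the transfer helper are invented for illustration and do not come from any particular product. A CHECK constraint rejects invalid data, and the failed transaction is rolled back as a whole.

```python
import sqlite3

# Hypothetical schema for illustration: an in-memory accounts table whose
# CHECK constraint forbids negative balances (the "valid data" rule).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one transaction; return True on success."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            # Credit first, so a later failure has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Debit second; an overdraft violates the CHECK constraint.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

In this sketch, transfer(conn, "alice", "bob", 30) should succeed and change both balances, while a transfer of 500 should fail on the debit and also undo the credit that had already been applied to the other account. That "all or nothing" outcome is what atomicity means in practice.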

    Engine      Query Language           MapReduce   Data Types            Transactions     Examples
    ----------  -----------------------  ----------  --------------------  ---------------  ------------------------
    Relational  SQL, Python, C           No          Typed                 ACID             PostgreSQL, Oracle, DB/2
    Columnar    Ruby                     Hadoop      Predefined and typed  Yes, if enabled  HBase
    Graph       Walking, Search, Cypher  No          Untyped               ACID             Neo4J
    Document    Commands                 JavaScript  Typed                 No               MongoDB, CouchDB
    Key-value   Lucene, Commands         JavaScript  BLOB, semityped       No               Riak, Redis
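As one way to work with the comparison above when choosing an engine, the rows can be held in a small data structure and filtered by requirement. This is only a sketch: the Engine class and the two helper functions are hypothetical names, and the values are copied from the table rather than taken from any vendor documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    """One row of the comparison table (field names are illustrative)."""
    kind: str
    query_languages: tuple
    mapreduce: str
    data_types: str
    transactions: str
    examples: tuple


ENGINES = [
    Engine("Relational", ("SQL", "Python", "C"), "No", "Typed", "ACID",
           ("PostgreSQL", "Oracle", "DB/2")),
    Engine("Columnar", ("Ruby",), "Hadoop", "Predefined and typed",
           "Yes, if enabled", ("HBase",)),
    Engine("Graph", ("Walking", "Search", "Cypher"), "No", "Untyped", "ACID",
           ("Neo4J",)),
    Engine("Document", ("Commands",), "JavaScript", "Typed", "No",
           ("MongoDB", "CouchDB")),
    Engine("Key-value", ("Lucene", "Commands"), "JavaScript",
           "BLOB, semityped", "No", ("Riak", "Redis")),
]


def engines_with_acid(engines=ENGINES):
    """Return the engine kinds the table lists as fully ACID."""
    return [e.kind for e in engines if e.transactions == "ACID"]


def engines_with_mapreduce(engines=ENGINES):
    """Return the engine kinds the table lists with a MapReduce option."""
    return [e.kind for e in engines if e.mapreduce != "No"]
```

A filter such as engines_with_acid() narrows the candidates to the engines the table marks as ACID, which is a reasonable first cut when transactional behavior is a hard requirement.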

After you understand your requirements and understand what data you’re gathering, where to put it, and what to do with it, you need to organize it so that it can be consumed for analytics, reporting, or specific applications.