Document Databases in a Big Data Environment - dummies

Document Databases in a Big Data Environment

By Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman

Big data projects use two kinds of document databases. One is often described as a repository for full document-style content. The other stores document components, either permanently as static entities or for dynamic assembly into a complete document. The structure of the documents and their parts is provided by JavaScript Object Notation (JSON), Binary JSON (BSON), or both.

Document databases are most useful when you have to produce a lot of reports that must be assembled dynamically from elements that change frequently.
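To make dynamic assembly concrete, here is a minimal sketch in Python. It is not tied to any particular database: the component store is a plain dictionary, and the names `components`, `report_template`, and `assemble` are made up for illustration. The point is only that a report lists the parts it needs and is built from whatever those parts contain at read time.

```python
# Toy in-memory "component store": each component is a small JSON-like document.
# All names here are hypothetical and chosen for illustration only.
components = {
    "summary": {"type": "text", "body": "Sales rose in Q2."},
    "sales_table": {"type": "table", "rows": [["Q1", 100], ["Q2", 130]]},
}

# The report document only lists the ids of the components it is built from.
report_template = {"title": "Quarterly report", "parts": ["summary", "sales_table"]}


def assemble(template, store):
    """Build the full report at read time from the current components."""
    return {
        "title": template["title"],
        "parts": [store[part_id] for part_id in template["parts"]],
    }


first = assemble(report_template, components)

# One component is replaced; the next assembly picks up the new version.
components["summary"] = {"type": "text", "body": "Sales rose 30% in Q2."}
second = assemble(report_template, components)

print(first["parts"][0]["body"])
print(second["parts"][0]["body"])
```

Because the template stores only component ids, a frequently changing element is updated in one place and every report that uses it reflects the change the next time it is assembled.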

At its core, JSON is a data-interchange format based on a subset of the JavaScript programming language. Although derived from a programming language, it is plain text and very easy for people to read and write. It is also easy for computers to parse and generate. JSON has two basic structures, and they are supported by nearly all modern programming languages.

The first basic structure is a collection of name/value pairs, represented programmatically as objects, records, keyed lists, and so on. The second basic structure is an ordered list of values, represented programmatically as arrays, lists, or sequences. BSON is a binary serialization of JSON structures designed to improve performance by making documents quick to encode, decode, and scan.
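The two structures are easy to see in code. The following sketch uses Python's standard `json` module; the sample document and its field names are invented for illustration. A JSON object arrives as a Python dictionary, and a JSON array arrives as a Python list.

```python
import json

# A small JSON text: one object (name/value pairs) containing one array.
# The field names are made up for this example.
text = '{"title": "Q2 report", "tags": ["sales", "draft"], "pages": 12}'

doc = json.loads(text)

# The object becomes a dict; the array becomes a list that keeps its order.
print(type(doc).__name__)
print(type(doc["tags"]).__name__)
print(doc["tags"][0])

# Serializing back to text and parsing again yields the same structure.
round_trip = json.dumps(doc)
print(json.loads(round_trip) == doc)
```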

MongoDB for big data

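MongoDB is the project name for the “hu(mongo)us database” system. It is maintained by MongoDB, Inc., the company known as 10gen until 2013. Versions released before October 2018 are freely available under the GNU AGPL v3.0 license; later versions are published under the Server Side Public License (SSPL). Commercial licenses with full support are available from the company.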

MongoDB is composed of databases containing “collections.” A collection is composed of “documents,” and each document is composed of fields. Just as in relational databases, you can index a collection, which speeds up data lookup.

A MongoDB query returns something called a “cursor,” which serves as a pointer to the matching data rather than the data itself. This is a very useful capability because it offers the option of counting or classifying the data without extracting it. Natively, MongoDB supports BSON, the binary implementation of JSON documents.
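The ideas of an indexed collection and a cursor can be sketched without a running server. The class below is a toy, in-memory stand-in written in plain Python; it is not the MongoDB API, and `ToyCollection` and its methods are hypothetical names. A real deployment would go through a driver instead. The sketch shows an index answering a count without reading any document, and a cursor that fetches documents only as they are iterated.

```python
from collections import defaultdict


class ToyCollection:
    """In-memory stand-in for a collection; NOT the MongoDB API."""

    def __init__(self):
        self.documents = []
        self.indexes = {}  # field name -> {value -> [positions in documents]}

    def create_index(self, field):
        """Index existing documents on one field."""
        index = defaultdict(list)
        for position, document in enumerate(self.documents):
            if field in document:
                index[document[field]].append(position)
        self.indexes[field] = index

    def insert(self, document):
        """Store a document and keep every index up to date."""
        position = len(self.documents)
        self.documents.append(document)
        for field, index in self.indexes.items():
            if field in document:
                index[document[field]].append(position)

    def count(self, field, value):
        """Answer a count from the index alone; no document is read."""
        return len(self.indexes[field].get(value, []))

    def find(self, field, value):
        """Return a lazy cursor: documents are fetched only when iterated."""
        positions = self.indexes[field].get(value, [])
        return (self.documents[position] for position in positions)


articles = ToyCollection()
articles.create_index("author")
articles.insert({"title": "Intro to BSON", "author": "ana"})
articles.insert({"title": "Sharding basics", "author": "raj"})
articles.insert({"title": "Cursors", "author": "ana"})

print(articles.count("author", "ana"))  # counted without extracting documents
cursor = articles.find("author", "ana")
print(next(cursor)["title"])  # only now is the first match fetched
```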

MongoDB is also an ecosystem consisting of the following elements:

  • High-availability and replication services for scaling across local and wide-area networks.

  • A grid-based file system (GridFS), enabling the storage of large objects by dividing them among multiple documents.

  • MapReduce to support analytics and aggregation of different collections/documents.

  • A sharding service that distributes a single database across a cluster of servers in one or more data centers. The service is driven by a shard key, which is used to distribute documents intelligently across multiple instances.

  • A querying service that supports ad hoc queries, distributed queries, and full-text search.
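Two of the elements above, sharding and MapReduce, can be illustrated together with a small Python sketch. It is a simplification under stated assumptions: hash-based placement is only one possible way to use a shard key, the cluster size `SHARD_COUNT` and the function names are hypothetical, and the shards are plain lists rather than servers. Each document is routed by its shard key, each shard counts its own documents (map), and the partial counts are merged (reduce).

```python
import hashlib
from collections import Counter

SHARD_COUNT = 3  # hypothetical cluster size


def shard_for(shard_key_value, shard_count=SHARD_COUNT):
    """Pick a shard from a stable hash of the shard key value (an assumption)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, shard_key, shard_count=SHARD_COUNT):
    """Place each document on the shard chosen by its shard key."""
    shards = [[] for _ in range(shard_count)]
    for document in documents:
        shards[shard_for(document[shard_key], shard_count)].append(document)
    return shards


def map_phase(shard, field):
    """Map: one shard counts its own documents per value of the field."""
    return Counter(document[field] for document in shard)


def reduce_phase(partial_counts):
    """Reduce: merge the per-shard counts into one total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


posts = [
    {"user_id": 1, "topic": "sports"},
    {"user_id": 2, "topic": "music"},
    {"user_id": 3, "topic": "sports"},
    {"user_id": 4, "topic": "travel"},
    {"user_id": 1, "topic": "music"},
]

shards = distribute(posts, "user_id")
totals = reduce_phase(map_phase(shard, "topic") for shard in shards)
print(dict(totals))
```

Because the shard is derived from the key alone, every document with the same `user_id` lands on the same shard, and the aggregate is the same no matter how the documents were spread out.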

Effective MongoDB implementations include

  • High-volume content management

  • Social networking

  • Archiving

  • Real-time analytics

CouchDB for big data

Another very popular nonrelational database is CouchDB. Like MongoDB, CouchDB is open source. It is maintained by the Apache Software Foundation and is made available under the Apache License v2.0. Unlike MongoDB, CouchDB was designed to mimic the web in all respects.

For example, CouchDB is resilient to network dropouts and continues to operate in areas where network connectivity is spotty. It is equally at home on a smartphone or in a data center. This all comes with a few trade-offs. Because of the underlying web mimicry, CouchDB has high latency, which results in a preference for local data storage.

CouchDB is not well suited to smaller implementations. You must determine whether these trade-offs can be ignored as you begin your big data implementation.

CouchDB databases are composed of documents consisting of fields and attachments as well as a “description” of the document in the form of metadata that is automatically maintained by the system. The underlying storage technology provides all ACID (atomicity, consistency, isolation, durability) properties. The advantage of CouchDB over a relational database is that the data is packaged and ready for manipulation or storage rather than scattered across rows and tables.
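The system-maintained metadata can be pictured with a toy example. In CouchDB, each document carries an `_id` and a revision marker, `_rev`, and an update has to name the revision it is based on. The Python sketch below imitates that idea in memory. Apart from the two field names, everything in it is hypothetical (`ToyDocumentStore`, `ConflictError`, the way the revision string is computed), and it does not talk to a CouchDB server.

```python
import hashlib
import json


class ConflictError(Exception):
    """Raised when an update is based on a stale revision."""


class ToyDocumentStore:
    """In-memory imitation of revision-checked documents; not CouchDB itself."""

    def __init__(self):
        self.documents = {}

    def put(self, doc_id, fields, rev=None):
        """Create or update a document; the caller must name the current revision."""
        current = self.documents.get(doc_id)
        current_rev = current["_rev"] if current else None
        if rev != current_rev:
            raise ConflictError(f"{doc_id}: expected {current_rev}, got {rev}")
        generation = int(current_rev.split("-")[0]) + 1 if current_rev else 1
        # Simplified revision id: generation number plus a hash of the fields.
        body = json.dumps(fields, sort_keys=True).encode("utf-8")
        new_rev = f"{generation}-{hashlib.md5(body).hexdigest()}"
        self.documents[doc_id] = {"_id": doc_id, "_rev": new_rev, **fields}
        return new_rev

    def get(self, doc_id):
        """Return a copy of the stored document, metadata included."""
        return dict(self.documents[doc_id])


store = ToyDocumentStore()
rev1 = store.put("invoice-1", {"customer": "Acme", "total": 120})
rev2 = store.put("invoice-1", {"customer": "Acme", "total": 150}, rev=rev1)

try:
    store.put("invoice-1", {"total": 0}, rev=rev1)  # based on a stale revision
    stale_rejected = False
except ConflictError:
    stale_rejected = True

print(stale_rejected)
print(store.get("invoice-1")["total"])
```

The whole invoice travels as one packaged unit together with its metadata, which is the contrast with rows scattered across tables that the paragraph above draws.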

CouchDB is also an ecosystem with the following capabilities:

  • Compaction: The databases are compressed to eliminate wasted space when a certain level of emptiness is reached. This helps performance and efficiency for persistence.

  • View model: A mechanism for filtering, organizing, and reporting on data utilizing a set of definitions that are stored as documents in the database. Each database can have many views, so you can represent the data you have “sliced and diced” in many different ways.

  • Replication and distributed services: Document storage is designed to provide bidirectional replication. Partial replicas can be maintained to support criteria-based distribution or migration to devices with limited connectivity. Native replication is peer based, but you can implement Master/Slave, Master/Master, and other types of replication modalities.
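The view model described above can be sketched in a few lines. CouchDB keeps view definitions in the database and typically expresses them as JavaScript map functions; the Python below only imitates the idea in memory. The function names `by_customer`, `build_view`, and `totals_per_key`, and the sample documents, are hypothetical. A map function emits key/value rows for each document it cares about, and the view is those rows sorted by key.

```python
def by_customer(document):
    """Hypothetical map function: emit (key, value) rows for one document."""
    if document.get("type") == "invoice":
        yield document["customer"], document["total"]


def build_view(documents, map_function):
    """Run the map function over every document and sort the rows by key."""
    rows = []
    for document in documents:
        rows.extend(map_function(document))
    return sorted(rows, key=lambda row: row[0])


def totals_per_key(rows):
    """A simple reduce step: sum the emitted values for each key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


documents = [
    {"type": "invoice", "customer": "Zed", "total": 40},
    {"type": "invoice", "customer": "Acme", "total": 120},
    {"type": "note", "text": "call Acme on Monday"},
    {"type": "invoice", "customer": "Acme", "total": 30},
]

rows = build_view(documents, by_customer)
print(rows)
print(totals_per_key(rows))
```

A second map function over the same documents, for example one keyed on document type, would give another view of the same database without changing the stored data.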

Effective CouchDB implementations include

  • High-volume content management

  • Scaling from smartphone to data center

  • Applications with limited or slow network connectivity