Document Databases in a Big Data Environment
Document databases are most useful when you have to produce a lot of reports that must be dynamically assembled from elements that change frequently.
Document databases commonly store their data as JSON, which is built on two basic structures. The first is a collection of name/value pairs, represented programmatically as objects, records, keyed lists, and so on. The second is an ordered list of values, represented programmatically as arrays, lists, or sequences. BSON is a binary serialization of JSON structures designed to increase performance and scalability.
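The two JSON structures map directly onto built-in types in most languages. The following is a minimal Python sketch using only the standard library; the field names and values are invented for illustration.

```python
import json

# A JSON object (name/value pairs) that contains a JSON array (ordered values).
# The field names and values are made up for this example.
text = '{"title": "Q3 sales report", "tags": ["sales", "quarterly"], "pages": 12}'

report = json.loads(text)

# The object becomes a dict, Python's keyed structure.
print(type(report).__name__)          # dict
# The array becomes a list, and element order is preserved.
print(type(report["tags"]).__name__)  # list
print(report["tags"][0])              # sales

# Serializing to text and parsing again yields an equal structure.
round_trip = json.loads(json.dumps(report))
print(round_trip == report)           # True
```

BSON carries the same two structures in a binary encoding; it is not shown here because it requires a third-party library.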
MongoDB for big data
MongoDB is the project name for the hu(mongo)us database system. It is maintained by a company called 10gen as open source and is freely available under the GNU AGPL v3.0 license. Commercial licenses with full support are available from 10gen.
MongoDB is composed of databases containing collections. A collection is composed of documents, and each document is composed of fields. Just as in relational databases, you can index a collection.
Doing so increases the performance of data lookup. A MongoDB query returns something called a cursor, which serves as a pointer to the matching data rather than the data itself. This is a very useful capability because it offers the option of counting or classifying the results without extracting them first. Natively, MongoDB stores documents as BSON, the binary encoding of JSON documents.
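The relationship between collections, documents, indexes, and cursors can be sketched with a toy in-memory model. This is not MongoDB's API or its internal design; the class and method names (Collection, Cursor, create_index, find) are hypothetical and chosen only to mirror the vocabulary above.

```python
from collections import defaultdict


class Cursor:
    """Lazy handle over matching documents (toy stand-in for a database cursor)."""

    def __init__(self, docs, positions):
        self._docs = docs
        self._positions = positions

    def count(self):
        # Counting needs only the matching positions, not the documents.
        return len(self._positions)

    def __iter__(self):
        # Documents are handed out one at a time, only when iterated.
        for pos in self._positions:
            yield self._docs[pos]


class Collection:
    """A list of documents plus optional per-field indexes."""

    def __init__(self):
        self._docs = []
        self._indexes = {}  # field name -> {value -> [positions]}

    def create_index(self, field):
        index = defaultdict(list)
        for pos, doc in enumerate(self._docs):
            if field in doc:
                index[doc[field]].append(pos)
        self._indexes[field] = index

    def insert(self, doc):
        pos = len(self._docs)
        self._docs.append(doc)
        # Keep every existing index up to date with the new document.
        for field, index in self._indexes.items():
            if field in doc:
                index[doc[field]].append(pos)

    def find(self, field, value):
        if field in self._indexes:
            # Indexed lookup: jump straight to the matching positions.
            positions = self._indexes[field].get(value, [])
        else:
            # No index: scan every document.
            positions = [i for i, d in enumerate(self._docs) if d.get(field) == value]
        return Cursor(self._docs, list(positions))


articles = Collection()
articles.create_index("author")
articles.insert({"title": "Intro to BSON", "author": "ana"})
articles.insert({"title": "Sharding basics", "author": "raj"})
articles.insert({"title": "Cursors", "author": "ana"})

cursor = articles.find("author", "ana")
print(cursor.count())                    # 2
print([doc["title"] for doc in cursor])  # ['Intro to BSON', 'Cursors']
```

The point of the sketch is that the count comes from the index positions alone; the documents are touched only when the cursor is iterated.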
MongoDB is also an ecosystem consisting of the following elements:
High-availability and replication services for scaling across local and wide-area networks.
A grid-based file system (GridFS), enabling the storage of large objects by dividing them among multiple documents.
MapReduce to support analytics and aggregation of different collections/documents.
A sharding service that distributes a single database across a cluster of servers in a single or in multiple data centers. The service is driven by a shard key, which is used to distribute documents intelligently across multiple instances.
A querying service that supports ad hoc queries, distributed queries, and full-text search.
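To make the role of the shard key in the list above concrete, here is a deliberately simplified Python sketch that routes each document to a server by hashing its shard key value. It assumes a fixed set of three shards with hypothetical names and uses plain hash-modulo placement; a real sharded deployment manages placement in a more sophisticated way, so treat this as an illustration of the idea only.

```python
import hashlib
from collections import Counter

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical server names


def shard_for(doc, shard_key, shards=SHARDS):
    """Pick a shard from a stable hash of the document's shard key value."""
    value = str(doc[shard_key]).encode("utf-8")
    digest = hashlib.sha256(value).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to a shard.
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


# 1,000 made-up documents keyed by user_id.
docs = [{"user_id": n, "event": "login"} for n in range(1000)]

# Count how many documents land on each shard.
placement = Counter(shard_for(d, "user_id") for d in docs)
print(placement)

# Documents with the same shard key value always land on the same shard,
# regardless of their other fields.
same = shard_for({"user_id": 42, "event": "login"}, "user_id") == \
       shard_for({"user_id": 42, "event": "logout"}, "user_id")
print(same)  # True
```

Because the placement depends only on the shard key value, a query that includes the key can be sent to a single server instead of to all of them.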
Effective MongoDB implementations include
High-volume content management
CouchDB for big data
Another very popular nonrelational database is CouchDB. Like MongoDB, CouchDB is open source. It is maintained by the Apache Software Foundation and is made available under the Apache License v2.0. Unlike MongoDB, CouchDB was designed to mimic the web in all respects.
For example, CouchDB is resilient to network dropouts and will continue to operate beautifully in areas where network connectivity is spotty. It is also at home on a smartphone or in a data center. This all comes with a few trade-offs. Because of the underlying web mimicry, CouchDB has high latency, which results in a preference for local data storage.
CouchDB is not well suited to smaller implementations. You must determine whether these trade-offs can be ignored as you begin your big data implementation.
CouchDB databases are composed of documents consisting of fields and attachments, along with metadata that describes each document and is automatically maintained by the system. The underlying technology features all ACID capabilities. The advantage of CouchDB over a relational database is that the data is packaged and ready for manipulation or storage rather than scattered across rows and tables.
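The idea of a self-contained document with system-maintained metadata can be sketched as follows. This is a toy in-memory store, not CouchDB's HTTP API: the class name ToyDocumentStore is hypothetical, and the integer revision counter is a simplification of CouchDB's _rev strings. The _id and _rev field names follow CouchDB's conventions.

```python
import uuid


class ToyDocumentStore:
    """In-memory stand-in for a document database that manages metadata itself."""

    def __init__(self):
        self._docs = {}

    def save(self, doc):
        stored = dict(doc)
        # Assign an identifier if the caller did not supply one.
        stored.setdefault("_id", uuid.uuid4().hex)
        previous = self._docs.get(stored["_id"])
        # The revision is maintained by the store, not by the caller.
        stored["_rev"] = 1 if previous is None else previous["_rev"] + 1
        self._docs[stored["_id"]] = stored
        return stored


store = ToyDocumentStore()

# The whole invoice, including its line items, travels as one document
# instead of being split across an invoices table and a line_items table.
first = store.save({
    "_id": "invoice-1001",
    "customer": "Acme",
    "lines": [{"sku": "A1", "qty": 2}],
})
second = store.save({
    "_id": "invoice-1001",
    "customer": "Acme",
    "lines": [{"sku": "A1", "qty": 3}],
})

print(first["_rev"], second["_rev"])  # 1 2
```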
CouchDB is also an ecosystem with the following capabilities:
Compaction: The databases are compressed to eliminate wasted space when a certain level of emptiness is reached. This improves performance and makes persistent storage more efficient.
View model: A mechanism for filtering, organizing, and reporting on data using a set of definitions that are stored as documents in the database. Each database can have many views, so you can create many different ways of representing the data you have sliced and diced.
Replication and distributed services: Document storage is designed to provide bidirectional replication. Partial replicas can be maintained to support criteria-based distribution or migration to devices with limited connectivity. Native replication is peer based, but you can implement Master/Slave, Master/Master, and other types of replication modalities.
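The view model in the list above follows a map-and-reduce pattern: a map function picks out the documents of interest and emits key/value rows, the rows are sorted by key, and an optional reduce step summarizes them. CouchDB view functions are typically written in JavaScript and stored in the database; the Python below is only a sketch of the same idea, with hypothetical function names and made-up documents.

```python
from collections import defaultdict

# Made-up documents of mixed types living in one database.
docs = [
    {"_id": "a", "type": "post", "author": "ana", "words": 420},
    {"_id": "b", "type": "post", "author": "raj", "words": 900},
    {"_id": "c", "type": "comment", "author": "ana", "words": 35},
    {"_id": "d", "type": "post", "author": "ana", "words": 610},
]


def posts_by_author(doc):
    """Map step: emit (key, value) rows only for documents the view cares about."""
    if doc.get("type") == "post":
        yield doc["author"], doc["words"]


def build_view(documents, map_fn):
    """Run the map function over every document and sort the rows by key."""
    rows = [row for doc in documents for row in map_fn(doc)]
    return sorted(rows, key=lambda row: row[0])


def reduce_sum(rows):
    """Reduce step: total the values for each key."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


rows = build_view(docs, posts_by_author)
print(rows)              # [('ana', 420), ('ana', 610), ('raj', 900)]
print(reduce_sum(rows))  # {'ana': 1030, 'raj': 900}
```

Defining a second map function over the same documents, for example one that emits rows for comments, gives a second view without changing the stored data.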
Effective CouchDB implementations include
High-volume content management
Scaling from smartphone to data center
Applications with limited or slow network connectivity