What You Should Know about NoSQL to Get a Big Data Job - dummies

By Jason Williamson

A vast amount of information is stored on RDBMSs, but what about all the other big data technology you hear about? The huge amount of data and the need to access it quickly, as well as to store unstructured data, require an array of other systems that enable speed and agility. The advent of Not Only SQL (NoSQL) gave users a more flexible and scalable way to store and access data to accommodate the demands of big data.

Key-value-pair data stores

This system does not require a highly structured model like a relational system. The key-value-pair (KVP) system focuses on keys and the values they point to, allows for great flexibility, and can grow to a very large size without sacrificing performance. This ability is called scalability. Scaling, or adding millions or billions of items to a data store, can hurt performance in a traditional system. KVP stores that “scale well” can get very big and still respond quickly.

A key is an identifier that is used to find a value, the thing you want to store. Together they’re considered a pair.

Say you want to store user preferences like favorite fruit, car, color, and sport. To access that information, you would simply look up the user’s key, which could have been retrieved from a browser cookie, and get back the stored preferences.

The “system” in this case allows you to programmatically store and query the key-value pair. Querying a key simply means looking it up and getting the value. The KVP system offers enormous flexibility for a situation like this where you don’t want to restrict storage choices. When you need to store billions of items of data, a traditional RDBMS can perform poorly.
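To make the store-and-query idea concrete, here is a minimal sketch in Python that keeps the user preferences from the example above in an in-memory key-value store. The `KeyValueStore` class, the key format, and the preference values are all made up for illustration; a real KVP product has its own client API.

```python
class KeyValueStore:
    """A toy in-memory key-value store (illustrative only, not a real product API)."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Store (or overwrite) the value under the given key.
        self._items[key] = value

    def get(self, key, default=None):
        # Querying a key means looking it up and returning its value.
        return self._items.get(key, default)


store = KeyValueStore()

# Hypothetical key; in practice it might be read from a browser cookie.
user_key = "user:1001:prefs"

store.put(user_key, {
    "fruit": "mango",
    "car": "hatchback",
    "color": "green",
    "sport": "tennis",
})

prefs = store.get(user_key)
print(prefs["fruit"])  # mango

# A key that was never stored simply returns the default.
missing = store.get("user:9999:prefs")
```

Notice that the store never asks what shape the value has. That is the flexibility described above: you can add a new preference later without changing any schema.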

image0.png

KVP solutions for big data are designed to be highly scalable and resilient. Many of them keep data entirely in random access memory (RAM), so access is fast and a query doesn’t have to read from a physical device like a disk drive, which takes much longer.

Grid computing is a concept of spreading jobs across many computers to get the jobs done faster, as well as provide a high level of availability or fault tolerance.

Prevalent KVP implementations include the following:

  • Amazon DynamoDB: A KVP NoSQL data store offered as a cloud service from Amazon.

  • FoundationDB: A KVP NoSQL data store that ensures ACID transactions.

  • Memcached: A distributed (grid-based) key-value cache that resides in RAM. The related MemcacheDB project uses the same protocol but adds persistent storage.

  • Redis: A key-value cache with the capability to store all types of data — structured and unstructured. In the industry people refer to Redis as a data structure server.

  • Riak: An open-source NoSQL KVP store based on concepts from Amazon’s Dynamo paper.

Document-oriented databases

Document-oriented databases allow for the storage and retrieval of semistructured data, that is, data that’s somewhere between unstructured (like a tweet) and structured. Web pages and documents are good examples of semistructured data.

Whereas the RDBMSs are oriented around tables and keys, the document-oriented systems use a document paradigm. Instead of storing data in rows and columns, the document model defines information in a document and stores that information logically. This is a very flexible and simplified approach to data storage and retrieval.

Many of these NoSQL document databases store data in JSON format.
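As a rough illustration of what such a document looks like, the Python sketch below builds a customer record, serializes it to JSON, and reads it back. The field names and values are hypothetical, and no particular database is involved; only Python’s standard `json` module is used.

```python
import json

# Hypothetical customer document; the field names are illustrative only.
customer = {
    "_id": "cust-42",
    "name": "Ada Example",
    "emails": ["ada@example.com"],
    "orders": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}

# Serialize the document to a JSON string, as it might be stored or sent.
serialized = json.dumps(customer, indent=2)

# Parse it back into Python objects.
restored = json.loads(serialized)

# Related data (the orders) lives inside the document, so no join is needed.
total_items = sum(order["qty"] for order in restored["orders"])
print(total_items)  # 3
```

In a relational design, the orders would usually sit in a separate table linked by a key. Here they are nested inside the customer document itself.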

Popular document-oriented implementations include the following:

  • Cassandra: A part of the Apache open-source project, this is a distributed (grid-based) database system. Strictly speaking, it is a wide-column store rather than a document store.

  • CouchDB: An open-source document-oriented database system that has ACID capabilities.

  • MarkLogic: A commercially available document-oriented database system touted as enterprise ready. It is marketed as highly secure and reliable and is used by many Fortune 1000 companies for customer-facing processes.

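  • MongoDB: Perhaps the leading NoSQL database system that uses a document-oriented approach. Its source code is publicly available, but the server is licensed under the Server Side Public License (SSPL), and before that the AGPL, not the Apache license.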

Graph-oriented databases

This type of database uses concepts of nodes, edges, and properties to store information and relationships. Directed graphs are especially useful when thinking about complex relationships like schedules with multiple dependencies, or in a social network, where you need to store information about people and their connectedness.

Graph theory is the branch of mathematics that studies graphs: structures made of objects and the connections between them. A picture of a social network is an example of a graph. Nodes are people, and edges connect those people. Properties describe the nodes and the edges (the relationships).
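The nodes, edges, and properties idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the people, the relationship labels, and the `degrees_of_separation` helper are invented, and a real graph database has its own storage format and query language.

```python
from collections import deque

# Nodes are people; each node carries its own properties.
nodes = {
    "ana": {"city": "Lisbon"},
    "ben": {"city": "Oslo"},
    "cy": {"city": "Lima"},
    "dee": {"city": "Pune"},
    "eve": {"city": "Cairo"},  # not connected to anyone
}

# Edges connect two people; the property describes the relationship.
edges = [
    ("ana", "ben", {"relation": "colleague"}),
    ("ben", "cy", {"relation": "friend"}),
    ("cy", "dee", {"relation": "friend"}),
]

# Build an adjacency list, treating each connection as two-way.
adjacency = {name: [] for name in nodes}
for a, b, props in edges:
    adjacency[a].append((b, props))
    adjacency[b].append((a, props))


def degrees_of_separation(start, goal):
    """Breadth-first search: edges on the shortest path, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for neighbor, _props in adjacency[person]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


print(degrees_of_separation("ana", "dee"))  # 3
```

Questions such as “how are these two people connected?” follow the edges directly, which is the kind of lookup a graph-oriented store is built around.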

image1.png

This type of database storage is especially useful for sites like LinkedIn or Facebook.

One example of a popular graph technology is GraphDB, which is used to map special data relationships called RDF triples. Each triple is a subject-predicate-object statement, and the tool links these facts into a graph.
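To show the general shape of a triple, here is a small Python sketch that stores facts as (subject, predicate, object) tuples and matches them against a pattern. This is not GraphDB’s API or a real RDF library: the names are simplified strings (real RDF identifies things with IRIs), and the `match` helper is a hypothetical stand-in for a proper query language.

```python
# Each fact is a (subject, predicate, object) triple. Names are illustrative.
triples = [
    ("Alice", "worksAt", "Acme"),
    ("Bob", "worksAt", "Acme"),
    ("Alice", "knows", "Bob"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part of the pattern that is given."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Who works at Acme?
acme_staff = [s for s, _p, _o in match(predicate="worksAt", obj="Acme")]
print(acme_staff)  # ['Alice', 'Bob']
```

Leaving a part of the pattern as `None` acts as a wildcard, so the same helper can answer “what do we know about Alice?” with `match(subject="Alice")`.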

Column-oriented databases

These systems are distributed column-oriented data stores. They orient information not in rows, like traditional RDBMSs, but in columns. This allows for the natural grouping of data, which speeds up analysis.

Traditional RDBMSs organize information in a very defined way called table normalization, which avoids repeating data and breaks information down into atomic pieces. So, to assemble a report, a programmer has to link these atomic elements, using SQL, into groupings that make sense to a human. This can take a very long time when dealing with huge amounts of data. Traditional RDBMSs are much slower at this kind of work because they must make complicated and time-consuming linkages to assemble large swaths of data into reports.
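Here is a small sketch of that linking step, using Python’s built-in `sqlite3` module with an in-memory database. The table names, column names, and figures are invented for the example. The data is split across two normalized tables, and the report has to join them back together.

```python
import sqlite3

# In-memory database; the schema and data are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# The report links the two tables with a JOIN, then groups the result.
report = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(report)  # [('Ada', 40.0), ('Bo', 40.0)]
```

With three orders the join is instant. The point of the paragraph above is that the same kind of linkage gets expensive when the tables hold billions of rows.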

Apache HBase is a popular distributed column-oriented data store that was modeled after Google’s Bigtable system. HBase is built on top of the Hadoop Distributed File System (HDFS). HBase allows fast access to tables that are billions of rows by millions of columns.

With column databases, you can aggregate information vertically in column families, allowing for fast access to massive amounts of data. Unlike relational models, which are row focused, column stores can scan large columns and compute summary data on them much faster.
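The difference between the two layouts can be sketched with plain Python lists and dictionaries. This is only a conceptual model with invented sample data; it says nothing about how HBase or any other product lays out data on disk.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 200},
]

# Column-oriented layout: each column's values are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row layout: summing sales means visiting every full record.
total_from_rows = sum(row["sales"] for row in rows)

# Column layout: summing sales reads one list and ignores the other columns.
total_from_columns = sum(columns["sales"])

print(total_from_columns)  # 400
```

Both totals are the same. What changes is how much data has to be read to get there, which is where column stores gain their speed on summaries over very wide or very long tables.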

image2.png