Graph Databases in a Big Data Environment
The fundamental structure of a graph database in a big data environment is called “node-relationship.” This structure is most useful when you must deal with highly interconnected data. Both nodes and relationships support properties, which are key-value pairs where the data is stored.
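To make the node-relationship structure concrete, here is a minimal in-memory sketch. It is not Neo4J's storage model or API; the class names `Node` and `Relationship`, the `KNOWS` relationship type, and the sample properties are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Node:
    """A graph node; `properties` holds its key-value data."""
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection between two nodes, with its own properties."""
    start: Node
    end: Node
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: two people and one relationship between them.
alice = Node(1, {"name": "Alice", "age": 34})
bob = Node(2, {"name": "Bob"})
knows = Relationship(alice, bob, "KNOWS", {"since": 2015})

# Both the nodes and the relationship carry key-value properties.
print(knows.start.properties["name"], knows.rel_type, knows.end.properties["name"])
```

Note that the two nodes do not share the same set of keys, which reflects the schema-free flavor of the model: each node or relationship simply carries whatever properties it needs.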
These databases are navigated by following the relationships. This kind of storage and navigation is difficult in RDBMSs (relational database management systems) because of their rigid table structures: following connections between the data wherever they might lead typically requires many joins. A graph database might be used to manage geographic data for oil exploration or to model and optimize a telecommunications provider’s networks.
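Navigation by following relationships can be sketched as a breadth-first traversal. The example below is a plain Python illustration, not Neo4J code; the telecommunications network in `NETWORK`, the `LINKS_TO` relationship type, and the function name `reachable` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical network: each node maps to a list of (relationship type, neighbor).
NETWORK: Dict[str, List[Tuple[str, str]]] = {
    "switch_a": [("LINKS_TO", "switch_b"), ("LINKS_TO", "switch_c")],
    "switch_b": [("LINKS_TO", "tower_1")],
    "switch_c": [("LINKS_TO", "tower_1"), ("LINKS_TO", "tower_2")],
    "tower_1": [],
    "tower_2": [],
    "switch_d": [("LINKS_TO", "tower_2")],
}


def reachable(graph: Dict[str, List[Tuple[str, str]]], start: str, rel_type: str) -> List[str]:
    """Return the nodes reachable from `start` via `rel_type` relationships, in BFS order."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for kind, neighbor in graph.get(current, []):
            # Follow only relationships of the requested type, visiting each node once.
            if kind == rel_type and neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


print(reachable(NETWORK, "switch_a", "LINKS_TO"))
# → ['switch_b', 'switch_c', 'tower_1', 'tower_2']
```

The traversal goes as deep as the connections lead without knowing the depth in advance. In a relational store, each additional hop would usually mean another join or a recursive query.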
One of the most widely used graph databases is Neo4J. It is an open source project licensed under the GNU General Public License (GPL) v3.0. A supported, commercial version is provided by Neo Technology under the GNU AGPL v3.0 and commercial licensing.
Neo4J is an ACID-compliant transactional database that offers high availability through clustering. It is a trustworthy and scalable database that is easy to model because its fundamental node-relationship-property structure maps naturally to our own human relationships. It does not require a schema, nor does it require data typing, so it is inherently very flexible.
This flexibility comes with a few limitations. Nodes cannot reference themselves directly. For example, you (as a node) cannot also be your own father or mother (as relationships), but you can be a father or mother to another node. There might be real-world cases where self-reference is required. If so, a graph database is not the best solution, because the rules about self-reference are strictly enforced.
While the replication capability is very good, Neo4J can replicate only entire graphs, which places a limit on the overall size of the graph (approximately 34 billion nodes and 34 billion relationships).
Important characteristics of Neo4J include the following:
Integration with other databases: Neo4J supports transaction management with rollback to allow seamless interoperability with nongraph data stores.
Synchronization services: Neo4J supports event-driven behaviors via an event bus, periodic synchronization with either Neo4J itself or an RDBMS as the master, and traditional batch synchronization.
Resiliency: Neo4J supports cold backups (taken when the database is not running) and hot backups (taken while it is running), as well as a high-availability clustering mode. Standard alerts are available for integration with existing operations management systems.
Query language: Neo4J supports a declarative language called Cypher, designed specifically to query graphs and their components. Cypher commands are loosely based on SQL syntax and are targeted at ad hoc queries of the graph data.
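As a rough illustration of the declarative style of Cypher, the sketch below pairs a short query with a plain Python function that mimics what the pattern asks for. Nothing here connects to Neo4J: the query is held only as a string, and the `Person` label, `KNOWS` relationship type, `name` property, sample data, and `friends_of` function are all hypothetical.

```python
# Illustrative Cypher text only; it is not sent to any database in this sketch.
CYPHER_QUERY = """
MATCH (a:Person {name: $name})-[:KNOWS]->(b:Person)
RETURN b.name AS friend
"""

# Tiny in-memory stand-in for the graph (not Neo4J).
PEOPLE = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Carol"}}
KNOWS = [(1, 2), (1, 3), (2, 3)]  # (start node id, end node id)


def friends_of(name: str) -> list:
    """Mimic the pattern above: return b.name for every (a)-[:KNOWS]->(b) where a.name == name."""
    start_ids = {node_id for node_id, props in PEOPLE.items() if props["name"] == name}
    return [PEOPLE[end]["name"] for start, end in KNOWS if start in start_ids]


print(friends_of("Alice"))
# → ['Bob', 'Carol']
```

The query describes the shape of the data to find (a person, an outgoing relationship, another person) rather than the steps for finding it, much as SQL describes a result set instead of a loop.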
Neo4J implementations are best suited for:
Classification of biological or medical domains
Creating dynamic communities of practice or interest