NoSQL Terms and Definitions - dummies

By Adam Fowler

Getting your head around NoSQL can be a bit hard. If you studied databases in school, you may have been indoctrinated in a relational way of thinking. Say database to most people, and they think relational database management system. This is natural because during the past 30 years, the RDBMS has been so dominant.

To aid you on this journey, here are some key terms that are prevalent, as well as what they mean when applied to NoSQL databases.

  • Database construction

    • Database: A single logical unit, potentially spread over multiple machines, into which data can be added and that can be queried for the data it contains.

      The relational term tablespace could also be applied to a NoSQL database or collection.

    • Data farm: A term from RDBMS referring to a set of read‐only replica sets stored across a managed cluster of machines.

      In an RDBMS, these typically can’t have machines added without downtime. In NoSQL clusters, it’s desirable to be able to scale out quickly.

    • Partition: A set of data to be stored together on a single node for processing efficiency, or to be replicated.

      A partition can also be used for querying, in which case it can be thought of as a collection.
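
As a rough illustration of partitioning, the sketch below assigns each record to a partition by hashing its key. Hash-based placement is only one possible scheme and is an assumption here, as are the names `partition_for` and `build_partitions`; real NoSQL databases each use their own placement rules.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition number using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def build_partitions(records: dict, num_partitions: int) -> list:
    """Group records into per-partition dictionaries (one per node, say)."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records.items():
        partitions[partition_for(key, num_partitions)][key] = value
    return partitions
```

Because the hash is stable, the same key always lands in the same partition, so a lookup by key only has to consult one node.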

  • Database structure

    • Collection: A set of records, typically documents, that are grouped together. This is based not on a property within the record set, but within its metadata. Assigning a record to a collection is usually done at creation or update time.

    • Schema: The structure of the data. In an RDBMS, and to a certain extent in column stores, this structure must be configured in the database before any data is loaded.

      In document databases, although any structure can be stored, it is sometimes better to limit the structures by enforcing a schema, such as an XML Schema Definition. NoSQL generally, though, is regarded as schema‐free, or as supporting variable schema.
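
To make the collection and schema ideas concrete, here is a minimal sketch of an in-memory document store. The class `TinyDocumentStore` and the helper `conforms` are hypothetical names invented for this example, not any product's API. Collection membership is kept in metadata alongside each document rather than inside it, and the schema check is an optional, simplistic required-fields test.

```python
class TinyDocumentStore:
    """Hypothetical in-memory store: content and metadata are kept apart."""

    def __init__(self):
        self._docs = {}         # URI -> document content
        self._collections = {}  # URI -> set of collection names (metadata)

    def put(self, uri, doc, collections=()):
        """Store a document and assign its collections at creation or update time."""
        self._docs[uri] = doc
        self._collections[uri] = set(collections)

    def in_collection(self, name):
        """Return the URIs of all documents assigned to the named collection."""
        return sorted(uri for uri, names in self._collections.items()
                      if name in names)

def conforms(doc, required_fields):
    """A very loose 'schema': every required field must be present."""
    return all(field in doc for field in required_fields)
```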

  • Records

    • Record: A single atomic unit of data representation in the particular database being described.

      In an RDBMS, this would be a row, as it is in column stores. This could also be a value in a key‐value store, a document in a document store, or a subject (not triple) in a triple store.

    • Row: Atomic unit of record in an RDBMS or column store.

      Could be modeled as an element within a document store or as a map in a key‐value store.

    • Field: A single named value within a record. A column in an RDBMS.

      May not be present in all records, but when present should be of the same type or structure.

    • Table: A single class of record. In Bigtable, they are also called tables. In a triple store, they may be called subject RDF types or named graphs, depending on the context. In a document store, they may be collections.
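
The sketch below shows how one record might look in each kind of store described above. The sample data and the function names `row_to_document` and `document_to_triples` are assumptions made for illustration only.

```python
import json

# Relational or column store: a row whose fields are defined by the table.
columns = ("id", "name", "city")
row = ("p1", "Ada", "London")

def row_to_document(columns, row):
    """Document store: the same record as a self-describing document."""
    return dict(zip(columns, row))

def document_to_triples(subject, doc):
    """Triple store: one triple per field, all sharing the same subject.
    The record is the subject, not any single triple."""
    return [(subject, field, value) for field, value in doc.items()]

document = row_to_document(columns, row)

# Key-value store: the whole record is an opaque value stored under a key.
key_value = {"person:p1": json.dumps(document)}

triples = document_to_triples("p1", {"name": "Ada", "city": "London"})
```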

  • Record associations

    • Primary key: A guaranteed unique value in a particular table that can be used to always reference a record. A key in a key‐value store, URI in a document store, or IRI in a triple or graph store.

    • Foreign key: A data value that indicates a record is related to a record in a different table or record set. Has the same value as the primary key in the related table.

    • Relationship: A link, or edge in graph theory, that indicates two records have a semantic link. The relationship can be between two records in the same or different tables.

      In an RDBMS, relationships normally link records in different tables, whereas in a triple store it’s common to relate subjects of the same type (people in a social graph, for example). Some databases, mainly graph stores, support adding metadata to the relationships.
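
As a small sketch of these associations, the code below follows a foreign key from one record set to another, then models a relationship as a graph edge carrying metadata. All names and sample values (`customers`, `orders`, `edges`, `customer_for`, `related`) are hypothetical.

```python
# Two record sets keyed by primary key.
customers = {"c1": {"name": "Ada"}}
orders = {"o1": {"customer_id": "c1", "total": 30}}  # customer_id is a foreign key

def customer_for(order_id):
    """Follow the foreign key from an order to the customer it references."""
    return customers[orders[order_id]["customer_id"]]

# Graph-style relationships: (subject, predicate, object, metadata).
# Both ends are records of the same type, as in a social graph.
edges = [
    ("ada", "knows", "grace", {"since": 2019}),
    ("ada", "knows", "alan", {"since": 2021}),
    ("grace", "knows", "alan", {"since": 2020}),
]

def related(subject, predicate):
    """Return the objects linked from a subject by the given relationship."""
    return [obj for subj, pred, obj, _meta in edges
            if subj == subject and pred == predicate]
```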

  • Storage organization

    • Server: A single computer node within a cluster. Typically runs a single instance of a database server’s code.

    • Cluster: A physical grouping of servers that are managed together in the same data center to provide a single service. May replicate its databases to clusters in other data centers.

    • Normal form: A method of normalizing data in an RDBMS, that is, minimizing duplication.

      NoSQL databases typically lead to a denormalized data structure in order to provide faster querying or data access.
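
The trade-off can be sketched as follows: normalized data stores each fact once and joins at query time, while a denormalized document copies related data into each record so that a single read returns everything. The function `denormalize` and the sample data are assumptions for illustration.

```python
# Normalized: the product name is stored once and referenced by ID.
products = {"p1": {"name": "Widget", "price": 5}}
order_lines = [
    {"order": "o1", "product_id": "p1", "quantity": 2},
    {"order": "o2", "product_id": "p1", "quantity": 1},
]

def denormalize(order_lines, products):
    """Embed a copy of the product in every order line.
    Reads need no join, at the cost of duplicated data."""
    documents = []
    for line in order_lines:
        doc = {"order": line["order"], "quantity": line["quantity"]}
        doc["product"] = dict(products[line["product_id"]])  # copied, not referenced
        documents.append(doc)
    return documents

order_documents = denormalize(order_lines, products)
```

Note that the product name now appears in every order document, so a later rename would have to be applied in several places.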

  • Replication technology

    • Disk replication: Transparent replication of data between nodes in a single cluster to provide high‐availability resilience in the case of a failure of a single node.

    • Database replication: Replication between databases in different clusters. Replicates all data in update order from one cluster to another. Always unidirectional.

    • Flexible replication: Provides application controlled replication of data between databases in different clusters. Updates may not arrive in the same order they were applied to the first database. Typically involves some custom processing, such as prioritization of data updates to be sent next. Can be bi‐directional with appropriate update conflict resolution code.
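
The following is a minimal sketch of the flexible replication idea, assuming a priority queue to decide which update is sent next and a last-write-wins rule to resolve conflicts. Both choices, and the function name `replicate`, are assumptions for this example; real products let the application supply its own logic.

```python
import heapq

def replicate(updates, target):
    """Apply updates to a target database (a dict) in priority order.

    Each update is (priority, timestamp, key, value); a lower priority
    number is sent first, so updates may arrive out of their original order.
    Values are assumed to be mutually comparable (for example, strings).
    """
    queue = list(updates)
    heapq.heapify(queue)
    while queue:
        _priority, timestamp, key, value = heapq.heappop(queue)
        current = target.get(key)
        # Conflict resolution: keep whichever write has the newer timestamp.
        if current is None or timestamp > current[0]:
            target[key] = (timestamp, value)
    return target
```

For example, if a high-priority update written at time 2 is sent before a low-priority update to the same key written at time 1, the older value is discarded on arrival rather than overwriting the newer one.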

  • Search tools

    • Index: An ordered list of values present in a particular record.

    • Reverse index: An ordered list of values (terms), and a list of primary keys of records that use these terms.

      Provides for efficient unstructured text search and rapid aggregation functions and sorting when cached in memory.

    • Query: A set of criteria that results in a list of records that match the query exactly, returned in order of particular field value(s).

    • Search: A set of criteria that results in a relevancy‐ordered list of records that match the criteria.

      The search criteria may not require an exact match, instead returning a relevancy calculation weighted by closeness of the match to the criteria. This is what Google does when you perform a search.
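
The difference between a query and a search can be sketched with a small reverse index. The functions `build_reverse_index`, `query`, and `search` are hypothetical, and the relevancy score used here (the number of search terms a record contains) is a deliberately crude assumption; real search engines use far richer scoring.

```python
from collections import defaultdict

def build_reverse_index(docs):
    """Map each term to the set of primary keys of records that use it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for term in text.lower().split():
            index[term].add(key)
    return index

def query(index, terms):
    """Exact match: records containing every term, ordered by key."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []

def search(index, terms):
    """Relevancy match: records containing any term, best match first."""
    scores = defaultdict(int)
    for term in terms:
        for key in index.get(term.lower(), ()):
            scores[key] += 1
    return sorted(scores, key=lambda key: (-scores[key], key))
```

With three records, "fast document store", "fast key value store", and "graph store", a query for both "fast" and "store" returns only the first two, whereas a search for "fast", "graph", and "document" returns all three, with the first record ranked highest because it matches two of the terms.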