Scaling NoSQL - dummies

By Adam Fowler

One common feature of NoSQL systems is their ability to scale across many commodity servers. These relatively cheap platforms mean that you can scale up databases by adding a new server rather than replace old hardware with new, more powerful hardware in a single shot.

There are high‐volume use cases that will quickly force you to scale out. These include

  • You receive status reports and log messages from across an IT landscape. This scenario requires fast ingest times, but it probably doesn’t require advanced analysis support.

  • You want high‐speed caching for complex queries. Maybe you want to get the latest news stories on a website. Here, read caches take prominence over query or ingest speeds.

The one thing common to the performance of all NoSQL databases is that you can’t rely on published data — none of it — to figure out what the performance is likely to be on your data, for your own use case.

You certainly can’t rely on a particular database vendor’s promise on performance! Many vendors quote high ingest speeds against an artificial use case that is not a realistic use of their database, as proof of their database’s supremacy.

However, the problem is that these same studies may totally ignore query speed. What’s the point in storing data if you never use it?

These studies may also be done on systems where key features are disabled. Security indexes may not be enabled, or perhaps ACID transaction support is turned off during the study so that data is stored quickly, but there’s no guarantee that it’s safe.

This all means that you must do your own testing, which is easy enough, but be sure that the test is as close to your final system as possible. For example, there’s no point in testing a single server if you plan to scale to 20 servers. In particular, be sure to have an accurate mix of ingesting, modifying, and querying data.

Consider asking your NoSQL vendor these questions:

  • Can you ensure that all sizing and performance figures quoted are for systems that ensure ACID transactions during ingest that support real‐time indexing, and that include a realistic mix of ingest and read/query requests?

  • Does your product provide features that make it easy to increase a server’s capacity?

  • Does your product provide features that make it easy to remove unused server capacity?

  • Is your product’s data query speed limited by the amount of information that has to be cached in RAM?

  • Does your product use a memory map strategy that requires all indexes to be held in RAM for adequate performance (memory mapped means the maximum amount of data stored is the same as the amount of physical RAM installed)?

  • Can your database maintain sub‐second query response times while receiving high‐frequency updates?

  • Does the system ensure that no downtime is required to add or remove server capacity?

  • Does the system ensure that information is immediately available for query after it is added to the database?

  • Does the system ensure that security of data is maintained without adversely affecting query speed?

  • Does the system ensure that the database’s scale‐out and scale‐back capabilities are scriptable and that they will integrate to your chosen server provisioning software (for example, VMWare and Amazon Cloud Formation)?