Document NoSQL Versus ECM - dummies

By Adam Fowler

Enterprise content management (ECM) systems have been around for more than ten years. Document NoSQL may offer some competition. Examples of ECM systems include IBM FileNet, DB2 Content Manager, and EMC Documentum. Many smaller companies, such as Stellent (now Oracle), have been incorporated into larger offerings.

A simplified form of ECM, called Basic Content Services, also appeared, most commonly in Microsoft SharePoint. SharePoint’s emergence commoditized the ECM marketplace, drying up innovation, albeit with the benefit of lower license costs for customers.

ECM systems support document versioning, usually with a major number for published versions and a minor number for in-progress versions, although some also support a third “revision” number. These systems store a document separately from its metadata and enforce read and write access controls on both the documents and their properties.
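The major/minor numbering described above can be sketched in a few lines of Python. This is a minimal illustration only: the `DocumentVersion` class, its method names, and the choice to start drafts at 0.1 are assumptions made for the example, not the scheme of any particular ECM product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DocumentVersion:
    """Hypothetical version label: major = published, minor = in-progress."""
    major: int = 0
    minor: int = 1

    def save_draft(self) -> "DocumentVersion":
        # An in-progress edit bumps only the minor number.
        return replace(self, minor=self.minor + 1)

    def publish(self) -> "DocumentVersion":
        # Publishing promotes the draft to the next major version
        # and resets the minor number.
        return DocumentVersion(major=self.major + 1, minor=0)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"


version = DocumentVersion()      # first draft: 0.1
version = version.save_draft()   # second draft: 0.2
version = version.publish()      # first published version: 1.0
version = version.save_draft()   # new draft on top of it: 1.1
print(version)
```

A system with a third “revision” number would add one more field that changes on every save, leaving the minor number for drafts that are explicitly checked in.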

Workflow support was also incorporated into most of the prominent ECM systems. These ranged from basic workflow — approval and updating of documents — to full end-to-end business process management, including process simulation and round-trip reengineering for continuous process improvement.

Records Management Systems (RMS) were often built on top of ECM systems, thereby allowing the application of retention rules to documents and further protecting them from modification. This is particularly useful, for example, if you’re in a regulated industry and need to preserve important documents during litigation or discovery.
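As a rough sketch of what a retention rule checks, the Python below models a record with a retention period and a legal-hold flag. The `Record` class and its field names are hypothetical; real records management systems have far richer rule sets.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    """Hypothetical record under a retention rule."""
    declared_on: date          # date the document was declared a record
    retention_days: int        # how long it must be kept
    legal_hold: bool = False   # set during litigation or discovery

    def can_modify(self) -> bool:
        # Once declared a record, the document is protected from changes.
        return False

    def can_dispose(self, today: date) -> bool:
        # A legal hold overrides the retention schedule entirely.
        if self.legal_hold:
            return False
        return today >= self.declared_on + timedelta(days=self.retention_days)


contract = Record(declared_on=date(2010, 1, 1), retention_days=365)
print(contract.can_dispose(date(2010, 6, 1)))   # still inside the retention period
print(contract.can_dispose(date(2011, 6, 1)))   # retention period has elapsed

contract.legal_hold = True
print(contract.can_dispose(date(2011, 6, 1)))   # held for litigation
```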

These ECM systems typically stored the documents in a file system and the metadata in a relational database management system. The ECM systems were effectively middleware applications that could be clustered for high availability, but that relied on a centralized database and file shares. They had limited scalability for very high-speed ingestion and were better suited to smaller numbers of large, important documents, such as office files and high-quality TIFF images from document scanning.
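The split between file storage and relational metadata can be illustrated with Python's standard library, using SQLite in place of a full RDBMS and a temporary directory in place of a file share. The table name, column names, and helper functions are assumptions made for this sketch.

```python
import sqlite3
import tempfile
from pathlib import Path


def store_document(db: sqlite3.Connection, share: Path, name: str,
                   content: bytes, author: str) -> int:
    """Write the content to the file share and its metadata to the database."""
    cur = db.execute(
        "INSERT INTO doc_metadata (name, author, path) VALUES (?, ?, '')",
        (name, author),
    )
    doc_id = cur.lastrowid
    path = share / f"{doc_id}_{name}"
    path.write_bytes(content)  # the document itself lives on the file share
    db.execute("UPDATE doc_metadata SET path = ? WHERE id = ?",
               (str(path), doc_id))
    db.commit()
    return doc_id


def load_document(db: sqlite3.Connection, doc_id: int) -> tuple[dict, bytes]:
    """Look up the metadata row, then follow its path to the content."""
    name, author, path = db.execute(
        "SELECT name, author, path FROM doc_metadata WHERE id = ?", (doc_id,)
    ).fetchone()
    return {"name": name, "author": author}, Path(path).read_bytes()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE doc_metadata ("
           "id INTEGER PRIMARY KEY, name TEXT, author TEXT, path TEXT)")
share = Path(tempfile.mkdtemp())  # stands in for a network file share

doc_id = store_document(db, share, "scan.tif", b"...image bytes...", "afowler")
metadata, content = load_document(db, doc_id)
print(metadata, len(content))
```

Every read and write passes through the one database and the one file share, which is why this design is easy to cluster at the middleware tier but hard to scale for very fast ingestion.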

Document NoSQL databases are adding new functionality all the time. Their high scalability and ability to run on very cheap commodity servers mean they cost even less than commoditized ECM systems.

Some NoSQL databases support storing multiple versions. Most of these databases are currently Bigtable clones, but some document databases do support this. MarkLogic Server has a Document Library Services (DLS) add-on that supports versioned storage of documents, although this isn’t visible in MarkLogic’s REST API.

MarkLogic Server also includes a Content Processing Framework (CPF). CPF is a state engine that moves a single document through a lifecycle and carries out actions based on its content, typically converting binary documents to XHTML and performing entity extraction using third-party tools.
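The general idea of a state engine that walks a document through a lifecycle can be sketched in Python. This is not CPF's API: the state names, the `TRANSITIONS` table, and both action functions are hypothetical stand-ins, and the "entity extraction" here is a toy rule rather than a third-party tool.

```python
from typing import Callable

Doc = dict
Action = Callable[[Doc], None]


def convert_to_xhtml(doc: Doc) -> None:
    # Stand-in for a real binary-to-XHTML converter.
    doc["xhtml"] = f"<html><body><p>{doc['raw']}</p></body></html>"


def extract_entities(doc: Doc) -> None:
    # Toy extraction: treat capitalized words as entities.
    doc["entities"] = [word for word in doc["raw"].split() if word.istitle()]


# Each state maps to the action to run and the state to move to afterward.
TRANSITIONS: dict[str, tuple[Action, str]] = {
    "initial": (convert_to_xhtml, "converted"),
    "converted": (extract_entities, "enriched"),
}


def run_lifecycle(doc: Doc) -> Doc:
    """Advance one document until it reaches a state with no transition."""
    while doc["state"] in TRANSITIONS:
        action, next_state = TRANSITIONS[doc["state"]]
        action(doc)
        doc["state"] = next_state
    return doc


doc = run_lifecycle({"state": "initial", "raw": "Acme hired Alice"})
print(doc["state"], doc["entities"])
```

Keeping the transitions in a table rather than in code makes it straightforward to add a state, such as a review step, without touching the engine loop.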

These small feature sets may be adopted and extended by multiple NoSQL vendors in the future to provide the same engine-level features that Microsoft SharePoint and ECM systems provide. If so, document NoSQL databases may become the new storage and metadata engines behind ECM, which could mean increased throughput and lower costs for customers. It also promises search embedded in the ECM system itself, supplied by the underlying database. That would offer more functionality than ECM systems currently provide on their own.