Layer 3 of the Big Data Stack: Organizing Data Services and Tools
Organizing data services and tools, layer 3 of the big data stack, capture, validate, and assemble various big data elements into contextually relevant collections. Because big data is massive, techniques have evolved to process the data efficiently and seamlessly. MapReduce is one heavily used technique. Many of these organizing data services are MapReduce engines, designed specifically to optimize the organization of big data streams.
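To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It is an illustration of the map, shuffle, and reduce phases only, not how any particular engine implements them. The function names (map_phase, shuffle, reduce_phase, map_reduce) and the sample records are assumptions chosen for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would
    do between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in order over an iterable of text records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input: two short text records.
counts = map_reduce(["big data is big", "data streams"])
print(counts)
```

In a real engine the map and reduce calls run in parallel across many machines and the shuffle moves data over the network; the sketch keeps everything in one process so the flow of key-value pairs is easy to follow.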
Organizing data services are, in reality, an ecosystem of tools and technologies that can be used to gather and assemble data in preparation for further processing. As such, the tools need to provide integration, translation, normalization, and scale. Technologies in this layer include the following:
A distributed file system: Necessary to accommodate the decomposition of data streams and to provide scale and storage capacity
Serialization services: Necessary for persistent data storage and multilanguage remote procedure calls (RPCs)
Coordination services: Necessary for building distributed applications (locking and so on)
Extract, transform, and load (ETL) tools: Necessary for the loading and conversion of structured and unstructured data into Hadoop
Workflow services: Necessary for scheduling jobs and providing a structure for synchronizing process elements across layers
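Several of the roles in the list above (ETL, serialization, and workflow) can be sketched together in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not a stand-in for a real tool: the field names user and event, the sample lines, and the functions extract, transform, load, and run_workflow are all hypothetical, and a plain dictionary takes the place of a distributed store.

```python
import json

# Hypothetical raw input: one well-formed record, one non-JSON line,
# another well-formed record, and one record missing a required field.
RAW_LINES = [
    '{"user": " Alice ", "event": "LOGIN"}',
    'not json at all',
    '{"user": "bob", "event": "click"}',
    '{"event": "click"}',
]


def extract(lines):
    """Extract: parse each line, skipping lines that are not valid JSON."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(records):
    """Transform: validate required fields, then normalize their values."""
    for rec in records:
        if not isinstance(rec, dict) or "user" not in rec or "event" not in rec:
            continue
        yield {
            "user": rec["user"].strip().lower(),
            "event": rec["event"].strip().lower(),
        }


def load(records):
    """Load: assemble records into per-user collections.
    A dict stands in for the target store in this sketch."""
    store = {}
    for rec in records:
        store.setdefault(rec["user"], []).append(rec["event"])
    return store


def run_workflow(lines):
    """Workflow: run the steps in a fixed order, each feeding the next."""
    return load(transform(extract(lines)))


store = run_workflow(RAW_LINES)

# Serialization: turn the assembled collections into a portable string
# that could be persisted or sent to a program in another language.
serialized = json.dumps(store, sort_keys=True)
print(serialized)
```

The invalid line and the record without a user field are dropped during extraction and validation, so only the two normalized records reach the store.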