Real-Time and Streaming Applications in Hadoop

By Dirk deRoos

The process flow of YARN looks an awful lot like a framework for batch execution. You might wonder, “What happened to this idea of flexibility for different modes of applications?” Well, the only application framework currently ready for production use is MapReduce. Soon, Apache Tez and Apache Storm will be ready for production use, and you’ll be able to use Hadoop for more than just batch processing.

Tez, for example, will support real-time applications — an interactive kind of application where the user expects an immediate response. One design goal of Tez is to provide an interactive facility for users to issue Hive queries and receive a result set in just a few seconds or less.

Another example of a non-batch type of application is Storm, which can analyze streaming data. This concept is completely different from either MapReduce or Tez, both of which operate against data that is already persisted to disk — in other words, data at rest. Storm processes data that hasn’t yet been stored to disk — more specifically, data that’s streaming into an organization’s network. It’s data in motion, in other words.
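To make the contrast between data at rest and data in motion concrete, here is a minimal sketch in plain Python. It doesn't use the Storm or MapReduce APIs; the function names and the three sample log lines are made up for illustration. The batch function needs the whole data set before it starts, whereas the streaming function updates its result each time a line arrives.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_word_count(stored_lines: List[str]) -> Counter:
    """Data at rest: the full data set exists before processing starts."""
    counts = Counter()
    for line in stored_lines:
        counts.update(line.split())
    return counts


def streaming_word_count(incoming: Iterable[str]) -> Iterator[Counter]:
    """Data in motion: update and emit a running result per arriving line."""
    counts = Counter()
    for line in incoming:
        counts.update(line.split())
        yield Counter(counts)  # snapshot of the result so far


def line_source() -> Iterator[str]:
    """Hypothetical feed of log lines; a real stream has no known end."""
    yield "error disk full"
    yield "warning disk slow"
    yield "error network down"


# Batch: collect everything first, then process it in one pass.
batch_result = batch_word_count(list(line_source()))

# Streaming: a usable (partial) answer exists after every single line.
snapshots = list(streaming_word_count(line_source()))

print(batch_result["error"], snapshots[0]["error"], snapshots[-1]["disk"])
```

Once the sample feed runs dry, the last streaming snapshot matches the batch result. The difference is that the streaming version had an answer available after the very first line, without waiting for the data to be stored.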

In both cases, the interactive and streaming-data processing goals wouldn’t work if an Application Master, along with all its required containers, had to be instantiated for every request, because the startup delay alone would rule out an immediate response. What YARN allows here is the concept of an ongoing service (a session), where there’s a dedicated Application Master that stays alive, waiting to coordinate requests. The Application Master also holds open leases on reusable containers to execute any requests as they arrive.
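The session idea can be sketched as a toy model in Python. This is not the YARN API: the class names, the pool size, and the "least-used container" rule are all assumptions chosen for illustration. The sketch only counts how many containers get launched when a long-lived master reuses its leases, compared with starting a fresh container for every request.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    """Stand-in for a leased container; counts how often it is reused."""
    container_id: int
    tasks_run: int = 0

    def run(self, request: str) -> str:
        self.tasks_run += 1
        return f"{request} done in container {self.container_id}"


@dataclass
class SessionApplicationMaster:
    """Hypothetical long-lived master holding leases on reusable containers."""
    pool_size: int = 2
    containers: List[Container] = field(default_factory=list)
    launches: int = 0  # total number of containers started

    def __post_init__(self) -> None:
        # Pay the startup cost once, when the session opens.
        for i in range(self.pool_size):
            self.containers.append(Container(i))
            self.launches += 1

    def submit(self, request: str) -> str:
        # Reuse the least-used leased container instead of launching one.
        container = min(self.containers, key=lambda c: c.tasks_run)
        return container.run(request)


def run_per_request(requests: List[str]):
    """Batch-style baseline: a fresh container for every request."""
    launches = 0
    results = []
    for request in requests:
        container = Container(launches)  # new container each time
        launches += 1
        results.append(container.run(request))
    return results, launches


requests = ["query-1", "query-2", "query-3", "query-4"]

session = SessionApplicationMaster(pool_size=2)
session_results = [session.submit(r) for r in requests]

_, batch_launches = run_per_request(requests)

print(session.launches, batch_launches)  # prints "2 4"
```

In this model the session launches its two containers up front and then serves all four requests with them, while the per-request baseline launches four. In a real cluster each launch also carries scheduling and startup time, and avoiding that delay is what makes responses within seconds plausible.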