Log Data with Flume in HDFS - dummies

Log Data with Flume in HDFS

By Dirk deRoos

Some of the data that ends up in the Hadoop Distributed File System (HDFS) might land there via database load operations or other types of batch processes, but what if you want to capture the data that’s flowing in high-throughput data streams, such as application log data? Apache Flume is the current standard way to do that easily, efficiently, and safely.

Apache Flume, another top-level project from the Apache Software Foundation, is a distributed system for aggregating and moving large amounts of streaming data from different sources to a centralized data store.

Put another way, Flume is designed for the continuous ingestion of data into HDFS. The data can be any kind of data, but Flume is particularly well-suited to handling log data, such as the log data from web servers. Units of the data that Flume processes are called events; an example of an event is a log record.

To understand how Flume works within a Hadoop cluster, you need to know that Flume runs as one or more agents, and that each agent has three pluggable components: sources, channels, and sinks:

  • Sources retrieve data and send it to channels.

  • Channels hold data queues and serve as conduits between sources and sinks, which is useful when the incoming flow rate exceeds the outgoing flow rate.

  • Sinks process data that was taken from channels and deliver it to a destination, such as HDFS.

    image0.jpg

An agent must have at least one of each component to run, and each agent is contained within its own instance of the Java Virtual Machine (JVM).

An event that is written to a channel by a source isn’t removed from that channel until a sink removes it by way of a transaction. If a network failure occurs, channels keep their events queued until the sinks can write them to the cluster. An in-memory channel can process events quickly, but it is volatile and cannot be recovered, whereas a file-based channel offers persistence and can be recovered in the event of failure.

Each agent can have several sources, channels, and sinks, and although a source can write to many channels, a sink can take data from only one channel.

An agent is just a JVM that’s running Flume, and the sinks for each agent node in the Hadoop cluster send data to collector nodes, which aggregate the data from many agents before writing it to HDFS, where it can be analyzed by other Hadoop tools.

Agents can be chained together so that the sink from one agent sends data to the source from another agent. Avro, Apache’s remote call-and-serialization framework, is the usual way of sending data across a network with Flume, because it serves as a useful tool for the efficient serialization or transformation of data into a compact binary format.

In the context of Flume, compatibility is important: An Avro event requires an Avro source, for example, and a sink must deliver events that are appropriate to the destination.

What makes this great chain of sources, channels, and sinks work is the Flume agent configuration, which is stored in a local text file that’s structured like a Java properties file. You can configure multiple agents in the same file. Look at an sample file, which is named flume-agent.conf — it’s set to configure an agent named shaman:

# Identify the components on agent shaman:
shaman.sources = netcat_s1
shaman.sinks = hdfs_w1
shaman.channels = in-mem_c1
# Configure the source:
shaman.sources.netcat_s1.type = netcat
shaman.sources.netcat_s1.bind = localhost
shaman.sources.netcat_s1.port = 44444
# Describe the sink:
shaman.sinks.hdfs_w1.type = hdfs
shaman.sinks.hdfs_w1.hdfs.path = hdfs://<path>
shaman.sinks.hdfs_w1.hdfs.writeFormat = Text
shaman.sinks.hdfs_w1.hdfs.fileType = DataStream
# Configure a channel that buffers events in memory:
shaman.channels.in-mem_c1.type = memory
shaman.channels.in-mem_c1.capacity = 20000
shaman.channels.in-mem_c1.transactionCapacity = 100
# Bind the source and sink to the channel:
shaman.sources.netcat_s1.channels = in-mem_c1
shaman.sinks.hdfs_w1.channels = in-mem_c1

The configuration file includes properties for each source, channel, and sink in the agent and specifies how they’re connected. In this example, agent shaman has a source that listens for data (messages to netcat) on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console.

This configuration file could have been used to define several agents; here, you’re configuring only one to keep things simple.

To start the agent, use a shell script called flume-ng, which is located in the bin directory of the Flume distribution. From the command line, issue the agent command, specifying the path to the configuration file and the agent name.

The following sample command starts the Flume agent:

flume-ng agent -f /<path to flume-agent.conf> -n shaman

The Flume agent’s log should have entries verifying that the source, channel, and sink started successfully.

To further test the configuration, you can telnet to port 44444 from another terminal and send Flume an event by entering an arbitrary text string. If all goes well, the original Flume terminal will output the event in a log message that you should be able to see in the agent’s log.