Securing Your Data in Hadoop
As Hadoop enters the IT mainstream and starts getting used in a major way in production environments, the same security concerns that apply to IT systems such as databases will be applicable to Hadoop as well. In its early years, Hadoop was famously not designed with security in mind, but the addition of enterprise-strength security capabilities is a major part of Hadoop's coming of age. It's a necessary part as well: For many applications (such as finance), if you can't provide security assurances, you may be breaking the law.
This article focuses on three main aspects of securing information — aspects that apply to Hadoop as they would to any other IT system:
The first principle in IT security is to tightly control the boundaries between your system and the outside world. Because Hadoop is a distributed system spanning many computers, this is largely a networking problem. As a distributed computing platform, a Hadoop cluster has many individual computers, with each computer having a number of open ports and services.
As you might expect, this is a security nightmare, one that most administrators handle by keeping the cluster on an isolated network. The challenge comes when users need to run applications against Hadoop itself. Consider deploying edge nodes, with shared networking, to act as gateways between Hadoop and the outside world. This strategy presents security challenges, however. To meet this challenge, the Hortonworks team has started development of the Apache Knox project, which enables secure access to the Hadoop cluster's services.
A big part of the security discussion is controlling access. Where perimeter control is about minimizing access points, access control is ensuring that any access that does happen is secure.
At the front line of access control is authentication, which, in short, is validation that your users are who they say they are. The open source community has put a tremendous amount of work into this area, enabling the various components in the Apache Hadoop ecosystem to work with Kerberos, the well-regarded computer network authentication protocol. As of Spring 2014, both Hadoop 1 and Hadoop 2 releases are fully Kerberos-enabled. (Not every IT shop uses Kerberos, but other protocols, such as LDAP, have been applied to Hadoop by some Hadoop distribution vendors in their proprietary offerings.)
After your authentication services have validated the identity of a user, the next question is determining what information and behaviors this user is entitled to — authorization, in other words.
Currently, authorization in Hadoop is rather primitive, and is restricted to the POSIX-style read, write, and execute privileges at the file system level. However, significant efforts are under way to define classes of users (for example, user roles) and the administration of access control lists (ACLs).
The Hive project, for example, will soon have grant/revoke commands to enable administrators to define which users can access specific tables or views. To this end, the Cloudera team has been spearheading the Apache Knox project to manage the definition of user roles and their privileges for accessing data in Impala and Hive.
The final piece of the access control puzzle is tracking data access events, which is a core requirement for a number of information management regulatory standards, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS). Hadoop does a good job of storing audit information to record data access events, so a core requirement is already in place. To protect and manage that audit data, third-party tools are available, such as Cloudera's Navigator or IBM Guardium.
After ensuring that your data's defenses are in place by managing the perimeter and governing access, you can do still more in case a breach does happen. Encryption can be that last line of defense. For data on disk, active work is taking place in the Hadoop community to incorporate encryption as an option for any data stored in HDFS. Intel's distribution has an early jump on this because it has enabled encryption for data in HDFS by taking advantage of specialized encryption instructions in Intel CPUs used in Hadoop slave nodes. Third-party tools are also available to encrypt data in HDFS.
Because Hadoop is a distributed system relying heavily on network communication, encrypting data as it moves through the network is a critical part of this story. Back in Hadoop 1, the Hadoop Remote Procedure Call (RPC) system was enhanced to support encryption. This covers the communication involved in data processing, such as MapReduce, but for data movement and the web interfaces, Hadoop also uses TCP/IP and HTTP. Both of these have also been secured: Hadoop's HTTP server now supports HTTPS, and HDFS transfer operations can be configured to be encrypted.