Managing Files with the Hadoop File System Commands
HDFS is one of the two main components of the Hadoop framework; the other is the computational paradigm known as MapReduce. A distributed file system is a file system that manages storage across a networked cluster of machines.
HDFS stores data in blocks, units whose default size is 64MB (128MB in Hadoop 2). Files that you want stored in HDFS need to be broken into block-size chunks that are then stored independently throughout the cluster. You can use the fsck command to list the blocks that make up each file in HDFS, as follows:
% hadoop fsck / -files -blocks
Because Hadoop is written in Java, all interactions with HDFS are managed via the Java API. Keep in mind, though, that you don’t need to be a Java guru to work with files in HDFS. Several Hadoop interfaces built on top of the Java API are now in common use (and hide Java), but the simplest one is the command-line interface, which is what the examples that follow use to interact with HDFS.
You access the Hadoop file system shell by running one form of the hadoop command. All hadoop commands are invoked by the bin/hadoop script. (To retrieve a description of all hadoop commands, run the hadoop script without specifying any arguments.) The hadoop command has the syntax
hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
The --config confdir option overrides the default configuration directory ($HADOOP_HOME/conf), so you can easily customize your Hadoop environment configuration. The generic options and command options are a common set of options that are supported by several commands.
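For example, if you keep a second set of configuration files for a test cluster, you can point the hadoop script at it explicitly. (The conf-dev directory name here is hypothetical; substitute whatever directory holds your alternative *-site.xml files.)

$ hadoop --config /home/joanna/conf-dev fs -ls /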
Hadoop file system shell commands (for command line interfaces) take uniform resource identifiers (URIs) as arguments. A URI is a string of characters that’s used to identify a name or a web resource.
The string can include a scheme name, a qualifier for the nature of the data source. For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is file. If you don’t specify a scheme name, the default is the scheme name that’s specified in the configuration file. A file or directory in HDFS can be specified in a fully qualified way, such as in this example:

hdfs://namenodehost/parent/child

Or it can simply be /parent/child if the configuration file points to hdfs://namenodehost.
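Assuming that the configuration file points to hdfs://namenodehost (a hypothetical host name), these two commands list the same directory; the first spells out the full URI, and the second relies on the default scheme and authority:

$ hdfs dfs -ls hdfs://namenodehost/user/joanna
$ hdfs dfs -ls /user/joanna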
The Hadoop file system shell commands, which are similar to Linux file commands, have the following general syntax:
hdfs dfs -file_cmd
Readers with some prior Hadoop experience might ask, “But what about the hadoop fs command?” The hadoop fs command still works in Hadoop 2 — it’s the generic shell for any file system Hadoop supports — but the older hadoop dfs form is deprecated, and its script simply redirects to hdfs. When you’re working with HDFS specifically, use hdfs dfs.
As you might expect, you use the mkdir command to create a directory in HDFS, just as you would do on Linux or on Unix-based operating systems. Though HDFS has a default working directory, /user/$USER, where $USER is your login username, you need to create it yourself by using the syntax
$ hdfs dfs -mkdir /user/login_user_name
For example, to create a directory named “joanna”, run this mkdir command:
$ hdfs dfs -mkdir /user/joanna
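Note that mkdir fails if an intermediate directory in the path doesn’t exist yet. In Hadoop 2, you can add the -p flag, which, as on Linux, creates any missing parent directories along the path. The raw/2013 subdirectories here are hypothetical:

$ hdfs dfs -mkdir -p /user/joanna/raw/2013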
Use the Hadoop put command to copy a file from your local file system to HDFS:
$ hdfs dfs -put file_name /user/login_user_name
For example, to copy a file named data.txt to this new directory, run the following put command:
$ hdfs dfs -put data.txt /user/joanna
Run the ls command to get an HDFS file listing:
$ hdfs dfs -ls .
Found 2 items
drwxr-xr-x   - joanna supergroup          0 2013-06-30 12:25 /user/joanna
-rw-r--r--   1 joanna supergroup        118 2013-06-30 12:15 /user/joanna/data.txt
The file listing itself breaks down as described in this list:
Column 1 shows the file mode: “d” for a directory or “-” for a normal file, followed by the permissions. The three permission types — read (r), write (w), and execute (x) — are the same as you find on Linux- and Unix-based systems. The execute permission for a file is ignored because you cannot execute a file on HDFS. The permissions are grouped by owner, group, and public (everyone else).
Column 2 shows the replication factor for files. (The concept of replication doesn’t apply to directories.) The blocks that make up a file in HDFS are replicated to ensure fault tolerance. The replication factor, or the number of replicas that are kept for a specific file, is configurable. You can specify the replication factor when the file is created or later, via your application.
Columns 3 and 4 show the file owner and group. supergroup is the name of the group of superusers; a superuser is any user with the same identity as the NameNode process, so whoever started the NameNode is a superuser. This is a special group — regular users’ userids belong to ordinary groups without special characteristics, groups that are simply defined by a Hadoop administrator.
Column 5 shows the size of the file, in bytes, or 0 if it’s a directory.
Columns 6 and 7 show the date and time of the last modification, respectively.
Column 8 shows the unqualified name (meaning that the scheme name isn’t specified) of the file or directory.
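The replication factor shown in Column 2 can also be changed after a file is created, with the setrep command. For example, to set the replication of the data.txt file from earlier to 2 copies (the optional -w flag makes the command wait until the re-replication completes):

$ hdfs dfs -setrep -w 2 /user/joanna/data.txt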
Use the Hadoop get command to copy a file from HDFS to your local file system; the first argument is the HDFS source, and the second is the local destination:

$ hdfs dfs -get /user/login_user_name/file_name local_destination
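For example, to copy data.txt back from HDFS into your current local working directory:

$ hdfs dfs -get /user/joanna/data.txt .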
Use the Hadoop rm command to delete a file or an empty directory:

$ hdfs dfs -rm /user/login_user_name/file_name
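The rm command won’t remove a directory that still has contents. In Hadoop 2, you can add the -r (recursive) option to delete a directory and everything beneath it; use it with care:

$ hdfs dfs -rm -r /user/joanna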
Use the hdfs dfs -help command to get detailed help for every option.
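You can also pass a single command name to -help to see the detailed usage for just that command. For example, to see the options for rm:

$ hdfs dfs -help rm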