Hadoop Tutorial 2 - Dive into the Hadoop Distributed File System
In Hadoop Tutorial 1, we introduced the big picture of Hadoop and the HDFS architecture, and in the installation guide we learned how to set up a single-node Hadoop on our own computer. But we have not yet seen how HDFS actually works, or how to operate it ourselves.
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail, while the user guide primarily deals with how users and administrators interact with HDFS clusters. The HDFS architecture diagram depicts the basic interactions among the NameNode, the DataNodes, and the clients: clients contact the NameNode for file metadata or file modifications, and perform actual file I/O directly with the DataNodes.
In this article, I will introduce the working knowledge of HDFS:
- Configuration of HDFS
- Common operations on HDFS
- Useful features of HDFS
1. Configuring HDFS
Hadoop configuration is driven by two types of configuration files, one pair per major component:
| Read-only default | Site-specific |
|---|---|
| core-default.xml | core-site.xml |
| hdfs-default.xml | hdfs-site.xml |
| mapred-default.xml | mapred-site.xml |
| yarn-default.xml | yarn-site.xml |
| httpfs-default.xml | httpfs-site.xml |
Users can override the default configurations by setting new values in the site-specific XML files under the directory etc/hadoop. The full list of defaults is documented in hdfs-default.xml. Here we list some common HDFS settings.
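For example, a minimal site-specific core-site.xml might override the file system URI and the base temporary directory. This is an illustrative sketch for a single-node setup; the host, port, and path below are assumptions, not required values:

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/core-site.xml: site-specific overrides (illustrative values) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hypothetical single-node NameNode address -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- hypothetical local path; dfs.namenode.name.dir and
         dfs.datanode.data.dir default to subdirectories of this -->
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>
```

Any property not overridden here keeps the value from the corresponding read-only default file.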
1) Configuration for NameNode:
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
2) Configuration for DataNode:
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored. </description>
</property>
3) Configuration for replication factor:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time. </description>
</property>
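To see what dfs.replication=3 means for capacity planning, here is a quick back-of-the-envelope sketch (plain Python, no Hadoop required; the function name is ours, not part of any Hadoop API):

```python
def raw_storage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw cluster storage consumed by a file under HDFS replication.

    Every block is stored `replication` times, so raw usage is the
    logical size times the replication factor (ignoring the small
    per-block metadata kept on the NameNode).
    """
    return logical_bytes * replication

one_gib = 1024 ** 3
print(raw_storage_bytes(one_gib))     # a 1 GiB file occupies 3 GiB of raw disk
print(raw_storage_bytes(one_gib, 2))  # 2 GiB if created with replication 2
```

The per-file factor can also be changed after creation with the hdfs dfs -setrep command.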
2. Common operations on HDFS
There are two types of HDFS commands: user commands and administration commands. The HDFS Command Guide describes the most useful commands for users and administrators of a Hadoop cluster. All HDFS commands are invoked by the bin/hdfs script; running the script without any arguments prints a description of all commands.
One of the HDFS command modules is dfs, which provides basic file manipulations such as loading and retrieving files and changing file permissions. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell, and bin/hdfs dfs -help command-name displays more detailed help for a single command. The various COMMAND_OPTIONS can be found in the File System Shell Guide.
Some examples:
| Operation | Command |
|---|---|
| Create a directory | hdfs dfs -mkdir /user/hadoop/dir1 |
| List files | hdfs dfs -ls /user/hadoop/file1 |
| Upload a file | hdfs dfs -put localfile /user/hadoop/hadoopfile |
3. Useful features of HDFS
The following is a set of useful features in HDFS. More details can be found in the HDFS User Guide.
- File permissions and authentication.
- Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: an administrative mode for maintenance.
- fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
- fetchdt: a utility to fetch a DelegationToken and store it in a file on the local system.
- Balancer: a tool to balance the cluster when data is unevenly distributed among the DataNodes.
- Upgrade and rollback: after a software upgrade, it is possible to roll back to the state of HDFS before the upgrade in case of unexpected problems.
- Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
- Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to HDFS. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle-hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
- Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
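The rack-awareness item above is worth a closer look. For a replication factor of 3, HDFS's default placement policy puts the first replica on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. A small simulation of that rule (plain Python; the node and rack names are illustrative, and the real policy also weighs load and available space):

```python
import random

def place_replicas(writer, topology, seed=None):
    """Sketch of HDFS's default rack-aware placement for 3 replicas.

    topology: dict mapping node name -> rack name.
    Returns a list of 3 distinct node names.
    """
    rng = random.Random(seed)
    first = writer                                   # replica 1: writer's node
    remote = [n for n in topology
              if topology[n] != topology[first]]
    second = rng.choice(remote)                      # replica 2: different rack
    peers = [n for n in topology
             if topology[n] == topology[second] and n != second]
    third = rng.choice(peers)                        # replica 3: same rack as 2nd
    return [first, second, third]

# Hypothetical 4-node, 2-rack cluster
topology = {"dn1": "rack-a", "dn2": "rack-a", "dn3": "rack-b", "dn4": "rack-b"}
print(place_replicas("dn1", topology, seed=0))
```

This layout survives the loss of an entire rack while keeping two of the three replicas close together to limit cross-rack write traffic.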
Congratulations! You have finished Tutorial 2. I will post the following tutorials soon!