Hadoop Tutorial 2 - Dive into the Hadoop Distributed File System
In Hadoop Tutorial 1, we introduced the big picture of Hadoop and the HDFS architecture, and in the installation guide we learned how to set up a single-node Hadoop on our own computer. But we have not yet seen how HDFS actually works, or how to operate it ourselves.
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail, while the user guide primarily deals with how users and administrators interact with HDFS clusters. The HDFS architecture diagram depicts the basic interactions among the NameNode, the DataNodes, and the clients: clients contact the NameNode for file metadata or file modifications, and perform actual file I/O directly with the DataNodes.
In this article, I will introduce the working knowledge of HDFS:
- Configuration of HDFS
- Common operations on HDFS
- Useful features of HDFS
1. Configuring HDFS
Hadoop configuration is driven by two types of configuration files, one pair per major component:
| Read-only default | Site-specific |
|---|---|
| core-default.xml | core-site.xml |
| hdfs-default.xml | hdfs-site.xml |
| mapred-default.xml | mapred-site.xml |
| yarn-default.xml | yarn-site.xml |
| httpfs-default.xml | httpfs-site.xml |
Users can override the default configurations by setting new values in the site-specific XML files under the directory etc/hadoop. The full list of defaults is documented in hdfs-default.xml. Here we list some common HDFS settings.
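For example, a minimal site-specific core-site.xml might override the file system URI and the base temporary directory. This is an illustrative sketch for a single-node setup; the host, port, and path below are assumptions, not required values:

```xml
<?xml version="1.0"?>
<!-- etc/hadoop/core-site.xml: site-specific overrides (illustrative values) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hypothetical single-node NameNode address -->
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- hypothetical local path; dfs.namenode.name.dir and
         dfs.datanode.data.dir default to subdirectories of this -->
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>
```

Any property not overridden here keeps the value from the corresponding read-only default file.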
1) Configuration for NameNode:
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
2) Configuration for DataNode:
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored. </description>
</property>
3) Configuration for replication factor:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time. </description>
</property>
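To see what dfs.replication=3 means for capacity planning, here is a quick back-of-the-envelope sketch (plain Python, no Hadoop required; the function name is ours, not part of any Hadoop API):

```python
def raw_storage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw cluster storage consumed by a file under HDFS replication.

    Every block is stored `replication` times, so raw usage is the
    logical size times the replication factor (ignoring the small
    per-block metadata kept on the NameNode).
    """
    return logical_bytes * replication

one_gib = 1024 ** 3
print(raw_storage_bytes(one_gib))     # a 1 GiB file occupies 3 GiB of raw disk
print(raw_storage_bytes(one_gib, 2))  # 2 GiB if created with replication 2
```

The per-file factor can also be changed after creation with the hdfs dfs -setrep command.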
2. Common operations on HDFS
There are two types of HDFS commands: user commands and administration commands. The HDFS Command Guide describes the most useful commands for users and administrators of a Hadoop cluster. All HDFS commands are invoked by the bin/hdfs script; running the script without any arguments prints a description of all commands.
One of the HDFS command modules is dfs, which provides basic file manipulations such as loading and retrieving files and changing file permissions. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell, and bin/hdfs dfs -help command-name displays more detailed help for a single command. The various COMMAND_OPTIONS can be found in the File System Shell Guide.
Some examples:
| Operation | Command |
|---|---|
| Create a directory | hdfs dfs -mkdir /user/hadoop/dir1 |
| List files | hdfs dfs -ls /user/hadoop/file1 |
| Upload a file | hdfs dfs -put localfile /user/hadoop/hadoopfile |
3. Useful features of HDFS
The following is a set of useful features in HDFS. More details can be found in the HDFS User Guide.
- File permissions and authentication.
- Rack awareness: to take a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: an administrative mode for maintenance.
- fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
- fetchdt: a utility to fetch a DelegationToken and store it in a file on the local system.
- Balancer: a tool to balance the cluster when data is unevenly distributed among the DataNodes.
- Upgrade and rollback: after a software upgrade, it is possible to roll back to the state of HDFS before the upgrade in case of unexpected problems.
- Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
- Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to HDFS. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle-hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as there are no Backup nodes registered with the system.
- Backup node: An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
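The rack-awareness item above is worth a closer look. For a replication factor of 3, HDFS's default placement policy puts the first replica on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. A small simulation of that rule (plain Python; the node and rack names are illustrative, and the real policy also weighs load and available space):

```python
import random

def place_replicas(writer, topology, seed=None):
    """Sketch of HDFS's default rack-aware placement for 3 replicas.

    topology: dict mapping node name -> rack name.
    Returns a list of 3 distinct node names.
    """
    rng = random.Random(seed)
    first = writer                                   # replica 1: writer's node
    remote = [n for n in topology
              if topology[n] != topology[first]]
    second = rng.choice(remote)                      # replica 2: different rack
    peers = [n for n in topology
             if topology[n] == topology[second] and n != second]
    third = rng.choice(peers)                        # replica 3: same rack as 2nd
    return [first, second, third]

# Hypothetical 4-node, 2-rack cluster
topology = {"dn1": "rack-a", "dn2": "rack-a", "dn3": "rack-b", "dn4": "rack-b"}
print(place_replicas("dn1", topology, seed=0))
```

This layout survives the loss of an entire rack while keeping two of the three replicas close together to limit cross-rack write traffic.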
Congratulations! You have finished Tutorial 2. I will post the following tutorials soon!