Users can run Hadoop on Gordon using the myHadoop infrastructure, which integrates Hadoop cluster configuration and setup within Gordon's normal job-scheduling environment. The Hadoop Distributed File System (HDFS) is built on the high-performance flash drives (SSDs) mounted on each compute node via iSER. Internode communication is implemented using the IPoIB protocol.
The Hadoop distribution on Gordon is located in /opt/hadoop, and example scripts are found in /opt/hadoop/contrib/myHadoop.
The example scripts have a large number of parameters, configuration settings, and environment variables; several of these are described in detail below. Do not alter the order of content in the scripts, because many of the steps depend on earlier definitions.
The following lines set the number of nodes (two nodes are used in this example). Note that the node count must be set consistently in the PBS and configure lines and that ppn should always be set to 1.
#PBS -l nodes=2:ppn=1
$MY_HADOOP_HOME/bin/configure.sh -n 2 -c $HADOOP_CONF_DIR
export MY_HADOOP_HOME="/opt/hadoop/contrib/myHadoop"
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/home/$USER/config"
The only user-specific variable is HADOOP_CONF_DIR. This is the location to which the configuration files for the Hadoop cluster will be copied. Make sure this is a location readable by all the nodes in the cluster (i.e., a global location such as your home directory).
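If you run several Hadoop jobs concurrently, a job-specific configuration directory prevents one job from overwriting another's settings. The path below is purely illustrative and not part of the standard setup:
# illustrative alternative: a per-job configuration directory
export HADOOP_CONF_DIR="/home/$USER/config.$PBS_JOBID"
mkdir -p $HADOOP_CONF_DIR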
You can also set a wall-clock limit, email notifications, and the account to charge:
#PBS -l walltime=1:00:00
#PBS -M EMAILADDRESS
#PBS -m abe
#PBS -A ACCOUNT
To ensure that the IPoIB (IP over InfiniBand) interface is used for the network, the following lines should be added to the script:
sed 's/$/.ibnet0/' $PBS_NODEFILE > $PBS_O_WORKDIR/hadoophosts.txt
export PBS_NODEFILEZ=$PBS_O_WORKDIR/hadoophosts.txt
This appends the ibnet0 interface suffix to each host name in the node list. The myHadoop configuration script looks for the list in the file pointed to by PBS_NODEFILEZ. Without these lines, Hadoop would use the much slower 1GbE management network.
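For example, if $PBS_NODEFILE contains the (hypothetical) host name gcn-14-08, the corresponding line in hadoophosts.txt becomes gcn-14-08.ibnet0, which resolves to that node's InfiniBand interface.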
The configuration files are set up by the following command. Make sure the node count, environment variable, and IPoIB steps described above are in the script before this step.
$MY_HADOOP_HOME/bin/configure.sh -n 2 -c $HADOOP_CONF_DIR
This step must follow the configure step above. The sed command replaces the HADDTEMP placeholder in hadoop-env.sh with the job ID, and the second command formats the HDFS namenode:
sed -i 's@HADDTEMP@'$PBS_JOBID'@g' $HADOOP_CONF_DIR/hadoop-env.sh
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format
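As an optional sanity check (not part of the example scripts), you can confirm that the placeholder substitution took effect:
# optional: the job ID should now appear in hadoop-env.sh
grep $PBS_JOBID $HADOOP_CONF_DIR/hadoop-env.sh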
At this point all the basic Hadoop configuration steps are complete, and the cluster can be started:
$HADOOP_HOME/bin/start-all.sh
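To verify that the cluster came up correctly, you can ask HDFS for a status report. This check is optional and not part of the example scripts:
# optional check: report HDFS capacity and the list of live datanodes
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfsadmin -report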
Create a directory in HDFS and copy a file into it from the local filesystem:
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs -mkdir Test
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs -copyFromLocal ~/.bashrc Test
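With data in HDFS, a MapReduce job can be run against it. The sketch below uses the wordcount program from the examples jar shipped with Hadoop; the exact jar name varies with the Hadoop version, so the wildcard path is an assumption:
# run the bundled wordcount example on the Test directory
# (examples jar name varies by Hadoop version; the wildcard is an assumption)
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR jar $HADOOP_HOME/hadoop-*-examples.jar wordcount Test Test-output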
Once the jobs are complete, the Hadoop cluster can be stopped as follows:
$HADOOP_HOME/bin/stop-all.sh
Keep in mind that the SSDs are cleaned up once the job is complete. Hence, any data needed long term must be copied out of HDFS before the job ends.
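For example, before running stop-all.sh, results can be staged back to the submission directory; the Test-output path here refers to the hypothetical wordcount run shown above:
# copy results out of HDFS to the persistent submission directory
$HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR dfs -copyToLocal Test-output $PBS_O_WORKDIR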