Bridges User Guide: Hadoop and Spark
The /home file system, which contains your home directory, is available on all Bridges’ Hadoop nodes.
The Hadoop filesystem, HDFS, is available from all Hadoop nodes. There is no explicit quota for the HDFS, but it uses your $SCRATCH disk space. Please delete any files you don’t need when your job has ended.
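For example, to check how much HDFS space your files occupy and to remove a directory you no longer need, you could run commands like the following (the /user/$USER paths are illustrative; substitute your own paths):
hadoop fs -du -h /user/$USER
hadoop fs -rm -r /user/$USER/old-results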
Files must reside in HDFS to be used in Hadoop jobs. Putting files into HDFS requires these steps:
- Transfer the files to the namenode with scp or sftp
- Format them for ingestion into HDFS
- Use the hadoop fs -put command to copy the files into HDFS. This command distributes your data files across the cluster’s datanodes.
The hadoop fs command should be in your command path by default.
Documentation for the hadoop fs command lists other options, which can be used to list your files in HDFS, delete HDFS files, copy files out of HDFS, and perform other file operations.
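As an illustration (the file and directory names here are hypothetical), the following commands copy a local file into HDFS, list the target directory, and copy a file back out of HDFS:
hadoop fs -put mydata.txt /datasets/mydata.txt
hadoop fs -ls /datasets
hadoop fs -get /datasets/mydata.txt $SCRATCH/mydata.txt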
To request the installation of data ingestion tools on the Hadoop cluster send email to firstname.lastname@example.org.
Accessing the Hadoop/Spark cluster
To start using Hadoop and Spark with YARN and HDFS on Bridges, connect to the login node and issue the following commands:
interact -N 3
# you will need to wait until resources are allocated to you before continuing
module load hadoop
start-hadoop.sh
Your cluster will be set up and you’ll be able to run Hadoop and Spark jobs. The cluster requires a minimum of three nodes (-N 3). Larger jobs may require a reservation. Please contact email@example.com if you would like to use more than 8 nodes or run for longer than 8 hours.
Please note that when your job ends your HDFS will be unavailable so be sure to retrieve any data you need before your job finishes.
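For example, before your session ends you might copy a results directory from HDFS back to your $SCRATCH space (the HDFS path below is hypothetical):
hadoop fs -get /user/$USER/wordcount-out $SCRATCH/wordcount-out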
Web interfaces are currently not available for interactive jobs but can be made available for reservations.
The Spark data processing framework is available on Bridges. Spark, which builds on the HDFS filesystem, extends the Hadoop MapReduce paradigm in several directions. It supports a wider variety of workflows than MapReduce and, most importantly, allows you to process some or all of your data in memory, enabling very fast parallel processing of your data.
Python, Java and Scala are available for Spark applications. The pyspark interpreter is especially effective for interactive, exploratory tasks in Spark. To use Spark you must first load your data into Spark’s highly efficient distributed data structure, the Resilient Distributed Dataset (RDD).
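As a minimal sketch of interactive use (assuming the input file has already been copied into HDFS as described above), you could start the pyspark interpreter on your cluster, load a text file into an RDD, and cache it in memory for repeated queries:
pyspark --master yarn
>>> lines = sc.textFile("/datasets/compleat.txt")
>>> lines.cache()          # keep the RDD in memory for repeated use
>>> lines.count()          # number of lines in the file
>>> lines.filter(lambda l: "angler" in l).take(5)   # first five matching lines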
Spark example using YARN
Here is an example command to run a Spark job using YARN. This example calculates pi using 10 iterations.
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 10
To view the full output:
yarn logs -applicationId <applicationId>
where <applicationId> is the YARN application ID assigned to your job by the cluster.
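YARN application IDs have the form application_<cluster timestamp>_<sequence number> and are printed when the job is submitted, so the command might look like this (illustrative ID):
yarn logs -applicationId application_1520000000000_0001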
A simple Hadoop example
This section demonstrates how to run a MapReduce Java program on the Hadoop cluster. This is the standard paradigm for Hadoop jobs. If you want to run jobs using another framework or in other languages besides Java send email to firstname.lastname@example.org for assistance.
Follow these steps to run a job on the Hadoop cluster; a complete example command sequence is shown after the list. All the commands listed below should be in your command path by default, and the HADOOP_HOME environment variable should also be set for you.
- Compile your Java MapReduce program with a command similar to:
hadoop com.sun.tools.javac.Main -d WordCount WordCount.java
- WordCount is the name of the output directory where you want your class files to be put (set with the -d option)
- WordCount.java is the name of your source file
- Create a jar file out of your class file with a command similar to:
jar -cvf WordCount.jar -C WordCount/ .
- WordCount.jar is the name of your output jar file
- WordCount is the name of the directory which contains your class file
- Make an input directory in the HDFS, if it doesn’t already exist:
hdfs dfs -mkdir -p /datasets
- Transfer an input file to the /datasets directory in HDFS:
hdfs dfs -put /home/training/hadoop/datasets/compleat.txt /datasets
- Launch your Hadoop job with the hadoop command
Once you have your jar file, you can run the hadoop command to launch your Hadoop job. Your hadoop command will be similar to:
hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT
- WordCount.jar is the name of your jar file
- org.myorg.WordCount is the fully qualified name of your main class; it reflects the package (folder) hierarchy inside your jar file. Substitute the appropriate package and class name for your program.
- /datasets/compleat.txt is the path to your input file in the HDFS file system. This file must already exist in HDFS.
- $MYOUTPUT is the path to your output directory, which will be created in the HDFS file system and must not already exist. You must set this variable to the output path before you issue the hadoop command.
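Putting these steps together, a complete run of the word-count example might look like the following; the output path is hypothetical, and the word-count results appear as part-* files inside that directory:
hadoop com.sun.tools.javac.Main -d WordCount WordCount.java
jar -cvf WordCount.jar -C WordCount/ .
hdfs dfs -mkdir -p /datasets
hdfs dfs -put /home/training/hadoop/datasets/compleat.txt /datasets
export MYOUTPUT=/user/$USER/wordcount-out
hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT
hadoop fs -cat $MYOUTPUT/part-* | head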
After you issue the hadoop command your job is controlled by the Hadoop scheduler to run on the datanodes. The scheduler is currently a strictly FIFO scheduler. If your job turnaround is not meeting your needs, send email to email@example.com.
When your job finishes, the hadoop command will end and you will be returned to the system prompt.
Other Hadoop technologies
An entire ecosystem of technologies has grown up around Hadoop, such as HBase and Hive. To request the installation of a different package send email to firstname.lastname@example.org.