Bridges User Guide: Hadoop and Spark

If you want to run Hadoop or Spark on Bridges, you should note that when you apply for your account.

File systems


The /home file system, which contains your home directory, is available on all Bridges’ Hadoop nodes.


The Hadoop filesystem, HDFS, is available from all  Hadoop nodes. There is no explicit quota for the HDFS, but it uses your $SCRATCH disk space. Please delete any files you don’t need when your job has ended. 

Files must reside in HDFS to be used in Hadoop jobs. Putting files into HDFS requires these steps:

  1. Transfer the files  to the namenode  with scp or sftp
  2. Format them for ingestion into HDFS
  3. Use the hadoop fs -put command to copy the files into HDFS.  This command distributes your data files across the cluster’s datanodes. 

The hadoop fs command should be in your command path by default.

Documentation for the hadoop fs command lists other options. These options can be used to list your files in HDFS, delete HDFS files, copy files out of HDFS and other file operations.

To request the installation of data ingestion tools on the Hadoop cluster send email to

Accessing the Hadoop /Spark cluster 

 To start using Hadoop and Spark with Yarn and HDFS on Bridges, connect to the login node and issue the following commands:

interact -N 3
 # you will need to wait until resources are allocated to you before continuing
 module load hadoop

Your cluster will be set up and you’ll be able to run hadoop and spark jobs. The cluster requires a minimum of three nodes (-N 3). Larger jobs may require a reservation.  Please contact if you would like to use more than 8 nodes or run for longer than 8 hours.

Please note that when your job ends your HDFS will be unavailable so be sure to retrieve any data you need before your job finishes.

Web interfaces are currently not available for interactive jobs but can be made available for reservations.


The Spark data framework is available on Bridges. Spark, built on the HDFS filesystem,  extends the Hadoop MapReduce paradigm in several directions. It supports a wider variety of workflows than MapReduce. Most importantly, it allows you to process some or all of your data in memory if you choose. This enables very fast parallel processing of your data.

Python, Java and Scala are available for Spark applications. The pyspark interpreter is especially effective for interactive, exploratory tasks in Spark. To use Spark you must first load your data into Spark’s highly efficient file structure called  Resilient Distributed Dataset (RDD).

Extensive online documentation is available at the  Spark web site. If you have questions about or encounter problems using Spark, send email to

Spark example using Yarn

Here is an example command to run a Spark job using yarn.  This example calculates pi using 10 iterations.

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 10

To view the full output:

yarn logs -applicationId yarnapplicationId

where yarnapplicationId is the yarn applicationId assigned by the cluster.


A simple Hadoop example

This section demonstrates how to run a MapReduce Java program on the Hadoop cluster. This is the standard paradigm for Hadoop jobs. If you want to run jobs using another framework or in other languages besides Java send email to for assistance.

Follow these steps to run a job on the Hadoop cluster. All the commands listed below should be in your command path by default. The variable HADOOP_HOME should be set for you also.

  1. Compile your Java MapReduce program with a command similar to:
    hadoop WordCount


    • WordCount is the name of the output directory where you want your class file to be put
    • is the name of your source file


  2. Create a jar file out of your class file with a command similar to:
    jar -cvf WordCount.jar -C WordCount/ .


    • WordCount.jar is the name of your output jar file
    • WordCount is the name of the directory which contains your class file
  3. Make an input directory in the HDFS, if it doesn’t already exist:
    hdfs dfs -mkdir -p /datasets
  4. Transfer an input file to the /datasets directory in HDFS
    hdfs dfs -put /home/training/hadoop/datasets/compleat.txt /datasets
  5. Launch your Hadoop job with the hadoop commandOnce you have your jar file you can run the  hadoop command to launch your Hadoop job. Your hadoop command will be similar to
    hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT


    • Wordcount.jar is the name of your jar file
    • org.myorg.WordCount specifies the folder hierarchy inside your jar file. Substitute the appropriate hierarchy for your jar file.
    • /datasets/compleat.txt is the path to your input file in the HDFS file system. This file must already exist in HDFS.
    • $MYOUTPUT is the path to your output file, which will be saved in the HDFS file system. You must set this variable to the output file path before you issue the hadoop command.

After you issue the hadoop command your job is controlled by the Hadoop scheduler to run on the datanodes. The scheduler is currently a stricty FIFO scheduler. If your job turnaround is not meeting your needs send email to

When your job finishes, the hadoop command will end and you will be returned to the system prompt.

Other Hadoop technologies

An entire ecosystem of technologies has grown up around Hadoop, such as HBase and Hive.  To request the installation of a different package send email to