Bridges User Guide

 

Hadoop and Spark

If you want to run Hadoop or Spark on Bridges, you should note that when you apply for your account.

Accessing the Hadoop/Spark cluster

Request resources

Request access to Bridges' Hadoop cluster by completing the Hadoop/Spark request form. You will be contacted within one business day about your request. When your reservation is ready, you will be emailed a list of the nodes in your reservation. The lowest-numbered node in the list is the namenode.

Connect to the Hadoop cluster

Log into bridges.psc.edu.  From there, ssh to your namenode.   You can now run or submit Hadoop jobs.
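
For example, if the reservation email lists a namenode with the hypothetical name r448 (substitute the actual node name from your email), the two hops would look like:

    ssh username@bridges.psc.edu    # log in to a Bridges login node with your PSC username
    ssh r448                        # from there, connect to your namenode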

Monitor your jobs

If you like, you can monitor your Hadoop/Spark jobs while they run, including browsing the filesystem, by setting up a web proxy.   See the Hadoop Proxy Set-up Guide for instructions.

 

File Systems

/home

The /home file system, which contains your home directory, is available on all of the nodes in your Hadoop reservation.

HDFS

The Hadoop filesystem, HDFS, is available from all Hadoop nodes. There is no explicit quota for HDFS; it uses the local disk space on your reserved nodes.

Files must reside in HDFS to be used in Hadoop jobs. Putting files into HDFS requires these steps:

  1. Transfer the files to the namenode with scp or sftp.
  2. Format them for ingestion into HDFS.
  3. Use the hadoop fs -put command to copy the files into HDFS; see the example below. This command distributes your data files across the cluster's datanodes.
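
A minimal sketch of steps 1 and 3 (step 2 depends on your data format), assuming a local file mydata.csv and a hypothetical HDFS directory /user/$USER that you own:

    scp mydata.csv username@bridges.psc.edu:    # from your workstation: copy the file to your /home directory
    hadoop fs -put mydata.csv /user/$USER/      # on the namenode: copy the file from /home into HDFS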

The hadoop fs command should be in your command path by default.

Documentation for the hadoop fs command lists other options. These options can be used to list your files in HDFS, delete HDFS files, copy files out of HDFS, and perform other file operations.
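
For example, assuming hypothetical HDFS paths under /user/$USER (substitute your own):

    hadoop fs -ls /user/$USER                   # list your files in HDFS
    hadoop fs -get /user/$USER/results.txt .    # copy a file out of HDFS to the current directory
    hadoop fs -rm /user/$USER/old.csv           # delete a file from HDFS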

To request the installation of data ingestion tools on the Hadoop cluster, contact Bridges user support.

 

A Simple Hadoop Example

This section demonstrates how to run a MapReduce Java program on the Hadoop cluster. This is the standard paradigm for Hadoop jobs. If you want to run jobs using another framework, or in a language other than Java, contact Bridges user support for assistance.

Follow these steps to run a job on the Hadoop cluster. All the commands listed below should be in your command path by default. The variable HADOOP_HOME should be set for you also.

  1. Compile your Java MapReduce program with a command similar to:
    hadoop com.sun.tools.javac.Main -d WordCount WordCount.java

    where:

    • WordCount is the name of the output directory where you want your class files to be put
    • WordCount.java is the name of your source file
  2. Create a jar file out of your class files with a command similar to:
    jar -cvf WordCount.jar -C WordCount/ .

    where:

    • WordCount.jar is the name of your output jar file
    • WordCount is the name of the directory which contains your class file
  3. Launch your Hadoop job with the hadoop command

    Once you have your jar file, you can run the hadoop command to launch your Hadoop job. Your hadoop command will be similar to:

    hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT

    where:

    • WordCount.jar is the name of your jar file
    • org.myorg.WordCount is the fully qualified name of your main class; it reflects the package (folder) hierarchy inside your jar file. Substitute the appropriate class name for your jar file.
    • /datasets/compleat.txt is the path to your input file in the HDFS file system. This file must already exist in HDFS.
    • $MYOUTPUT is the path to your output directory, which will be created in the HDFS file system. You must set this variable before you issue the hadoop command, as in the example below.
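
    For example, assuming the hypothetical output path /user/$USER/wordcount-out (for a standard MapReduce job, the output directory must not already exist in HDFS), you would set the variable and launch the job with:

    export MYOUTPUT=/user/$USER/wordcount-out
    hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT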

After you issue the hadoop command, your job is controlled by the Hadoop scheduler, which runs it on the datanodes. The scheduler is currently strictly FIFO. If your job turnaround is not meeting your needs, contact Bridges user support.

When your job finishes, the hadoop command will end and you will be returned to the system prompt.
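
Your results remain in HDFS under the output path you chose. Continuing the hypothetical $MYOUTPUT example above, you could inspect and retrieve them with:

    hadoop fs -ls $MYOUTPUT                   # MapReduce typically writes one part-r-* file per reducer
    hadoop fs -cat $MYOUTPUT/part-r-00000     # print the results
    hadoop fs -get $MYOUTPUT wordcount-out    # copy the whole output directory to local disk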

 

Spark

The Spark data framework is available on Bridges. Spark, which runs on top of the HDFS filesystem here, extends the Hadoop MapReduce paradigm in several directions. It supports a wider variety of workflows than MapReduce. Most importantly, it allows you to process some or all of your data in memory if you choose, which enables very fast parallel processing of your data.

Python, Java and Scala are available for Spark applications. The pyspark interpreter is especially effective for interactive, exploratory tasks in Spark. To use Spark you must first load your data into Spark's distributed, in-memory data structure, the Resilient Distributed Dataset (RDD), as in the sketch below.
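
As a minimal sketch, the pyspark session below loads the /datasets/compleat.txt file used in the MapReduce example above into an RDD and counts words (any text file in HDFS will do):

    # at the pyspark prompt; sc is the SparkContext that pyspark creates for you
    lines = sc.textFile("/datasets/compleat.txt")         # read the HDFS file into an RDD
    counts = (lines.flatMap(lambda line: line.split())    # split each line into words
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))      # count occurrences in parallel
    counts.take(5)                                        # bring a few results back to the driver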

Extensive online documentation is available at the Spark web site. If you have questions about or encounter problems using Spark, contact Bridges user support.

 

Other Hadoop Technologies

An entire ecosystem of technologies has grown up around Hadoop, including HBase and Hive. To request the installation of an additional package, contact Bridges user support.
