Bridges User Guide
Hadoop and Spark
If you want to run Hadoop or Spark on Bridges, you should note that when you apply for your account.
Accessing the Hadoop/Spark cluster
Request access to Bridges' Hadoop cluster by completing the Hadoop/Spark request form. You will be contacted within one business day about your request. When your reservation is ready, you will be emailed a list of the nodes in your reservation. The lowest-numbered node is the namenode.
Connect to the Hadoop cluster
Log into bridges.psc.edu. From there, ssh to your namenode. You can now run or submit Hadoop jobs.
Monitor your jobs
If you like, you can monitor your Hadoop/Spark jobs while they run, including browsing the filesystem, by setting up a web proxy. See the Hadoop Proxy Set-up Guide for instructions.
The /home file system, which contains your home directory, is available on all of these nodes.
The Hadoop filesystem, HDFS, is available from all Hadoop nodes. There is no explicit quota for HDFS; it is limited by the disk space on your reserved nodes.
Files must reside in HDFS to be used in Hadoop jobs. Putting files into HDFS requires these steps:
- Transfer the files to the namenode with scp or sftp
- Format them for ingestion into HDFS
- Use the hadoop fs -put command to copy the files into HDFS. This command distributes your data files across the cluster's datanodes.
The hadoop fs command should be in your command path by default.
Documentation for the hadoop fs command lists other options. These options can be used to list your files in HDFS, delete HDFS files, copy files out of HDFS and other file operations.
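For example, the steps above might look like the following. The file and directory names here are illustrative; substitute your own:

```shell
# Create a directory in HDFS and copy a local file into it
hadoop fs -mkdir -p /user/$USER/input
hadoop fs -put mydata.txt /user/$USER/input

# List files in HDFS
hadoop fs -ls /user/$USER/input

# Copy a file out of HDFS to the local filesystem
hadoop fs -get /user/$USER/input/mydata.txt retrieved.txt

# Delete a file from HDFS
hadoop fs -rm /user/$USER/input/mydata.txt
```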
A Simple Hadoop Example
Follow these steps to run a job on the Hadoop cluster. All the commands listed below should be in your command path by default. The variable HADOOP_HOME should be set for you also.
- Compile your Java MapReduce program with a command similar to:
hadoop com.sun.tools.javac.Main -d WordCount WordCount.java
- WordCount is the output directory where your class files will be put; create it first if it does not already exist
- WordCount.java is the name of your source file
- Create a jar file out of your class file with a command similar to:
jar -cvf WordCount.jar -C WordCount/ .
- WordCount.jar is the name of your output jar file
- WordCount is the name of the directory which contains your class file
- Launch your Hadoop job with the hadoop command. Once you have your jar file, your hadoop command will be similar to:
hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT
- WordCount.jar is the name of your jar file
- org.myorg.WordCount is the fully qualified name of your main class; the package name reflects the directory hierarchy inside your jar file. Substitute the appropriate class name for your jar file.
- /datasets/compleat.txt is the path to your input file in the HDFS file system. This file must already exist in HDFS.
- $MYOUTPUT is the path to your output directory, which will be created in the HDFS file system. It must not already exist. You must set this variable to the output path before you issue the hadoop command.
When your job finishes, the hadoop command will end and you will be returned to the system prompt.
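The MapReduce logic that WordCount implements can be sketched in plain Python, with no Hadoop required. This local sketch only illustrates the map and reduce phases; on the cluster, Hadoop runs the equivalent steps in parallel across the datanodes:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group the pairs by word and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
print(reduce_phase(map_phase(lines)))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```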
Spark
The Spark data framework is available on Bridges. Spark, which can use the HDFS filesystem, extends the Hadoop MapReduce paradigm in several directions. It supports a wider variety of workflows than MapReduce. Most importantly, it allows you to process some or all of your data in memory if you choose, enabling very fast parallel processing of your data.
Python, Java and Scala are available for Spark applications. The pyspark interpreter is especially effective for interactive, exploratory tasks in Spark. To use Spark you must first load your data into Spark's core data structure, the Resilient Distributed Dataset (RDD).
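As a sketch of what this looks like, the word-count example above can be expressed as RDD transformations in the pyspark interpreter, where a SparkContext named sc already exists. The HDFS path is illustrative:

```python
# Load a text file from HDFS into an RDD (the path is an example)
rdd = sc.textFile("/datasets/compleat.txt")

# Word count expressed as RDD transformations:
# split lines into words, pair each word with 1, then sum by key
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))

counts.take(10)   # inspect the first ten (word, count) pairs
```

Because RDD transformations are lazy, no data is read until an action such as take or saveAsTextFile is called.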
Other Hadoop Technologies