
Spark

Spark is a big data programming framework. On Bridges-2 it can be used in interactive or batch mode.

Interactive session

This example shows how to use Spark in an interactive session in the RM-shared partition. This is the best place to start if you are learning Spark. For larger, scalable runs, see the Batch job section below.

Type the commands that follow each shell prompt ($) or IPython prompt (In [n]:). All other lines are output.

[username@bridges2-login014 ~]$ interact

A command prompt will appear when your session begins
"Ctrl+d" or "exit" will end your session

--partition RM-small,RM-shared
salloc -J Interact --partition RM-small,RM-shared
salloc: Pending job allocation 208982
salloc: job 208982 queued and waiting for resources
salloc: job 208982 has been allocated resources
salloc: Granted job allocation 208982
salloc: Waiting for resource configuration
salloc: Nodes r009 are ready for job
[username@r009 ~]$ module load spark
[username@r009 ~]$ pyspark
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
21/02/17 15:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 3.8.5 (default, Sep  4 2020 07:30:14)
SparkSession available as 'spark'.

In [1]: print(sc.parallelize(range(1000)).takeSample(False, 10))
[830, 65, 73, 45, 494, 216, 990, 360, 648, 622]

In [2]: exit
[username@r009 ~]$ exit
exit
salloc: Relinquishing job allocation 208982
[username@bridges2-login014 ~]$

Batch job

The following example job file is on Bridges-2 at /opt/packages/spark/scripts/spark-batch-example.job. It starts a Spark cluster across all the nodes of a multi-node job, however many you request, and runs a script that you provide.

Use the sbatch command to submit this job script. For more about submitting jobs and the sbatch command, see the Running jobs section of the Bridges-2 User Guide.

#!/bin/bash
#SBATCH -N 3
#SBATCH -p RM

###
#
# Example Spark Batch Job Script
#
# sbatch /opt/packages/spark/scripts/spark-batch-example.job
# Adjust the -N argument to the number of required nodes
# Replace spark-example.py with your own script
#
###

/opt/packages/spark/scripts/spark-cluster-init.sh &
# wait a few seconds for the daemons to start
sleep 10

module load spark

# replace /opt/packages/spark/scripts/spark-example.py with your own script
$SPARK_HOME/bin/spark-submit --master spark://`hostname -s`.ib.bridges2.psc.edu:7077 /opt/packages/spark/scripts/spark-example.py
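
The job script above runs spark-example.py, whose contents are not shown here. As a rough idea of what a replacement script could look like, the following sketch estimates pi by Monte Carlo sampling. It is not the PSC example: the helper names, the application name, and the sample count are all made up for illustration. It leaves the master unset on purpose, so that the --master argument given to spark-submit decides where the work runs.

```python
import importlib.util
import random


def inside_unit_circle(_):
    """Draw one random point in the unit square; return 1 if it falls inside the quarter circle."""
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0


def estimate_pi(hits, samples):
    """Turn a hit count into an estimate of pi (quarter-circle area is pi/4)."""
    return 4.0 * hits / samples


def main(samples=1_000_000):
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    # No .master(...) call: spark-submit --master supplies the cluster URL.
    spark = SparkSession.builder.appName("pi-sketch").getOrCreate()
    sc = spark.sparkContext

    # Spread the samples across the cluster and add up the hits.
    hits = sc.parallelize(range(samples)).map(inside_unit_circle).sum()
    print(f"pi is roughly {estimate_pi(hits, samples):.4f}")

    spark.stop()


if __name__ == "__main__":
    if importlib.util.find_spec("pyspark") is None:
        print("pyspark is not importable; run this file with spark-submit")
    else:
        main()
```

To try it, save the file in one of your own directories and put its path in place of /opt/packages/spark/scripts/spark-example.py on the spark-submit line.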

More in-depth instruction on using Spark was part of the XSEDE Big Data training course. See https://www.psc.edu/resources/training/hpc-workshop-series/ for more information and training slides. Videos of past workshops can be found on the XSEDE YouTube channel at https://www.youtube.com/c/XSEDETraining.