Spark is a big data programming framework. On Bridges-2 it can be used in interactive or batch mode.
This example shows how to use Spark in an interactive session in the RM-shared partition. This is what you want if you are learning Spark. For advanced scalable usage, see the Batch job section below.
Type the commands that follow each shell or IPython prompt. All other lines are system output.
    [username@bridges2-login014 ~]$ interact
    A command prompt will appear when your session begins
    "Ctrl+d" or "exit" will end your session
    --partition RM-small,RM-shared
    salloc -J Interact --partition RM-small,RM-shared
    salloc: Pending job allocation 208982
    salloc: job 208982 queued and waiting for resources
    salloc: job 208982 has been allocated resources
    salloc: Granted job allocation 208982
    salloc: Waiting for resource configuration
    salloc: Nodes r009 are ready for job
    [username@r009 ~]$ module load spark
    [username@r009 ~]$ pyspark
    Python 3.8.5 (default, Sep 4 2020, 07:30:14)
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
    21/02/17 15:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/
    Using Python version 3.8.5 (default, Sep 4 2020 07:30:14)
    SparkSession available as 'spark'.
    In : print(sc.parallelize(range(1000)).takeSample('false',10))
    [830, 65, 73, 45, 494, 216, 990, 360, 648, 622]
    In : exit
    [username@r009 ~]$ exit
    exit
    salloc: Relinquishing job allocation 208982
    [username@bridges2-login014 ~]$
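One detail in the session above is worth a closer look: the first argument to takeSample is the string 'false', not the Python boolean False. That argument is the withReplacement flag, and a non-empty string is truthy in Python, so the call most likely samples with replacement even though the name suggests otherwise. The following is a minimal plain-Python sketch of that distinction, not Spark code. It uses the standard random module as a local stand-in for the two sampling modes, and the names population and rng and the seed 42 are illustrative assumptions.

```python
import random

# A non-empty string is truthy, so 'false' acts like True when used as a flag.
flag_from_session = 'false'
print(bool(flag_from_session))  # True
print(bool(False))              # False

# Local stand-in for the two sampling modes (plain Python, no Spark needed).
population = range(1000)
rng = random.Random(42)  # fixed seed only so that reruns give the same sample

# Without replacement: every drawn value is distinct.
without_replacement = rng.sample(population, 10)

# With replacement: the same value may be drawn more than once.
with_replacement = rng.choices(population, k=10)

print(without_replacement)
print(with_replacement)

# In a pyspark session, the without-replacement form of the call would be:
#   sc.parallelize(range(1000)).takeSample(False, 10)
```

With only 10 draws from 1000 values, duplicates are unlikely in either mode, which is why the output in the session shows no repeated numbers.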
The following example job file is on Bridges-2 at /opt/packages/spark/scripts/spark-batch-example.job. It starts a Spark cluster across the nodes of a multi-node job, with the cluster size set by the -N argument, and runs a script that you provide.
Use the sbatch command to submit this job script. For more about submitting jobs and the sbatch command, see the Running jobs section of the Bridges-2 User Guide.
    #!/bin/bash
    #SBATCH -N 3
    #SBATCH -p RM
    ###
    #
    # Example Spark Batch Job Script
    #
    # sbatch /opt/packages/spark/scripts/spark-batch-example.job
    # Adjust the -N argument to the number of required nodes
    # Replace spark-example.py with your own script
    #
    ###
    /opt/packages/spark/scripts/spark-cluster-init.sh &
    # wait a few seconds for the daemons to start
    sleep 10
    module load spark
    # replace /opt/packages/spark/scripts/spark-example.py with your own script
    $SPARK_HOME/bin/spark-submit --master spark://`hostname -s`.ib.bridges2.psc.edu:7077 /opt/packages/spark/scripts/spark-example.py
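The --master argument in the job script is built from three parts: the short host name of the node where the script runs, the ib.bridges2.psc.edu domain, and port 7077. As a rough sketch of that construction, the hypothetical Python helpers below assemble the same URL and command line. The names spark_master_url and spark_submit_command are illustrative and are not part of Spark or of the Bridges-2 scripts. The default domain and port are copied from the job script, and the /opt/spark path and my-script.py name in the usage lines are placeholders.

```python
def spark_master_url(hostname, domain="ib.bridges2.psc.edu", port=7077):
    """Build a standalone-cluster master URL the way the job script does."""
    # Keep only the part before the first dot, like `hostname -s`.
    short_name = hostname.split(".")[0]
    return f"spark://{short_name}.{domain}:{port}"


def spark_submit_command(script_path, hostname, spark_home):
    """Return the spark-submit command as an argument list (hypothetical helper)."""
    return [
        f"{spark_home}/bin/spark-submit",
        "--master",
        spark_master_url(hostname),
        script_path,
    ]


# Usage with placeholder values (r009 is the node name from the session above).
url = spark_master_url("r009.bridges2.psc.edu")
print(url)  # spark://r009.ib.bridges2.psc.edu:7077

command = spark_submit_command("my-script.py", "r009", "/opt/spark")
print(" ".join(command))
```

Returning the command as a list rather than a single string keeps each argument separate, so it could be handed to a process launcher without extra shell quoting.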
More in-depth instruction on using Spark was part of the XSEDE Big Data training course. See https://www.psc.edu/resources/training/hpc-workshop-series/ for more information and training slides. Videos of past workshops can be found on the XSEDE YouTube channel at https://www.youtube.com/c/XSEDETraining.