Bridges User Guide

 

Running Jobs

The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Bridges' compute nodes. All of your production computing must be done on Bridges' compute nodes; you cannot use Bridges' login nodes to do your work.

Several partitions have been set up in SLURM to allocate resources efficiently.  Partitions can be considered job queues.  Different partitions control different types of Bridges' resources;  they are configured by the type of node and other job requirements.  You will choose the appropriate partition to run your jobs in based on the resources you need.

Regardless of which partition you use, you can work on Bridges in either  interactive mode - where you type commands and receive output back to your screen as the commands complete- or batch mode - where you first create a batch (or job) script which contains the commands to be run, then submit the job to be run as soon as resources are available.


Partitions

Each SLURM partition manages a subset of Bridges' resources.  Each partition allocates resources to both interactive sessions and batch jobs that request resources from it.  There are five partitions organized by the type of resource they control:

  • RM, for jobs that will run on Bridges' RSM (128GB) nodes.
  • RM-shared, for jobs that will run on Bridges' RSM (128GB) nodes, but share a node with other jobs.
  • GPU, for jobs that will run on Bridges' GPU nodes.
  • GPU-shared, for jobs that will run on Bridges' GPU nodes, but share a node with other jobs
  • LM, for jobs that will run on Bridges' LSM and ESM (3TB and 12TB) nodes.

All the partitions use FIFO scheduling. If the top job in the partition will not fit on the machine, SLURM will skip that job and try to schedule the next job in the partition.

Note:  To make the most of your allocation, use the shared partitions whenever possible.  Jobs in the RM and GPU partitions are charged for the use of all cores on a node.  Jobs in the RM-shared and GPU-shared partitions share nodes, and are only charged for the cores they are allocated. The RM partition is the default for the sbatch command, while RM-shared is the default for the interact command. The interact and sbatch commands are discussed below.

This table summarizes the resources available and limits on Bridges' partitions.  More information on each partition follows.

RM
    Node type: RSM (128GB RAM, 28 cores, 8TB on-node storage)
    Nodes shared: No
    Node default: 1    Node max: 168 (if your research needs more than 168 nodes, contact Bridges user support to make special arrangements)
    Core default: 28/node    Core max: 28/node
    GPU default/max: N/A
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 128GB/node

RM-shared
    Node type: RSM (128GB RAM, 28 cores, 8TB on-node storage)
    Nodes shared: Yes
    Node default: 1    Node max: 1
    Core default: 1    Core max: 28
    GPU default/max: N/A
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 4.5GB/core

GPU, P100 nodes
    Node type: 2 GPUs, 2 16-core CPUs, 8TB on-node storage
    Nodes shared: No
    Node default: 1    Node max: 16
    Core default: 32/node    Core max: 32/node
    GPU default: 2 per node    GPU max: 2 per node
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 128GB/node

GPU, K80 nodes
    Node type: 128GB RAM, 4 GPUs, 2 14-core CPUs, 8TB on-node storage
    Nodes shared: No
    Node default: 1    Node max: 8
    Core default: 28/node    Core max: 28/node
    GPU default: 4 per node    GPU max: 4 per node
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 128GB/node

GPU-shared, P100 nodes
    Node type: 2 GPUs, 2 16-core CPUs, 8TB on-node storage
    Nodes shared: Yes
    Node default: 1    Node max: 1
    Core default: 16/GPU    Core max: 16/GPU
    GPU default: no default    GPU max: 2
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 7GB/GPU

GPU-shared, K80 nodes
    Node type: 4 GPUs, 2 14-core CPUs, 8TB on-node storage
    Nodes shared: Yes
    Node default: 1    Node max: 1
    Core default: 7/GPU    Core max: 7/GPU
    GPU default: no default    GPU max: 4
    Walltime default: 30 min    Walltime max: 48 hrs
    Memory: 7GB/GPU

LM
    Node types: LSM nodes (3TB RAM, 16TB on-node storage) and ESM nodes (12TB RAM, 64TB on-node storage)
    Nodes shared: Yes
    Node default: 1    Node max: 42 for 3TB nodes, 4 for 12TB nodes
    Cores: jobs in LM are allocated 1 core per 48GB of memory requested
    GPU default/max: N/A
    Walltime default: 30 min    Walltime max: 14 days
    Memory: up to 12000GB

 

Partition summaries

 

  • RM

  • RM-shared

  • GPU

  • GPU-shared

  • LM

RM partition

Jobs in the RM partition run on Bridges' RSM (128GB) nodes.  Jobs do not share nodes, and are allocated all 28 of the cores on each of the nodes assigned to them.  A job in the RM partition is charged for all 28 cores per node on its assigned nodes. 

RM jobs can use more than one node.  However, the memory space of  all the nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

The internode communication performance for jobs in the RM partition is best when using 42 or fewer nodes. 

When submitting a job to the RM partition, you should specify:

  • the number of  nodes
  • the walltime limit 

For information on requesting resources and submitting a job to the RM partition, see the sections below on the interact and sbatch commands.
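
For example, a batch job asking for 2 RM nodes and a 5-hour walltime could be submitted with a command like the one below; myscript.job is a placeholder for the name of your batch script.

sbatch -p RM -N 2 -t 5:00:00 myscript.job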

 

RM-shared partition

Jobs in the RM-shared partition run on Bridges' RSM (128GB) nodes.  Jobs will share nodes, but not cores.  A job in the RM-shared partition will be charged only for the cores allocated to it, so it will use fewer SUs than an RM job.  It could also start running sooner.

RM-shared jobs are assigned memory in proportion to the number of cores requested: a job receives the same fraction of the node's 128GB as the fraction of the node's 28 cores it requests (4.5GB per core). For example, a job requesting 14 cores is allocated half of the node's memory. If the job exceeds this memory allocation it will be killed.

When submitting a job to the RM-shared partition, you should specify:

  • the number of cores
  • the walltime limit

For information on requesting resources and submitting a job to the RM-shared partition, see the sections below on the interact and sbatch commands.
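
For example, an interactive RM-shared session using 4 cores for 2 hours could be requested with a command like the following.

interact -p RM-shared --ntasks-per-node=4 -t 2:00:00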

 

GPU partition

Jobs in the GPU partition use Bridges' GPU nodes.  Note that Bridges has 2 types of GPU nodes: K80s and P100s.  See the System Configuration section of this User Guide for the details of each type.

Jobs in the GPU partition do not share nodes; a job is allocated all of the cores and all of the GPUs on each node assigned to it. Your job will be charged for all the cores associated with your assigned nodes.

However, the memory space across nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

When submitting a job to the GPU partition, you must specify the number of GPUs.

You should also specify:

  • the type of node you want, K80 or P100, with the --gres option to the interact or sbatch commands.  K80 is the default if no type is specified.  See the sbatch command options below for more details.
  • the number of nodes
  • the walltime limit 

For information on requesting resources and submitting a job to the GPU partition, see the sections below on the interact and sbatch commands.
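
For example, a batch job asking for 2 P100 nodes (2 GPUs each) and a 5-hour walltime could be submitted with a command like the one below; myscript.job is a placeholder for the name of your batch script.

sbatch -p GPU --gres=gpu:p100:2 -N 2 -t 5:00:00 myscript.job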

 

GPU-shared partition

Jobs in the GPU-shared partition run on Bridges' GPU nodes.  Note that Bridges has 2 types of GPU nodes: K80s and P100s.  See the System Configuration section of this User Guide for the details of each type.

Jobs in the GPU-shared partition share nodes, but not cores. By sharing nodes your job will be charged less.  It could also start running sooner.

You will always run on (part of) one node in the GPU-shared partition.

Your jobs will be allocated memory in proportion to the number of requested GPUs. You get the fraction of the node's total memory in proportion to the fraction of GPUs you requested. If your job exceeds this amount of memory it will be killed.

When submitting a job to the GPU-shared partition, you must specify the number of GPUs.  

You should also specify:

  • the type of node you want, K80 or P100, with the --gres option to the interact or sbatch commands.  K80 is the default if no type is specified.  See the sbatch command options below for more details.
  • the walltime limit

For information on requesting resources and submitting a job to the GPU-shared partition, see the sections below on the interact and sbatch commands.
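
For example, an interactive GPU-shared session using 1 K80 GPU for 90 minutes could be requested with a command like the following.

interact -p GPU-shared --gres=gpu:k80:1 -t 1:30:00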

 

LM partition

Jobs in the LM partition always share nodes. They never span nodes.

When submitting a job to the LM partition, you must specify

  • the amount of memory in GB  - any value up to 12000GB can be requested
  • the walltime limit  

The number of cores assigned to jobs in the LM partition is proportional to the amount of memory requested: for every 48GB of memory requested you will be allocated 1 core. For example, a request for 480GB of memory is allocated 10 cores.

SLURM will place jobs on either a 3TB or a 12TB node based on the memory request.  Jobs asking for 3000GB or less will run on a 3TB node.  If no 3TB nodes are available but a 12TB node is available, the job will run on a 12TB node.

For information on requesting resources and submitting a job to the LM partition, see the sections below on the interact and sbatch commands.
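
For example, a batch job asking for 4TB of memory and a 10-hour walltime could be submitted with a command like the one below; myscript.job is a placeholder for the name of your batch script. Because more than 3000GB is requested, the job will be placed on a 12TB node and allocated 1 core per 48GB requested.

sbatch -p LM -t 10:00:00 --mem=4000GB myscript.job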

 

 

Interactive sessions

You must  be allocated the use of one or more Bridges' compute nodes by SLURM to work interactively on Bridges.  You cannot use the Bridges login nodes for your work.

You can run an interactive session in any of the SLURM partitions.  You will need to specify which partition you want,  so that the proper resources are allocated for your use.

Resources are set aside for interactive use. If those resources are all in use, your request will wait until the resources you need are available. Using a shared partition (RM-shared, GPU-shared) will probably allow your job to start sooner.

The interact command

To start an interactive session, use the command interact.  The format is

interact -options

The simplest interact command is

 interact

This command will start an interactive job using the defaults for interact, which are:

Partition: RM-shared
Cores: 1
Time limit: 60 minutes

 

The simplest interact command to start a GPU job is

 interact -gpu

This command will start an interactive job on a P100 node in the GPU-shared partition with 1 GPU and for 60 minutes.

 

Once the interact command returns with a command prompt you can enter your commands. The shell will be your default shell. When you are finished with your job type CTRL-D.

You will be charged for your resource usage from the time your job starts until you type CTRL-D, so be sure to type CTRL-D as soon as you are done.   

The maximum time you can request is 8 hours. Inactive interact jobs are logged out after 30 minutes of idle time.

Options for interact 

If you want to run in a different partition, use more than one core or set a different time limit, you will need to use options to the interact command. 

The available options are:

-p partition
    Partition requested.
    Default: RM-shared

-t HH:MM:SS
    Walltime requested.  The maximum time you can request is 8 hours.
    Default: 60:00 (1 hour)

-N n
    Number of nodes requested.
    Default: 1

-A groupname
    Group to charge the job to.  Find your default group.
    Default: your default group

-R reservation-name
    Reservation name, if you have one.  Use of -R does not automatically set any other interact options; you still need to specify the other options (partition, walltime, number of nodes) to override the defaults for the interact command.
    No default

--mem=nGB
    Amount of memory requested in GB.  Note the "--" for this option.  This option should only be used for the LM partition.
    No default

--gres=gpu:type:n
    Note the "--" for this option.  'type' is either p100 or k80; the default is k80.  'n' is the number of GPUs.  Valid choices are 1-4 when type=k80 and 1-2 when type=p100.
    No default

-gpu
    Runs your job on 1 P100 GPU in the GPU-shared partition.
    No default

--ntasks-per-node=n
    Number of cores to allocate per node.  Note the "--" for this option.
    Default: 1

-h
    Help; lists all the available command options.

Sample interact commands

Run in the RM-shared partition using 4 cores 

interact --ntasks-per-node=4

Run in the LM partition and request 2TB of memory

interact -p LM --mem=2000GB

Run in the GPU-shared partition and ask for 2 P100 GPUs.

interact -p GPU-shared --gres=gpu:p100:2

If you want more complex control over your interactive job you can use the srun command instead of the interact command.

 See the srun man page.
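
For example, a minimal srun command to start an interactive shell on 4 cores of an RM-shared node for one hour might look like the sketch below; adjust the partition, core count and walltime to your needs.

srun -p RM-shared --ntasks-per-node=4 -t 1:00:00 --pty bash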

 

Batch jobs

To run a batch job, you must first create a batch (or job) script, and then submit the script  using the sbatch command.  

A batch script is a file that consists of SBATCH directives, executable commands and comments.

SBATCH directives specify your resource requests and other job options in your batch script.  You can also specify resource requests and options on the sbatch command line; options given on the command line take precedence over those in the batch script.  The SBATCH directives must start in column 1 (that is, be the first text on a line, with no leading spaces) with '#SBATCH'.

Comments begin with a '#' character.

The first line of any batch script must indicate the shell to use for your batch job.

 

Sample batch scripts

Some sample scripts are given here.  Note that:

Each script uses the bash shell, indicated by the first line '#!/bin/bash'.  Some Unix commands will differ if you use another shell.

For username and groupname you must substitute your username and your appropriate group.

 

  • OpenMP job

  • MPI job

  • Hybrid OpenMP/MPI Job

  • RM-shared partition

  • GPU partition

  • GPU-shared partition

  • Bundle single-core jobs

  • Bundle multi-core jobs

 

Sample batch script for OpenMP job

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM
#SBATCH --ntasks-per-node 28
#SBATCH -t 5:00:00

# echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# copy input file from your pylon2 directory to the working directory
cp /pylon2/groupname/username/input.data .

# run OpenMP program
export OMP_NUM_THREADS=28
./myopenmp

# copy output file to persistent space
cp output.data /pylon2/groupname/username

Notes:

        The --ntasks-per-node option indicates that you will use all 28 cores.

For username and groupname you must substitute your username and your appropriate group.

 

Sample batch script for MPI job

#!/bin/bash
#SBATCH -p RM
#SBATCH -t 5:00:00
#SBATCH -N 2
#SBATCH --ntasks-per-node 28
# echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# copy input files to LOCAL file storage
srun -N $SLURM_NNODES --ntasks-per-node=1 \
  sh -c 'cp /pylon2/groupname/username/input.${SLURM_PROCID} $LOCAL'

# run MPI program
mpirun -np $SLURM_NTASKS ./mympi

# copy output files to persistent space
srun -N $SLURM_NNODES --ntasks-per-node=1 \
  sh -c 'cp $LOCAL/output.* /pylon2/groupname/username'

Notes:

The variable $SLURM_NTASKS gives the total number of cores requested in a job. In this example $SLURM_NTASKS will be 56, because the -N option requested 2 nodes and the --ntasks-per-node option requested all 28 cores on each node.

The srun commands are used to copy files between pylon2 and the $LOCAL file systems on each of your nodes.

The first srun command assumes you have two files named input.0 and input.1 in your pylon2 file space. It will copy input.0 and input.1 to, respectively, the $LOCAL file systems on the first and second nodes allocated to your job.

The second srun command will copy files named output.* back from your $LOCAL file systems to your pylon2 file space before your job ends. In this command '*' functions as the usual Unix wildcard.

For username and groupname you must substitute your username and your appropriate group.

Sample batch script for hybrid OpenMP/MPI job

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=14
#SBATCH --time=00:10:00
#SBATCH --job-name=hybrid
cd $SLURM_SUBMIT_DIR
mpiifort -xHOST -O3 -qopenmp -mt_mpi hello_hybrid.f90 -o hello_hybrid.exe
mpirun -print-rank-map -n $SLURM_NTASKS -genv \
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK -genv I_MPI_PIN_DOMAIN=omp \
./hello_hybrid.exe

Notes:

   This example asks for 2 nodes, 4 MPI tasks and 14 OpenMP threads per MPI task.

 

Sample batch script for RM-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node 2
#echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# copy input file from your pylon2 space to the working directory
cp /pylon2/groupname/username/input.data .

# run OpenMP program
export OMP_NUM_THREADS=2
./myopenmp

# copy output file to persistent space
cp output.data /pylon2/groupname/username

Notes:

When using the RM-shared partition the number of nodes requested with the -N option must always be 1. The --ntasks-per-node option indicates how many cores you want.

For username and groupname you must substitute your username and your appropriate group.

 

 

Sample batch script for GPU partition

#!/bin/bash
#SBATCH -N 2
#SBATCH -p GPU
#SBATCH --ntasks-per-node 28
#SBATCH -t 5:00:00
#SBATCH --gres=gpu:p100:2
#echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# copy the input file from your pylon2 space to the working directory
cp /pylon2/groupname/username/input.data .

# run GPU program
./mygpu

# copy output file to persistent storage
cp output.data /pylon2/groupname/username

Notes:

The value of the --gres=gpu option indicates the type and number of GPUs you want.

For username and groupname you must substitute your username and your appropriate group.

 

Sample batch script for GPU-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --ntasks-per-node 7
#SBATCH --gres=gpu:p100:1
#SBATCH -t 5:00:00

# echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# copy input file to working directory
cp /pylon2/groupname/username/input.data .

# run GPU program
./mygpu

# copy output file to persistent storage
cp output.data /pylon2/groupname/username

Notes:

The option --gres=gpu indicates the number and type of GPUs you want.

For username and groupname you must substitute your username and your appropriate group.

 

Sample batch script for bundling single-core jobs

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 05:00:00
#SBATCH --ntasks-per-node 14
 
echo SLURM NTASKS: $SLURM_NTASKS
i=0
while [ $i -lt $SLURM_NTASKS ]
do
numactl -C +$i ./run.sh &
let i=i+1
done
wait # IMPORTANT: wait for all to finish or get killed  

Notes:

Bundling or packing multiple jobs in a single job can improve your turnaround and improve the performance of the SLURM scheduler.

Sample batch script for bundling multi-core jobs

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 05:00:00
#SBATCH --ntasks-per-node 14
#SBATCH --cpus-per-task 2

echo SLURM NTASKS: $SLURM_NTASKS
i=0
while [ $i -lt $SLURM_NTASKS ]
do
numactl -C +$i ./run.sh &
let i=i+1
done
wait # IMPORTANT: wait for all to finish or get killed

Notes:

Bundling or packing multiple jobs in a single job can improve your turnaround and improve the performance of the SLURM scheduler.

 

The sbatch command

To submit a batch job,  use the sbatch command.  The format is

sbatch -options batch-script

The options to sbatch can either be in your batch script or on your sbatch command line.  Options in the command line override those in the batch script.

Note: in some cases, the options for sbatch differ from the options for interact or srun.

 

Examples of the sbatch command 

RM partition

An example of a sbatch command to submit a job to the RM partition is

sbatch -p RM -t 5:00:00 -N 1 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

-N is the number of nodes requested

myscript.job is the name of your batch script

LM partition

Jobs submitted to the LM partition must request the amount of memory they need rather than the number of cores. Each core on the 3TB and 12TB nodes is associated with a fixed amount of memory, so the amount of memory you request determines the number of cores assigned to your job. The environment variable SLURM_NTASKS tells you the number of cores assigned to your job. Since there is no default memory value you must always include the --mem option for the LM partition.

A sample sbatch command for the LM partition is:

sbatch -p LM -t 10:00:00 --mem=2000GB myscript.job

where:

-p indicates the intended partition (LM)

-t is the walltime requested in the format HH:MM:SS

--mem is the amount of memory requested

myscript.job is the name of your batch script

Jobs in the LM partition do share nodes. They cannot span nodes. Your memory space for an LM job is an integrated, shared memory space.
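
As an illustration, a minimal LM batch script might look like the following sketch. The memory request, walltime, directory and program name (mybigmem) are placeholders to adapt to your own job.

#!/bin/bash
#SBATCH -p LM
#SBATCH -t 10:00:00
#SBATCH --mem=2000GB

# echo commands to stdout
set -x

# move to working directory
cd /pylon5/groupname/username

# run the program; SLURM_NTASKS holds the number of cores allocated
# for this memory request (1 core per 48GB requested)
echo SLURM NTASKS: $SLURM_NTASKS
./mybigmem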

 

Useful sbatch options

For more information about these options and other useful sbatch options, see the sbatch man page.

-p partition Partition requested. Defaults to the RM partition.
-t HH:MM:SS Walltime requested in HH:MM:SS
-N n Number of nodes requested.
-A groupname Group to charge the job to. If not specified, your default group is charged.  Find your default group
--res reservation-name Use the reservation that has been set up for you.  Use of --res does not automatically set any other options. You still need to specify the other options (partition, walltime, number of nodes) that you would in any sbatch command. Note the "--" for this option.
--mem=nGB Memory in GB. Note the "--" for this option. This option should only be used for the LM partition.
--gres=gpu:type:n Specifies the type and number of GPUs requested. 'type' is either p100 or k80. The default is k80.

'n' is the number of requested GPUs. Valid choices are 1-4, when type is k80  and 1-2 when type is p100.

Note the "--" for this option.

--ntasks-per-node=n Request n cores be allocated per node. Note the "--" for this option.
--mail-type=type Send email when job events occur, where type can be BEGIN, END, FAIL or ALL.
--mail-user=user User to send email to as specified by --mail-type. Default is the user who submits the job.
-d=dependency-list Set up dependencies between jobs, where dependency-list can be:
after:job_id[:jobid...]
This job can begin execution after the specified jobs have begun execution.
afterany:job_id[:jobid...]
This job can begin execution after the specified jobs have terminated.
aftercorr:job_id[:jobid...]
A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).
afternotok:job_id[:jobid...]
This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
afterok:job_id[:jobid...]
This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
singleton
This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.
--no-requeue Specifies that your job will not be requeued under any circumstances. If your job is running on a node that fails it will not be restarted. Note the "--" for this option.
--time-min=HH:MM:SS Specifies a minimum walltime for your job in HH:MM:SS format.

SLURM considers the walltime requested when deciding which job to start next. Free slots on the machine are defined by the number of nodes and how long those nodes are free until they will be needed by another job. By specifying a minimum walltime you allow the scheduler to reduce your walltime request to your specified minimum time when deciding whether to schedule your job. This could allow your job to start sooner.

If you use this option your actual walltime assignment can vary between your minimum time and the time you specified with the -t option. If your job hits its actual walltime limit, it will be killed. When you use this option you should checkpoint your job frequently to save the results obtained to that point.

--switches=1
--switches=1@HH:MM:SS
Requests that the nodes your job runs on all be on one switch, which is a hardware grouping of 42 nodes. If you are asking for more than 1 and fewer than 42 nodes, your job will run more efficiently if it runs on one switch.  Normally switches are shared across jobs, so using the switches option means your job may wait longer in the queue before it starts.

The optional time parameter gives a maximum time that your job will wait for a switch to be available. If it has waited this maximum time, the request for your job to be run on a switch will be cancelled.

-C constraints

Specifies features which the nodes allocated to this job must have. Some examples are:

-C LM
Ensures that a job in the LM partition uses only the 3TB nodes. This option is required for any jobs in the LM partition which use /pylon5.
-C PH1
Ensures that the job will run on LM nodes which have 16 cores and 48GB/core
-C PH2
Ensures that the job will run on LM nodes which have 20 cores and 38.5GB/core

Multiple constraints can be specified with AND, OR, etc. For example, -C "LM&PH2" constrains the nodes to 3TB nodes with 20 cores and 38.5GB/core. See the sbatch man page for further details.

-h Help, lists all the available command options
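
As an illustration, several of these options can be combined on one sbatch command line. In the example below, myscript.job and the jobid 12345 are placeholders; the job will start only after job 12345 completes successfully and will send email when it ends.

sbatch -p RM -N 2 -t 8:00:00 --mail-type=END -d afterok:12345 myscript.job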

 

Other SLURM commands

 

sinfo

The sinfo command displays information about the state of Bridges' nodes. The nodes can have several states:

alloc Allocated to a job
down Down
drain Not available for scheduling
idle Free
resv Reserved

 

squeue

The squeue command displays information about the jobs in the partitions. Some useful options are:

-j jobid Displays the information for the specified jobid
-u username Restricts information to jobs belonging to the specified username
-p partition Restricts information to the specified partition
-l (long) Displays information including:  time requested, time used, number of requested nodes, the nodes on which a job is running, job state and the reason why a job is waiting to run.

 See the man page for squeue for more options, for a discussion of the codes for job state and for why a job is waiting to run.
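
For example, to see long-format information about your own jobs in the RM partition, substituting your username, you could run:

squeue -u username -p RM -l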

 

scancel

The scancel command is used to kill a job in a partition, whether it is running or still waiting to run.  Specify the jobid for the job you want to kill.  For example,

scancel 12345

kills job # 12345.

 

sacct

The sacct command can be used to display detailed information about jobs. It is especially useful in investigating why one of your jobs failed. The general format of the command is

    sacct -X -j jjjjjj -S MMDDYY --format parameter1,parameter2, ...

 

For 'jjjjjj' substitute the jobid of the job you are investigating. The date given for the -S option is the date at which sacct begins searching for information about your job.

The --format option determines what information to display about a job. Useful parameters are JobID, Partition, Account, ExitCode, State, Start, End, Elapsed, NodeList, NNodes, MaxRSS and AllocCPUs. The ExitCode and State parameters are especially useful in determining why a job failed. NNodes displays how many nodes your job used, while AllocCPUs displays how many cores your job used. MaxRSS displays how much memory your job used. The commas between the parameters in the --format option cannot be followed by spaces.
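
For example, to check the state and exit code of a job, you could run a command like the one below; the jobid 123456 and the start date are placeholders.

sacct -X -j 123456 -S 010119 --format JobID,Partition,State,ExitCode,Start,Elapsed,NodeList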

See the man page for sacct for more information about the sacct command.

 

More help

There are man pages for all the SLURM commands. SLURM also has extensive online documentation.

 
