Opteron Cluster Upgrade
The Opteron cluster is being upgraded. This document explains the upgrades, what changes you will see and how your jobs must change. Please note that this document will change as the upgrade progresses; please check it often for the latest information.
If you have further questions, please contact PSC User Services.
Upgrades include:
All nodes will be connected via a Quadrics (elan4) network. This will improve the speed of internode communication.
Scratch space on each node (/local) will increase to 150 Gbytes.
The scheduling system will move from PBS to SLURM (Simple Linux Utility for Resource Management).
Many things, however, will remain the same. Your username and account will not change. All your files will remain. All software currently installed on the cluster will be available after the upgrade. You will want to recompile your codes to take advantage of the upgraded compilers and the Quadrics interconnect.
Topics addressed in this document are:
- SLURM Commands
- PBS-SLURM Equivalents
- Modules
- Compilers
- MPICH library versions
- Examples
SLURM Commands
SLURM commands are used to submit jobs, run executables and scripts, and manage compute jobs. A table showing PBS commands and their corresponding SLURM commands is given below.
SLURM provides the following commands:
- srun, queue and execute jobs
- scancel, signal or cancel jobs
- sinfo, view information about nodes or partitions
- smap, graphically view information about jobs and partitions and set configuration parameters
- squeue, view jobs in a queue
- scontrol, view and modify configuration and state
Man pages for all SLURM commands are available.
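As a quick sanity check after the upgrade, the sketch below reports which of the commands listed above are on your PATH. The function name check_slurm_cmds is made up for this example. It relies only on the standard shell builtin command -v, so it also runs harmlessly on a machine without SLURM.

```shell
# Hypothetical helper: report which SLURM commands are visible in PATH.
check_slurm_cmds() {
  for cmd in srun scancel sinfo smap squeue scontrol; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd: found"
    else
      echo "$cmd: not found"
    fi
  done
}

check_slurm_cmds
```

If a command is reported as not found on the cluster, check that the slurm module is loaded (see Modules).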
PBS-SLURM Equivalents
The charts below give PBS commands, qsub options, and their SLURM equivalents.
| PBS | SLURM |
|---|---|
| qsub | srun -b |
| qsub -I | srun |
| qdel | scancel |
| qstat -q | squeue |
| qstat -u | squeue |
| qalter | scontrol |
qsub-srun translation
This chart gives the srun option to use to mimic each qsub option. For more information, see the srun options example or type srun --help.
| Usage | qsub option | srun option | Notes |
|---|---|---|---|
| Specify queue | -q | -p | |
| Specify number of nodes and processors | -lnodes=num_nodes:ppn=procs_per_node | -N num_nodes or --nodes=num_nodes, together with -n num_processes | num_nodes is the minimum number of nodes to use. The scheduler may launch the job on more than num_nodes. A maximum node count may be given as --nodes=min-max. -n specifies the total number of processors to use, not processors per node. |
| Specify STDOUT, STDERR | -o, -e | -o, -e | If only -o is given, STDERR goes to the same destination. See the man page for other options. |
| Combine STDOUT and STDERR | -j | none | See -o, -e above. |
| Specify dependency | -W depend=afterany:jobid | -P jobid | Other flavors of dependency ("afterok", etc., and any "before") are not available. |
| Specify a name for the job | -N jobname | -J jobname | |
| Send mail when certain job events occur | -m a, -m b, -m e, -m abe | --mail-type=FAIL, --mail-type=BEGIN, --mail-type=END, --mail-type=ALL (respectively) | |
| User to receive mail about job events | -M userlist | --mail-user=user | |
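The point most likely to trip up a converted job is the processor count: qsub takes processors per node, while srun -n takes the total. The sketch below works through that arithmetic for one request. The function name qsub_to_srun and the job name cpi_test are made up for illustration; the partition name and the options come from the chart above. It only prints the srun line and does not submit anything.

```shell
# Hypothetical helper: print (do not run) the srun line for a qsub-style request.
# usage: qsub_to_srun QUEUE NODES PROCS_PER_NODE JOBNAME COMMAND [ARGS...]
qsub_to_srun() {
  queue=$1
  nodes=$2
  ppn=$3
  name=$4
  shift 4
  # qsub counts processors per node; srun -n counts processes in total.
  ntasks=$(( nodes * ppn ))
  echo "srun -p $queue -N $nodes -n $ntasks -J $name $*"
}

# Equivalent of: qsub -q biomed -lnodes=2:ppn=2 -N cpi_test
qsub_to_srun biomed 2 2 cpi_test ./cpi
```

For a batch submission rather than an interactive run, add -b to the printed command.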
Modules
The module functions will work the same as on codon and bioinformatics.
- For csh/tcsh shells, edit your .login to load your preferred modules.
- For bash/sh shells, edit your .profile to load your preferred modules.
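For bash/sh users, the lines added to .profile might look like the sketch below. The module name is the one used in the MPICH example later in this document and is only a placeholder for your own choice; the function name load_preferred_modules is made up. The guard is there so that the same .profile does no harm on a machine where the module command is not set up.

```shell
# Sketch for ~/.profile (bash/sh only; csh/tcsh users need csh syntax in .login).
load_preferred_modules() {
  # 'module' is normally defined as a shell function by the modules package.
  if type module >/dev/null 2>&1; then
    # Placeholder module name; replace with the modules you prefer.
    module load mpich-elan/1.24-47-gnu || echo "module load failed" >&2
  fi
}

load_preferred_modules
```

After logging in again, module list should show the modules that were loaded.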
Compilers
Several versions of the GNU, Intel, and PGI compilers are installed. Refer to the Opteron cluster document for a list of the installed compilers and information about their use.
MPICH library versions
Several versions of the MPICH MPI library are now installed. Refer to the Opteron cluster document for a table of the MPICH library versions and the compilers to use with each.
Examples
Show node states
[user@vader GCC]$ scontrol show node
NodeName=op01 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
Weight=1 Features=(null) Reason=(null)
NodeName=op02 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
Weight=1 Features=(null) Reason=(null)
NodeName=op03 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
Weight=1 Features=(null) Reason=(null)
Show partitions and nodes
[user@vader GCC]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
biomed* up infinite 3 idle op[01-03]
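Before submitting a job it can be handy to know how many nodes are free. The sketch below adds up the NODES column for every row whose STATE is idle. It assumes the default column layout shown in the sinfo output above (NODES in column 4, STATE in column 5); the function name count_idle_nodes is made up. The sample text is fed in by hand so the block runs without a cluster.

```shell
# Hypothetical helper: sum the NODES column of sinfo rows in state "idle".
# Reads sinfo-style output on stdin; assumes the default column order.
count_idle_nodes() {
  awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Sample input copied from the example above.
printf '%s\n' \
  'PARTITION AVAIL TIMELIMIT NODES STATE NODELIST' \
  'biomed* up infinite 3 idle op[01-03]' | count_idle_nodes
```

On the cluster itself the equivalent call would be sinfo | count_idle_nodes.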
List nodes and partitions that are down
[user@vader ~]$ sinfo -R
REASON NODELIST
Not responding oss[07-08]
[user@vader ~]$
Show sinfo options
[user@vader GCC]$ sinfo --help
Usage: sinfo [OPTIONS]
-a, --all show all partitions (including hidden and those
not accessible)
-b, --bg show bgblocks (on Blue Gene systems)
-d, --dead show only non-responding nodes
-e, --exact group nodes only on exact match of configuration
-h, --noheader no headers on output
-hide do not show hidden or non-accessible partitions
-i, --iterate=seconds specify an interation period
-l, --long long output - displays more information
-n, --nodes=NODES report on specific node(s)
-N, --Node Node-centric format
-o, --format=format format specification
-p, --partition=PARTITION report on specific partition
-r, --responding report only responding nodes
-R, --list-reasons list reason nodes are down or drained
-s, --summarize report state summary only
-S, --sort=fields comma seperated list of fields to sort on
-t, --states=node_state specify the what states of nodes to view
-v, --verbose verbosity level
-V, --version output version information and exit
Help options:
--help show this help message
--usage display brief usage message
Show srun options
[user@vader GCC]$ srun --help
Usage: srun [OPTIONS...] executable [args...]
Parallel run options:
-n, --ntasks=ntasks number of tasks to run
-N, --nodes=N number of nodes on which to run (N = min[-max])
-c, --cpus-per-task=ncpus number of cpus required per task
-i, --input=in location of stdin redirection
-o, --output=out location of stdout redirection
-e, --error=err location of stderr redirection
-r, --relative=n run job step relative to node n of allocation
-p, --partition=partition partition requested
-H, --hold submit job in held state
-t, --time=minutes time limit
-D, --chdir=path change remote current working directory
-I, --immediate exit if resources are not immediately available
-O, --overcommit overcommit resources
-k, --no-kill do not kill job on node failure
-K, --kill-on-bad-exit kill the job if any task terminates with a non-zero exit code
-s, --share share nodes with other jobs
-l, --label prepend task number to lines of stdout/err
-u, --unbuffered do not line-buffer stdout/err
-m, --distribution=type distribution method for processes to nodes (type = block|cyclic|hostfile)
-J, --job-name=jobname name of job
--jobid=id run under already allocated job
--mpi=type type of MPI being used
-b, --batch submit as batch job for later execution
-T, --threads=threads set srun launch fanout
-W, --wait=sec seconds to wait after first task exits before killing job
-q, --quit-on-interrupt quit on single Ctrl-C
-X, --disable-status Disable Ctrl-C status feature
-v, --verbose verbose mode (multiple -v's increase verbosity)
-Q, --quiet quiet mode (suppress informational messages)
-d, --slurmd-debug=level slurmd debug level
--core=type change default corefile format type (type="list" to list of valid formats)
-P, --dependency=jobid defer job until specified jobid completes
--nice[=value] decrease secheduling priority by value
-U, --account=name charge job to specified account
--propagate[=rlimits] propagate all [or specific list of] rlimits
--mpi=type specifies version of MPI to use
--prolog=program run "program" before launching job step
--epilog=program run "program" after launching job step
--task-prolog=program run "program" before launching task
--task-epilog=program run "program" after launching task
--begin=time defer job until HH:MM DD/MM/YY
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state changes
Allocate only:
-A, --allocate allocate resources and spawn a shell
--no-shell don't spawn shell in allocate mode
Attach to running job:
-a, --attach=jobid attach to running job with specified id
-j, --join when used with --attach, allow forwarding of signals and stdin.
Constraint options:
--mincpus=n minimum number of cpus per node
--mem=MB minimum amount of real memory
--tmp=MB minimum amount of temporary disk
--contiguous demand a contiguous range of nodes
-C, --constraint=list specify a list of constraints
-w, --nodelist=hosts... request a specific list of hosts
-x, --exclude=hosts... exclude a specific list of hosts
-Z, --no-allocate don't allocate nodes (must supply -w)
Consumable resources related options:
--exclusive allocate nodes in exclusive mode when cpu consumable resource is enabled
Affinity/Multi-core options: (when the task/affinity plugin is enabled)
--cpu_bind= Bind tasks to CPUs
q[uiet], quietly bind before task runs (default)
v[erbose], verbosely report binding before task runs
no[ne] don't bind tasks to CPUs (default)
rank bind by task rank
map_cpu: bind by mapping CPU IDs to tasks as specified where is , ,...
mask_cpu: bind by setting CPU masks on tasks as specified where is , ,...
Help options:
--help show this help message
--usage display brief usage message
Other options:
-V, --version output version information and exit
Run a job on one processor
[user@vader GCC]$ srun -n1 cpi
Process 0 on op01.biomed.net
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000081
Run a job on two processors
[user@vader GCC]$ srun -n2 cpi
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.000091
Process 0 on op01.biomed.net
Process 1 on op01.biomed.net
Run a job on 4 processors
[user@vader GCC]$ srun -n4 cpi
Process 0 on op01.biomed.net
Process 1 on op01.biomed.net
Process 2 on op03.biomed.net
Process 3 on op03.biomed.net
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000302
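Because the output lines of a parallel run arrive in no fixed order, it can be easier to check task placement with a small filter. The sketch below counts the "Process N on host" lines per host. It assumes the cpi output format shown above; the function name tasks_per_host is made up. The sample lines are copied from the 4-processor run.

```shell
# Hypothetical helper: count "Process N on HOST" lines per host.
tasks_per_host() {
  awk '$1 == "Process" && $3 == "on" { count[$4]++ }
       END { for (h in count) print h, count[h] }' | sort
}

# Sample input copied from the 4-processor run above.
printf '%s\n' \
  'Process 0 on op01.biomed.net' \
  'Process 1 on op01.biomed.net' \
  'Process 2 on op03.biomed.net' \
  'Process 3 on op03.biomed.net' \
  'wall clock time = 0.000302' | tasks_per_host
```

On the cluster the equivalent call would be srun -n4 cpi | tasks_per_host.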
Run a job on two nodes
[user@vader GCC]$ srun -N2 cpi
Process 0 on op01.biomed.net
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
Process 1 on op03.biomed.net
wall clock time = 0.000194
Run a job on a chosen partition
[user@vader ~]$ srun -p biomed -n6 hostname
op02.biomed.net
op02.biomed.net
op03.biomed.net
op03.biomed.net
op01.biomed.net
op01.biomed.net
[user@vader ~]$
Show running jobs
[user@vader GCC]$ scontrol show jobs
JobId=469 UserId=nigra(140) GroupId=users(100)
Name=hostname
Priority=4294901757 Partition=debug BatchFlag=0
AllocNode:Sid=vader:3663 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:55:17 EndTime=01/13-14:55:17
NodeList=op[01,03] NodeListIndices=-1
ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
JobId=471 UserId=user(20704) GroupId=users(100)
Name=cpi
Priority=4294901755 Partition=debug BatchFlag=0
AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:56:37 EndTime=01/13-14:56:37
NodeList=op01 NodeListIndices=-1
ReqProcs=1 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
JobId=472 UserId=user(20704) GroupId=users(100)
Name=cpi
Priority=4294901754 Partition=debug BatchFlag=0
AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:56:49 EndTime=01/13-14:56:49
NodeList=op01 NodeListIndices=-1
ReqProcs=1 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
JobId=473 UserId=user(20704) GroupId=users(100)
Name=cpi
Priority=4294901753 Partition=debug BatchFlag=0
AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:56:54 EndTime=01/13-14:56:54
NodeList=op01 NodeListIndices=-1
ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
JobId=474 UserId=user(20704) GroupId=users(100)
Name=cpi
Priority=4294901752 Partition=debug BatchFlag=0
AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:57:06 EndTime=01/13-14:57:06
NodeList=op[01,03] NodeListIndices=-1
ReqProcs=4 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
JobId=475 UserId=user(20704) GroupId=users(100)
Name=cpi
Priority=4294901751 Partition=debug BatchFlag=0
AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
JobState=COMPLETED StartTime=01/13-14:57:09 EndTime=01/13-14:57:09
NodeList=op[01,03] NodeListIndices=-1
ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
Dependency=0 Account=(null) Reason=None Network=(null)
ReqNodeList=(null) ReqNodeListIndices=-1
ExcNodeList=(null) ExcNodeListIndices=-1
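The full scontrol listing is long when all that is wanted is the state of each job. The sketch below reduces it to one "JobId JobState" pair per job. It assumes the key=value layout shown in the output above; the function name summarize_jobs is made up. The sample lines are an excerpt of that output.

```shell
# Hypothetical helper: reduce "scontrol show jobs" output to "JobId JobState".
summarize_jobs() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^JobId=/)    id = substr($i, 7)         # text after "JobId="
      if ($i ~ /^JobState=/) print id, substr($i, 10)   # text after "JobState="
    }
  }'
}

# Sample input: selected lines for two of the jobs listed above.
printf '%s\n' \
  'JobId=469 UserId=nigra(140) GroupId=users(100)' \
  'Name=hostname' \
  'JobState=COMPLETED StartTime=01/13-14:55:17 EndTime=01/13-14:55:17' \
  'JobId=474 UserId=user(20704) GroupId=users(100)' \
  'Name=cpi' \
  'JobState=COMPLETED StartTime=01/13-14:57:06 EndTime=01/13-14:57:06' | summarize_jobs
```

On the cluster the equivalent call would be scontrol show jobs | summarize_jobs.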
Run interactively, using a script
You can copy my.script from /etc/skel.
[user@vader ~]$ cat my.script
#!/bin/bash
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd
[user@vader ~]$ srun -N2 my.script
op01.biomed.net
op03.biomed.net
0: op01.biomed.net
1: op03.biomed.net
0: op01.biomed.net
1: op03.biomed.net
0: /home/user
1: /home/user
0: /home/user
1: /home/user
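A batch counterpart of the script run above, replacing a qsub submission, might look like the sketch below. The function name write_batch_script and the file names cpi_batch.sh and cpi_batch.out are made up for illustration. It is an assumption, not something this document confirms, that in batch mode the script runs once inside the allocation and that each srun line in it starts a job step there. The block only writes the script and prints the submission line; nothing is submitted.

```shell
# Hypothetical helper: write a small batch script to the given file name.
write_batch_script() {
  cat > "$1" <<'EOF'
#!/bin/bash
# Assumption: each srun below runs as a job step inside the batch allocation.
srun -l /bin/hostname
srun -n 4 ./cpi
EOF
  chmod +x "$1"
}

write_batch_script cpi_batch.sh

# Submission line, printed rather than executed because it needs the cluster.
# Options used: -b (batch), -N (nodes), -n (total processes), -J (name), -o (stdout).
echo "srun -b -N 2 -n 4 -J cpi_batch -o cpi_batch.out cpi_batch.sh"
```

Once submitted, the job should appear in squeue, and scancel can be used to remove it.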
Run an MPICH job
This example uses the elan-enabled GNU C compiler.
[user@vader GCC]$ module load mpich-elan/1.24-47-gnu
[user@vader GCC]$ module list
Currently Loaded Modulefiles:
1) oscar-modules/1.0.5 3) libelanhosts/0.9-1 5) mpich-elan/1.24-47-gnu
2) slurm/1.0.0-1 4) munge/0.4.3
[user@vader GCC]$ which mpirun
/usr/lib/mpi/mpi_gnu/examples/mpirun
There are generic sample test codes in /etc/skel/MPIQ_examples/. Copy them to your directory.
[user@vader ~]$ cp -ar /etc/skel/MPIQ_examples/ .
[user@vader ~]$ ls MPIQ_examples/
GCC Intel
Try out the GNU examples.
[user@vader ~]$ cd MPIQ_examples/GCC
[user@vader GCC]$ ls
cpi cpi.o elanidmap MPI-2-C++ pi3.o simpleio.c
cpi.c cpip.c hello++.cc mpirun pi3p.f
cpilog.c cpip.o Makefile pi3.f pi3p.o
[user@vader GCC]$ which mpicc
/usr/lib/mpi/mpi_gnu/bin/mpicc
[user@vader GCC]$ make cpi
/usr/lib/mpi/mpi_gnu/bin/mpicc -o cpi cpi.o -lm
[user@vader GCC]$ srun -N 2 -n 4 cpi
Process 2 on op02.biomed.net
Process 0 on op01.biomed.net
Process 3 on op02.biomed.net
Process 1 on op01.biomed.net
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000197