Opteron Cluster Upgrade

The Opteron cluster is being upgraded. This document explains the upgrades, the changes you will see, and how your jobs must change. It will be updated as the upgrade progresses, so please check it often for the latest information.

If you have further questions, please contact PSC User Services.

Upgrades include:

    All nodes will be connected via a Quadrics (elan4) network. This will improve the speed of internode communication.

    Scratch space on each node (/local) will increase to 150 Gbytes.

    The scheduling system will move from PBS to SLURM (Simple Linux Utility for Resource Management).

Much will remain the same. Your username and account will not change, all your files will remain, and all software currently installed on the cluster will be available after the upgrade. You will want to recompile your codes to take advantage of the upgraded compilers and the Quadrics interconnect.

Topics addressed in this document are:

  • SLURM Commands
  • PBS-SLURM Equivalents
  • Modules
  • Compilers
  • MPICH library versions
  • Examples

SLURM Commands

SLURM commands are used to submit jobs, run executables and scripts, and manage compute jobs. A table showing PBS commands and their corresponding SLURM commands is given below.

SLURM provides the following commands:

  • srun, queue and execute jobs
  • scancel, signal or cancel jobs
  • sinfo, view information about nodes or partitions
  • smap, graphically view information about jobs and partitions and set configuration parameters
  • squeue, view jobs in a queue
  • scontrol, view and modify configuration and state

Man pages for all SLURM commands are available.

PBS-SLURM Equivalents

The charts below give PBS commands, qsub options, and their SLURM equivalents.

PBS         SLURM
qsub        srun -b
qsub -I     srun
qkill       scancel
qstat -q    squeue
qstat -u    squeue
qalter      scontrol
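
The command mapping above can be captured in a small lookup helper, which may be handy while converting old scripts. This is only a sketch: the function name pbs_to_slurm is made up for illustration, and it knows only the pairs listed in the chart.

```shell
#!/bin/sh
# pbs_to_slurm: print the SLURM equivalent of a PBS command.
# Hypothetical helper; it covers only the pairs listed in the chart above.
pbs_to_slurm() {
    case "$*" in
        "qsub -I")             echo "srun" ;;
        "qsub")                echo "srun -b" ;;
        "qkill")               echo "scancel" ;;
        "qstat -q"|"qstat -u") echo "squeue" ;;
        "qalter")              echo "scontrol" ;;
        *)
            echo "no equivalent listed for: $*" >&2
            return 1
            ;;
    esac
}

pbs_to_slurm qsub -I     # prints: srun
pbs_to_slurm qstat -q    # prints: squeue
```

Anything not in the chart produces an error message and a non-zero return code rather than a guess.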

qsub-srun translation

This chart gives the srun command to use to mimic a qsub function. For more information, see the srun options example or type srun --help.

Usage                          qsub option                             srun option
Specify queue                  -q                                      -p
Specify number of nodes and    -lnodes=num_nodes:ppn=procs_per_node    -N num_nodes or --nodes=num_nodes
processors                                                             -n num_processes
    num_nodes is the minimum number of nodes to use. The scheduler may launch the job on more than num_nodes nodes. A maximum node count may be given as --nodes=min_nodes-max_nodes.
    -n specifies the total number of processors to use, not processors per node.
Specify STDOUT, STDERR         -o, -e                                  -o, -e
    If only -o is given, -e defaults to the same location. See the man page for other options.
Combine STDOUT and STDERR      -j                                      none
    See -o, -e above.
Specify dependency             -W depend=afterany:jobid                -P jobid
    Other flavors of dependency ("afterok", etc., and any "before") are not available.
Specify a name for the job     -N jobname                              -J jobname
Send mail when certain job     -m a                                    --mail-type=FAIL
events occur                   -m b                                    --mail-type=BEGIN
                               -m e                                    --mail-type=END
                               -m abe                                  --mail-type=ALL
User to receive mail about     -M userlist                             --mail-user=user
job events
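
The node and mail rows of the chart involve a little arithmetic and a lookup, which the sketch below spells out. The function names nodes_to_srun and mail_to_srun are hypothetical. The key point is that qsub's ppn is per node while srun's -n is a total, so the two are related by a multiplication.

```shell
#!/bin/sh
# nodes_to_srun: translate a PBS resource string "nodes=N:ppn=P" into srun
# options. Hypothetical helper; -N is the node count and -n is the TOTAL
# number of processes (N * P), as described in the chart above.
nodes_to_srun() {
    spec=$1
    nodes=$(echo "$spec" | sed -n 's/^nodes=\([0-9][0-9]*\):ppn=[0-9][0-9]*$/\1/p')
    ppn=$(echo "$spec" | sed -n 's/^nodes=[0-9][0-9]*:ppn=\([0-9][0-9]*\)$/\1/p')
    if [ -z "$nodes" ] || [ -z "$ppn" ]; then
        echo "expected nodes=N:ppn=P, got: $spec" >&2
        return 1
    fi
    echo "-N $nodes -n $((nodes * ppn))"
}

# mail_to_srun: translate a qsub -m argument into the srun --mail-type option.
# Only the four values listed in the chart are handled.
mail_to_srun() {
    case "$1" in
        a)   echo "--mail-type=FAIL" ;;
        b)   echo "--mail-type=BEGIN" ;;
        e)   echo "--mail-type=END" ;;
        abe) echo "--mail-type=ALL" ;;
        *)
            echo "unsupported -m value: $1" >&2
            return 1
            ;;
    esac
}

nodes_to_srun nodes=2:ppn=2   # prints: -N 2 -n 4
mail_to_srun abe              # prints: --mail-type=ALL
```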

Modules

The module functions will work the same as on codon and bioinformatics.

For csh/tcsh shells, edit your .login to load your preferred modules.
For bash/sh shells, edit your .profile to load your preferred modules.
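
For bash/sh users, the lines added to .profile might look like the sketch below. The module name is the one used in the MPICH example in this document; substitute your own list. The variable PREFERRED_MODULES, the function name, and the check that the module command is defined are all additions for illustration, not site requirements.

```shell
# Sketch of lines for ~/.profile (bash/sh).
# PREFERRED_MODULES is a hypothetical variable name; list your modules here,
# separated by spaces.
PREFERRED_MODULES="mpich-elan/1.24-47-gnu"

load_preferred_modules() {
    # Skip quietly when the module command is not defined in this shell.
    if ! type module >/dev/null 2>&1; then
        return 0
    fi
    for m in $PREFERRED_MODULES; do
        module load "$m"
    done
}

load_preferred_modules
```

After logging in again, module list should show the modules you named.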

Compilers

Several versions of the GNU, Intel, and PGI compilers are installed. Refer to the Opteron cluster document for a list of the installed compilers and information about their use.

MPICH library versions

Several versions of the MPICH MPI library are now installed. Refer to the Opteron cluster document for a table of the MPICH library versions and the compilers to use with each.

Examples

Show node states

[user@vader GCC]$ scontrol show node
NodeName=op01 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
   Weight=1 Features=(null) Reason=(null)
NodeName=op02 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
   Weight=1 Features=(null) Reason=(null)
NodeName=op03 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=11805
   Weight=1 Features=(null) Reason=(null)

Show partitions and nodes

[user@vader GCC]$ sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
biomed*      up   infinite     3   idle op[01-03]
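
Because sinfo prints plain columns, its output is easy to summarize in a script. The sketch below counts idle nodes; the function name count_idle is hypothetical, and it assumes the default column order shown above.

```shell
#!/bin/sh
# count_idle: sum the NODES column over sinfo lines whose STATE is "idle".
# Reads sinfo's default output on stdin and assumes the column order
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
count_idle() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Sample input copied from the sinfo output above.
# On the cluster you would run: sinfo | count_idle
count_idle <<'EOF'
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
biomed*      up   infinite     3   idle op[01-03]
EOF
# prints: 3
```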

List nodes and partitions that are down

[user@vader ~]$ sinfo -R
 REASON                              NODELIST
 Not responding                      oss[07-08]
[user@vader ~]$

Show sinfo options

[user@vader GCC]$ sinfo --help
Usage: sinfo [OPTIONS]
  -a, --all                  show all partitions (including hidden and those
                             not accessible)
  -b, --bg                   show bgblocks (on Blue Gene systems)
  -d, --dead                 show only non-responding nodes
  -e, --exact                group nodes only on exact match of configuration
  -h, --noheader             no headers on output
      --hide                 do not show hidden or non-accessible partitions
  -i, --iterate=seconds      specify an interation period
  -l, --long                 long output - displays more information
  -n, --nodes=NODES          report on specific node(s)
  -N, --Node                 Node-centric format
  -o, --format=format        format specification
  -p, --partition=PARTITION  report on specific partition
  -r, --responding           report only responding nodes
  -R, --list-reasons         list reason nodes are down or drained
  -s, --summarize            report state summary only
  -S, --sort=fields          comma seperated list of fields to sort on
  -t, --states=node_state    specify the what states of nodes to view
  -v, --verbose              verbosity level
  -V, --version              output version information and exit

Help options:
  --help                     show this help message
  --usage                    display brief usage message

Show srun options

[user@vader GCC]$ srun --help
Usage: srun [OPTIONS...] executable [args...]

Parallel run options:
  -n, --ntasks=ntasks         number of tasks to run
  -N, --nodes=N               number of nodes on which to run (N = min[-max])
  -c, --cpus-per-task=ncpus   number of cpus required per task
  -i, --input=in              location of stdin redirection
  -o, --output=out            location of stdout redirection
  -e, --error=err             location of stderr redirection
  -r, --relative=n            run job step relative to node n of allocation
  -p, --partition=partition   partition requested
  -H, --hold                  submit job in held state
  -t, --time=minutes          time limit
  -D, --chdir=path            change remote current working directory
  -I, --immediate             exit if resources are not immediately available
  -O, --overcommit            overcommit resources
  -k, --no-kill               do not kill job on node failure
  -K, --kill-on-bad-exit      kill the job if any task terminates with a
                              non-zero exit code
  -s, --share                 share nodes with other jobs
  -l, --label                 prepend task number to lines of stdout/err
  -u, --unbuffered            do not line-buffer stdout/err
  -m, --distribution=type     distribution method for processes to nodes
                              (type = block|cyclic|hostfile)
  -J, --job-name=jobname      name of job
      --jobid=id              run under already allocated job
      --mpi=type              type of MPI being used
  -b, --batch                 submit as batch job for later execution
  -T, --threads=threads       set srun launch fanout
  -W, --wait=sec              seconds to wait after first task exits
                              before killing job
  -q, --quit-on-interrupt     quit on single Ctrl-C
  -X, --disable-status        Disable Ctrl-C status feature
  -v, --verbose               verbose mode (multiple -v's increase verbosity)
  -Q, --quiet                 quiet mode (suppress informational messages)
  -d, --slurmd-debug=level    slurmd debug level
      --core=type             change default corefile format type
                              (type="list" to list of valid formats)
  -P, --dependency=jobid      defer job until specified jobid completes
      --nice[=value]          decrease secheduling priority by value
  -U, --account=name          charge job to specified account
      --propagate[=rlimits]   propagate all [or specific list of] rlimits
      --mpi=type              specifies version of MPI to use
      --prolog=program        run "program" before launching job step
      --epilog=program        run "program" after launching job step
      --task-prolog=program   run "program" before launching task
      --task-epilog=program   run "program" after launching task
      --begin=time            defer job until HH:MM DD/MM/YY
      --mail-type=type        notify on state change: BEGIN, END, FAIL or ALL
      --mail-user=user        who to send email notification for job state changes

Allocate only:
  -A, --allocate              allocate resources and spawn a shell
      --no-shell              don't spawn shell in allocate mode

Attach to running job:
  -a, --attach=jobid          attach to running job with specified id
  -j, --join                  when used with --attach, allow forwarding of
                              signals and stdin.

Constraint options:
      --mincpus=n             minimum number of cpus per node
      --mem=MB                minimum amount of real memory
      --tmp=MB                minimum amount of temporary disk
      --contiguous            demand a contiguous range of nodes
  -C, --constraint=list       specify a list of constraints
  -w, --nodelist=hosts...     request a specific list of hosts
  -x, --exclude=hosts...      exclude a specific list of hosts
  -Z, --no-allocate           don't allocate nodes (must supply -w)

Consumable resources related options:
      --exclusive             allocate nodes in exclusive mode when
                              cpu consumable resource is enabled

Affinity/Multi-core options: (when the task/affinity plugin is enabled)
      --cpu_bind=             Bind tasks to CPUs
             q[uiet],           quietly bind before task runs (default)
             v[erbose],         verbosely report binding before task runs
             no[ne]             don't bind tasks to CPUs (default)
             rank               bind by task rank
             map_cpu:<list>     bind by mapping CPU IDs to tasks as specified
                                where <list> is <cpuid1>,<cpuid2>,...<cpuidN>
             mask_cpu:<list>    bind by setting CPU masks on tasks as specified
                                where <list> is <mask1>,<mask2>,...<maskN>

Help options:
      --help                  show this help message
      --usage                 display brief usage message

Other options:
  -V, --version               output version information and exit

Run a job on one processor

[user@vader GCC]$ srun -n1 cpi
Process 0 on op01.biomed.net
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000081

Run a job on two processors

[user@vader GCC]$ srun -n2 cpi
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.000091
Process 0 on op01.biomed.net
Process 1 on op01.biomed.net

Run a job on 4 processors

[user@vader GCC]$ srun -n4 cpi
Process 0 on op01.biomed.net
Process 1 on op01.biomed.net
Process 2 on op03.biomed.net
Process 3 on op03.biomed.net
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000302
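
In the runs above, two processes fit on one node and four processes spread over two nodes, which is consistent with the two CPUs per node reported by scontrol show node. The sketch below estimates the smallest node count for a given -n value. It assumes each node is filled before another is used, which matches this output but is not something the scheduler guarantees; min_nodes is a hypothetical helper name.

```shell
#!/bin/sh
# min_nodes: smallest number of nodes that can hold NTASKS processes when
# each node has CPUS_PER_NODE CPUs (ceiling division).
# The default of 2 CPUs per node matches "CPUs=2" in the node listing shown
# in this document; pass a second argument to override it.
min_nodes() {
    ntasks=$1
    cpus_per_node=${2:-2}
    echo $(( (ntasks + cpus_per_node - 1) / cpus_per_node ))
}

min_nodes 4      # prints: 2
min_nodes 6      # prints: 3
```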

Run a job on two nodes

[user@vader GCC]$ srun -N2 cpi
Process 0 on op01.biomed.net
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
Process 1 on op03.biomed.net
wall clock time = 0.000194

Run a job on a chosen partition

[user@vader ~]$ srun -p biomed -n6 hostname
op02.biomed.net
op02.biomed.net
op03.biomed.net
op03.biomed.net
op01.biomed.net
op01.biomed.net
[user@vader ~]$

Show jobs (running and recently completed)

[user@vader GCC]$ scontrol show jobs
JobId=469 UserId=nigra(140) GroupId=users(100)
   Name=hostname
   Priority=4294901757 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:3663 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:55:17 EndTime=01/13-14:55:17
   NodeList=op[01,03] NodeListIndices=-1
   ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1

JobId=471 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901755 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:56:37 EndTime=01/13-14:56:37
   NodeList=op01 NodeListIndices=-1
   ReqProcs=1 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1

JobId=472 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901754 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:56:49 EndTime=01/13-14:56:49
   NodeList=op01 NodeListIndices=-1
   ReqProcs=1 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1

JobId=473 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901753 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:56:54 EndTime=01/13-14:56:54
   NodeList=op01 NodeListIndices=-1
   ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1

JobId=474 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901752 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:57:06 EndTime=01/13-14:57:06
   NodeList=op[01,03] NodeListIndices=-1
   ReqProcs=4 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1

JobId=475 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901751 Partition=debug BatchFlag=0
   AllocNode:Sid=vader:9204 TimeLimit=UNLIMITED
   JobState=COMPLETED StartTime=01/13-14:57:09 EndTime=01/13-14:57:09
   NodeList=op[01,03] NodeListIndices=-1
   ReqProcs=2 MinNodes=0 Shared=0 Contiguous=0
   MinProcs=0 MinMemory=0 Features=(null) MinTmpDisk=0
   Dependency=0 Account=(null) Reason=None Network=(null)
   ReqNodeList=(null) ReqNodeListIndices=-1
   ExcNodeList=(null) ExcNodeListIndices=-1
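
The full listing is long when all you need is the state of each job. The sketch below reduces it to one line per job. The function name job_states is hypothetical, and the parsing assumes the key=value layout shown above, with Name appearing before JobState in each record.

```shell
#!/bin/sh
# job_states: print "JobId Name JobState" for each record in the output of
# scontrol show jobs, read from stdin.
job_states() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "JobId")    id = kv[2]
                if (kv[1] == "Name")     name = kv[2]
                if (kv[1] == "JobState") print id, name, kv[2]
            }
        }'
}

# Sample input copied from one record of the listing above.
# On the cluster you would run: scontrol show jobs | job_states
job_states <<'EOF'
JobId=474 UserId=user(20704) GroupId=users(100)
   Name=cpi
   Priority=4294901752 Partition=debug BatchFlag=0
   JobState=COMPLETED StartTime=01/13-14:57:06 EndTime=01/13-14:57:06
EOF
# prints: 474 cpi COMPLETED
```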

Run interactively, using a script

You can copy my.script from /etc/skel.

[user@vader ~]$ cat my.script
#!/bin/bash
/bin/hostname
srun -l /bin/hostname
srun -l /bin/pwd

[user@vader ~]$ srun -N2 my.script
op01.biomed.net
op03.biomed.net
0: op01.biomed.net
1: op03.biomed.net
0: op01.biomed.net
1: op03.biomed.net
0: /home/user
1: /home/user
0: /home/user
1: /home/user
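
The same script could also be submitted as a batch job with srun -b, the qsub equivalent listed earlier. The sketch below only assembles and prints such a command line so it can be inspected before use. The helper name build_batch_cmd, the job name hosttest, and the choice of jobname.out for the output file are assumptions for illustration; the options themselves all appear in the srun --help listing.

```shell
#!/bin/sh
# build_batch_cmd: assemble an "srun -b" command line for a script.
# Hypothetical helper; it prints the command instead of running it.
#   $1 = script to run, $2 = number of nodes, $3 = job name
build_batch_cmd() {
    script=$1
    nodes=$2
    jobname=$3
    echo "srun -b -N $nodes -J $jobname -o $jobname.out $script"
}

build_batch_cmd my.script 2 hosttest
# prints: srun -b -N 2 -J hosttest -o hosttest.out my.script
```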

Run an MPICH job

This example uses the elan-enabled GNU C compiler.

[user@vader GCC]$ module load mpich-elan/1.24-47-gnu

[user@vader GCC]$ module list
Currently Loaded Modulefiles:
  1) oscar-modules/1.0.5      3) libelanhosts/0.9-1       5) mpich-elan/1.24-47-gnu
  2) slurm/1.0.0-1            4) munge/0.4.3

[user@vader GCC]$ which mpirun
/usr/lib/mpi/mpi_gnu/examples/mpirun

There are generic sample test codes in /etc/skel/MPIQ_examples/. Copy them to your directory.

[user@vader ~]$ cp -ar /etc/skel/MPIQ_examples/ .

[user@vader ~]$ ls MPIQ_examples/
GCC  Intel

Try out the GNU examples.

[user@vader ~]$ cd MPIQ_examples/GCC

[user@vader GCC]$ ls
cpi       cpi.o   elanidmap   MPI-2-C++  pi3.o   simpleio.c
cpi.c     cpip.c  hello++.cc  mpirun     pi3p.f
cpilog.c  cpip.o  Makefile    pi3.f      pi3p.o

[user@vader GCC]$ which mpicc
/usr/lib/mpi/mpi_gnu/bin/mpicc

[user@vader GCC]$ make cpi
/usr/lib/mpi/mpi_gnu/bin/mpicc -o cpi cpi.o -lm

[user@vader GCC]$ srun -N 2 -n 4 cpi
Process 2 on op02.biomed.net
Process 0 on op01.biomed.net
Process 3 on op02.biomed.net
Process 1 on op01.biomed.net
pi is approximately 3.1416009869231249, Error is 0.0000083333333318
wall clock time = 0.000197
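
The MPICH steps above can be collected into one script, sketched below. By default it only prints each command, so it is safe to read through before running anything. The names DRY_RUN, run, and mpich_demo are hypothetical; the commands themselves are the ones typed in the example.

```shell
#!/bin/sh
# Sketch of the MPICH example as one script. With DRY_RUN=1 (the default
# here) each command is printed with a leading "+" instead of being executed.
# Set DRY_RUN=0 on the cluster to actually run the steps.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

mpich_demo() {
    run module load mpich-elan/1.24-47-gnu   # select the elan-enabled GNU build
    run cp -ar /etc/skel/MPIQ_examples/ .    # copy the sample codes
    run cd MPIQ_examples/GCC
    run make cpi                             # build with mpicc via the Makefile
    run srun -N 2 -n 4 cpi                   # 4 processes on 2 nodes
}

mpich_demo
```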