The Opteron Cluster

System Architecture and Configuration

PSC's Opteron cluster consists of three functional divisions: the cluster front ends which provide infrastructure and management services for the cluster, the computational nodes, and RAID storage nodes.

The cluster's two front ends are called codon.psc.edu and bioinformatics.psc.edu. Each has dual 1.6 GHz AMD Opteron processors which share 8 Gbytes of memory. All compilations must be done on a front end.

Twenty compute nodes provide the computational power of the cluster. Each compute node has two 1.4 GHz AMD Opteron processors which share 4 Gbytes of memory. All program execution takes place on the compute nodes.

The two RAID servers provide 4 Tbytes of user and scratch space.

All nodes are connected via a Quadrics network. Communications within a node use shared memory, while communications between nodes use Quadrics.

The compute nodes are controlled by the SLURM (Simple Linux Utility for Resource Management) scheduler, for both serial and parallel jobs.

Stay Informed

As a user of the Opteron cluster, it is imperative that you stay informed of changes to the machine's environment. Important information is posted to the Center's general bboards, which can be read through various facilities on most PSC systems.

Access to the Cluster

Getting an allocation

There are two types of grants available: starter grants and production grants. Starter grants are appropriate as precursors to large requests. Production grants are large awards for users with extensive computational requirements.

See http://www.psc.edu/grants/ for further information on submitting a proposal for a grant.

Connecting to the cluster

To access the Opteron cluster, use ssh to connect to either bioinformatics.psc.edu or codon.psc.edu. You must use the ssh2 protocol.

Cluster passwords

Your cluster password is your PSC Kerberos password. It is the same on all the cluster nodes and on all PSC production systems. To change it, use the kpasswd command on the front end node; do not use the passwd command. A password changed with kpasswd on one system changes on all PSC production systems.

Accounting

Checking your usage

Accounting information for grants is also available at the PSC Grant Management System on the Web at https://grants.psc.edu/arms.

You will need your PSC Kerberos password to access this system. This system can provide more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.

File Systems

There are four file areas on the cluster.

User home directories - /usr/users/n/username
This is your home directory on the cluster, where n is a single digit. All the Opteron cluster nodes and bioinformatics.psc.edu share this NFS file system. The total space for all users is 4 Tbytes. Quotas are not currently enforced; however, we ask you to police your own file usage on this system, and we reserve the right to enforce quotas if the need arises.


/scratch
Each user has their own scratch space, which is shared among all nodes in the cluster. Scratch space is intended for temporary storage only and is not backed up.


/local
Temporary storage is available in /local on every compute node. Each node has its own /local system; they are not shared between nodes. However, on a given node, this directory is shared by all users. You can make a personal directory under /local, but be aware that /local is intended for temporary storage only and is not backed up.

HSM
The PSC Hierarchical Storage Manager is used for archival storage and is available from the Opteron cluster and also from BigBen, rachel and jonas.

The archiver runs on golem.psc.edu. The far (File ARchiver) interface includes the commands to store and retrieve files. More information on golem.psc.edu is available at http://www.psc.edu/general/filesys/far/.

We recommend that you store all of your important files in the HSM. Furthermore, far should only be used interactively, so that PEs are not tied up in a batch job waiting for file storage or retrieval.
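For the /local area described above, a personal directory can be set up at the start of a job. The following is a minimal sketch, not a site-provided script: the fallback to $TMPDIR or /tmp is an assumption added so the snippet also runs away from the cluster, and the variable names base and workdir are illustrative.

```shell
#!/bin/bash
# Prefer node-local /local; fall back to $TMPDIR or /tmp when /local is
# missing or not writable (assumption: useful when trying this elsewhere).
base=/local
if [ ! -d "$base" ] || [ ! -w "$base" ]; then
    base=${TMPDIR:-/tmp}
fi

# One private directory per user, since /local is shared by all users
# of a node. Fall back to the numeric user ID if $USER is not set.
workdir="$base/${USER:-$(id -u)}"
mkdir -p "$workdir"
chmod 700 "$workdir"
echo "Temporary work directory: $workdir"
```

Because /local is not shared between nodes and is not backed up, anything worth keeping should be copied out of this directory before the job ends.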

Transferring Files to the Cluster

Either the secure copy program, scp, or far can be used to transfer files in and out of the cluster file space.

scp

When using scp to copy files into the cluster, connect to bioinformatics.psc.edu or codon.psc.edu.

The format for scp is:

scp source-filename   target-filename

where the filename on the remote system (whether it is the target or the source) must be specified as:

username@system:filename

For example, to copy a file to the cluster when you are logged on to a different machine, type:

scp filename username@codon.psc.edu:filename

This command copies a file into your home directory.

To copy a file to the cluster from another system while logged into codon.psc.edu, type:

scp username@remote-system:filename   filename

The first time you transfer files to/from another system, you will receive a message similar to:

Host key not found from list of known hosts.  Are you sure 
you want to continue connecting (yes/no)?

Answer 'yes' to make the connection. The next time you connect to that host, you will not receive that message.

You will be prompted next for your password on the remote system. For more information on the scp command, see the scp man page.

scp is part of the ssh distribution.

far

You can also move files between the PSC HSM and the cluster. See the section on file systems for details.

Module software

The Module package provides for the dynamic modification of a user's environment via module files. Module can be used:

  • to manage multiple versions of applications, tools and libraries
  • to manage software where complex changes to the environment are necessary
  • to manage software where name conflicts with other software would cause problems

Loading a module for a particular piece of software often adds the path to the executable to $PATH, the path to the library to $LD_LIBRARY_PATH, and so on. Loading a module, then, relieves the user of having to remember or look up and type long path names.
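As a rough illustration of what loading a module does behind the scenes, the sketch below prepends directories to $PATH and $LD_LIBRARY_PATH by hand. The package name foo and the install prefix /opt/foo/1.0 are hypothetical; the actual module files on the cluster may set other variables as well.

```shell
#!/bin/bash
# Hypothetical install prefix for a package "foo"; not a real cluster path.
FOO_PREFIX=/opt/foo/1.0

# Prepend the package directories. The ${VAR:+:$VAR} form adds the ':'
# separator only when the variable already has a value, which avoids an
# empty path entry.
PATH="$FOO_PREFIX/bin${PATH:+:$PATH}"
LD_LIBRARY_PATH="$FOO_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH

echo "PATH now starts with: ${PATH%%:*}"
echo "LD_LIBRARY_PATH now starts with: ${LD_LIBRARY_PATH%%:*}"
```

Unlike this manual approach, a module can also undo its changes cleanly when it is unloaded.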

Some useful module commands are:

module avail
lists available modules
module list
lists currently loaded modules
module help foo
help on module foo
module whatis foo
brief description of module foo
module display foo
displays the changes that are made to the environment by loading module foo without actually loading it.
module load foo
load module foo
module unload foo
unloads module foo and removes all changes that it made in the environment.
module clear
unloads all modules

Compilers and Other Programming Tools

GNU, Intel and Portland Group compilers are installed on the Opteron front ends.

Compiling a serial program

Compilation commands and modules to use for serial programs are as follows:

Compiler         Module                     Command
Portland C       pgi32 (32-bit binaries)    pgcc prog.c
                 pgi64 (64-bit binaries)
Portland C++     pgi32 (32-bit binaries)    pgCC prog.C
                 pgi64 (64-bit binaries)
Portland f77     pgi32 (32-bit binaries)    pgf77 prog.f
                 pgi64 (64-bit binaries)
Portland f90     pgi32 (32-bit binaries)    pgf90 prog.f90
                 pgi64 (64-bit binaries)
GNU C            None needed                gcc prog.c
GNU C++          None needed                g++ prog.C
GNU f77          None needed                f77 prog.f
Intel C          intelcc/8.1                icc prog.c
                 intelcce/9.1
Intel C++        intelcc/8.1                icpc prog.C
                 intelcce/9.1
Intel Fortran    intelfc/8.1                ifort prog.f   (f77)
                 intelfce/9.1               ifort prog.f90   (f90)

Compiling parallel programs with MPICH MPI

To compile a parallel program using MPICH MPI, choose the compiler you wish to use and load the modules for the compiler and the corresponding MPICH library.

The table shows which module to load, and which MPICH library works with each compiler.

Compiler         Compiler module            MPICH version   MPICH module                Elan enabled?
Portland Group   pgi64 (64-bit binaries)    1               mpich-1.2.7p1/pgi           No
                                            2               mpich2-1.0.3/pgi            No
GNU              None needed                1               mpich-elan/1.24-47-gnu      Yes
Intel            intelcce/9.1               2               mpich2-1.0.3/intel          No
                 intelfce/9.1
Intel            intelcce/9.1               1               mpich-elan/1.24-47-intel    Yes
                 intelfce/9.1

The commands to compile an MPICH program are the same regardless of the compiler type (GNU, Intel, Portland Group) used. Once the compiler and MPICH modules are loaded, use:

Compiler     Command
C mpicc
f77 mpif77
f90 mpif90

Compiling 32- vs. 64-bit binaries

By default, compiling a program on codon produces 64-bit binaries. This can be a problem if the code is not "64-bit clean". Things that cause a code to not be "64-bit clean" include:

  • casting a pointer to an integer and then back to a pointer
  • incrementing pointers by an integer offset.

If your code is not "64-bit clean", recompile it to produce 32-bit binaries.

Portland compilers
Load the pgi32 module.
GNU compilers
Use the -m32 compilation flag. (Using -m64 produces 64-bit binaries.)
gcc -m32 -o executable myprog.c
Intel compilers
The Intel compilers produce only 32-bit binaries.
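To confirm which kind of binary a compilation actually produced, one option is to read the ELF header directly: the byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The sketch below uses only od and tr; the function name elf_class is illustrative and not a cluster utility.

```shell
#!/bin/bash
# Print "32-bit", "64-bit", "unknown" or "not ELF" for the file given as $1.
elf_class() {
    # The first four bytes of an ELF file are 0x7f 'E' 'L' 'F'.
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        echo "not ELF"
        return 0
    fi

    # Read the single EI_CLASS byte at offset 4 as an unsigned decimal.
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown" ;;
    esac
}

# Example: inspect the system shell; substitute your own executable.
elf_class /bin/sh
```

Running elf_class on an executable built with gcc -m32 should report 32-bit, which gives a quick check that the flag or module had the intended effect.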

Third-party Software

For a complete list of installed third-party software, see the list of installed software by platform.

If you have a question about a software program that is not documented, contact us.

Running a Job on the Cluster

Batch access

The Simple Linux Utility for Resource Management (SLURM) scheduler controls all access to the cluster's compute nodes. Only one 'partition', or queue, exists.

SLURM commands are used to submit jobs, run executables and scripts, and manage compute jobs. A table showing PBS commands and their corresponding SLURM commands is given in the upgrade document, http://www.psc.edu/machines/opteron/OpteronUpgrade.html.

The SLURM commands are:

  • srun, queue and execute jobs
  • scancel, signal or cancel jobs
  • sinfo, view information about nodes or partitions
  • smap, graphically view information about jobs and partitions and set configuration parameters
  • squeue, view jobs in queue
  • scontrol, view and modify configuration and state

Man pages for all SLURM commands are also available on the Opteron cluster.

Scheduling policies

Jobs are executed in FIFO order. If a job cannot run because of insufficient resources, however, other jobs submitted subsequently can execute if there are sufficient resources for the later jobs.

There is no checkpointing for any job.

A running job does not share processors with other jobs.

The srun command

Use the srun command to submit a job script to SLURM. A job script consists of SLURM directives, comments, and executable statements. A discussion of job scripts for both serial and parallel jobs follows the description of the srun command.

The simplest version of the srun command is:

srun -b script.sh

Common options for the srun command are:

-t minutes or --time=minutes
This specifies the cpu time limit for the job in minutes.
-N num_nodes or --nodes=num_nodes
This specifies the minimum number of nodes the job will use. The job may be launched on more than num_nodes nodes. A maximum number of nodes to use can be given as --nodes=min_nodes-max_nodes.
-n processes or --ntasks=processes
Requests that srun allocate the given number of processes. The default is one process per node, but the -c flag can change that.
-o and -e
specifies pathnames for standard output and error, respectively. If only -o is given, -e defaults to the same.
-c ncpus or --cpus-per-task=ncpus
Requests ncpus per process. The default is one.
--mail-type=option
specifies if and when mail is sent about job execution. The option can be:
  • FAIL, mail is sent when the job is aborted by the system
  • BEGIN, mail is sent when the job begins execution
  • END, mail is sent when the job terminates
  • ALL, mail is sent when any of the above events occurs.
--mail-user=username
specifies the user to whom mail is sent about the job. If username is omitted, mail is sent to the job owner.
-P job-id
specifies that the current job may not begin until job job-id ends, with or without errors. No other types of dependencies can be specified.
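The options above can be combined on a single command line. The following sketch only assembles and prints such a command so that it can be checked before anything is submitted; the script name myjob.sh, the output file name and the job parameters are placeholders.

```shell
#!/bin/bash
# Placeholder job parameters; adjust for your own job.
minutes=30          # time limit, passed to -t
nodes=2             # minimum number of nodes, passed to -N
tasks=4             # number of processes, passed to -n
script=myjob.sh     # hypothetical batch script

# Build the command line. Standard output goes to a file named after the
# script; with only -o given, standard error goes to the same file.
srun_cmd="srun -b -t $minutes -N $nodes -n $tasks -o ${script%.sh}.out --mail-type=END $script"

# Print rather than run, so the line can be reviewed first.
echo "$srun_cmd"
```

With two CPUs per compute node, asking for four processes on two nodes fills both nodes completely.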

Running a job

Follow these steps to run a job:

  1. Get your source code and data files to one of the cluster's file systems with far or scp or create them there.
  2. Log in to the front end node with ssh.
  3. Compile your program.
  4. Create a script that executes your program and performs any other operations that your executable needs to run successfully.
  5. Make this script executable with the chmod command.
    chmod 755 yourscript
    
  6. Submit the script for execution with the srun command.
    srun -b  yourscript
    

Batch output--your job's standard output and standard error output--is returned to the directory from which you issued the srun command when your job finishes.

Sample job script

A sample script for a job is:

#!/usr/bin/csh
#SLURM -t 5
#SLURM -N 1
#SLURM -n 2

# No need to copy over files
# All input data and the executable are in this directory

srun  ./yourprog

The first line in the script designates the shell to use. The next 3 lines in the script are SLURM directives. The first states that this job has a maximum CPU time limit of 5 minutes. The second SLURM directive indicates that you want to use one node, and the third asks for two processors.

The other 2 lines that begin with '#' are comments. The '#' for comments and SLURM directives must be in column one of your script file.

The remaining line in the script uses the srun command to run your executable, which you should have previously compiled on bioinformatics or codon. You must use the full path to the executable. You could add other commands if you need to change directories, copy files into the working directory, store output data in a different directory, etc.

You can also specify your SLURM directives as command options to srun. Thus, you could omit the SLURM directives in the sample script above and submit the script with

srun -t 5 -N 1 -n 2 -b yourscript
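Since comments and SLURM directives must start in column one, it can help to check a script for indented '#' lines before submitting it. This is a small sketch of such a check; the function name check_column_one is illustrative and not part of SLURM.

```shell
#!/bin/bash
# List '#' lines that are preceded by whitespace. Returns non-zero if any
# are found, zero otherwise.
check_column_one() {
    # grep -n prints each offending line with its line number.
    if grep -n '^[[:space:]][[:space:]]*#' "$1"; then
        echo "$1: the lines listed above have whitespace before '#'" >&2
        return 1
    fi
    echo "$1: all '#' lines start in column one"
}

# Usage:  check_column_one yourscript
```

The check looks only at lines that begin with whitespace followed by '#', so a trailing comment after a command is not reported.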

Using both CPUs with Non-threaded Serial Jobs

Each node on PSC's Opteron cluster has 2 CPUs with 4 Gbytes of shared memory. When a non-threaded serial job runs, only a single CPU is used, and half of the node's computational capacity is wasted. Fortunately, if a single running instance of an executable consumes less than half of the memory available per node (i.e., 2 Gbytes), there is a straightforward way to use both CPUs in a node and so effectively double throughput.

Assume a serial job is submitted using the script run_1.sh with the command line:

[me@codon] srun -N1 -b -p nolimit run_1.sh

To run two executables per node, pick two serial jobs with their corresponding scripts, say run_1.sh and run_2.sh, and create a new script, e.g., run_both_jobs.sh, with the following content:

#!/bin/bash

# Use '&' to move the first job to the background
./run_1.sh &
./run_2.sh 

# Use 'wait' as a barrier to collect both executables when they are done.
wait

The scripts could, of course, be in different directories (just change the executable name to include the path, e.g., ./run_1.sh to /path/to/run_1.sh). They both also need to be made executable with, e.g., chmod u+x run_1.sh. Submit the new queue script via:

[me@codon] srun -N1 -b -p nolimit run_both_jobs.sh

Both run_1.sh and run_2.sh will now happily run on a single node, thereby maximizing throughput. Of course, once one of the two jobs is finished, the node will again only run the remaining single job until it is finished as well. Hence, it is useful to collect pairs of jobs with approximately equal runtimes, if possible.
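A possible variant of run_both_jobs.sh also records how each of the two jobs ended, by waiting on the individual process IDs instead of using a bare wait. This is a sketch under the same assumptions as above (two independent serial scripts); the function name run_pair is illustrative.

```shell
#!/bin/bash
# Run two commands side by side and report each exit status.
run_pair() {
    "$1" &              # first job in the background
    pid1=$!
    "$2" &              # second job in the background as well
    pid2=$!

    # 'wait PID' returns the exit status of that particular job.
    status1=0; wait "$pid1" || status1=$?
    status2=0; wait "$pid2" || status2=$?

    echo "$1 exited with status $status1"
    echo "$2 exited with status $status2"

    # Succeed only if both jobs succeeded.
    [ "$status1" -eq 0 ] && [ "$status2" -eq 0 ]
}

# Usage, inside the script submitted with srun -b:
#   run_pair ./run_1.sh ./run_2.sh
```

Putting both jobs in the background and waiting for each by process ID keeps the two exit statuses separate, so a failure of one job is not hidden by the success of the other.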

Sequence Analysis Jobs

Sequence analysis users will find it convenient to use the /biomed/lib/examples/.login.fixed login file. It adds to your default path directories where many sequence analysis packages are stored, sets up the environment for EMBOSS, and defines some environment variables for sequence analysis packages.

.login.fixed also contains other commands not specific to sequence analysis that set the prompt and define the terminal type.

Debugging

The GDB debugger is available on the Opteron cluster.

To use GDB, set SHELL to be /bin/bash or another standard shell, since the /usr/psc/shells are incompatible with GDB. For instance, if your login shell is /usr/psc/shells/bash, add the line

export SHELL=/bin/bash

to both your .bash_profile and .bashrc files.

To generate code that GDB can read, compile with the GNU compiler and the -g option:

gcc -g prog.c

Compiling with optimization on can complicate debugging; often, the values of local variables are lost. To avoid this, turn off optimization when compiling for debugging purposes (with gcc, the -O0 option does this).

GDB can read a core file. By default, a core file is NOT written when a program ends unsuccessfully on the Opteron cluster.

To have a core file created, use the ulimit (sh) or unlimit (csh) command:

ulimit -c unlimited
unlimit  coredumpsize

This command can be part of the job script, or invoked in the shell before the job script is submitted.
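In a job script written for sh or bash, the limit can be raised defensively so that the job still runs if the system refuses the request. This is a minimal sketch; the variable name core_limit is illustrative.

```shell
#!/bin/bash
# Try to enable core files; carry on if the hard limit forbids it.
if ulimit -c unlimited 2>/dev/null; then
    echo "core files enabled"
else
    echo "could not raise the core file size limit" >&2
fi

# Report the limit now in effect ("unlimited" or a size in blocks).
core_limit=$(ulimit -c)
echo "core file size limit: $core_limit"
```

The limit applies to the shell that sets it and to the programs it starts, which is why the command belongs in the job script itself or in the shell that submits it.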

Once the core file is created, invoke GDB with:

gdb executable corefile

Typing help once gdb starts gives extensive information on gdb usage and commands. A man page is also available.

Debugging a parallel program

To debug a parallel MPI program using gdb, first allow the cluster nodes to access your X server with:

xhost +operon01
xhost +operon02
...

on the workstation for each of the cluster nodes that the parallel program could potentially run on.

Then start the program, with each debugged process having its own window, using:

srun -b -n processes /bin/env DISPLAY=display_name xterm -e gdb prog

For example,

srun -b -n4 /bin/env DISPLAY=workstation.psc.edu:0 xterm -e gdb ./myprog

will create 4 windows on workstation.psc.edu's display, each containing a gdb debugger in which the gdb "run" command will start up myprog.

If a program takes many arguments, or there are several initial commands to give to gdb, avoid typing them in each window by using gdb's -x option to read them out of a file:

srun -b -n processes /bin/env DISPLAY=display_name xterm -e gdb -x cmds  yourprog

where the file cmds might contain something like:

break 19
break 430
break 1027
run arg1 arg2 arg3

Other SLURM commands

squeue

The squeue command displays the status of queued jobs. Use the -f option for a more extensive status listing. The -u username option displays only the jobs belonging to user username.

codon.psc.edu> squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2064       all    go.sh    jones   R 1-01:53:01      7 operon[13-18,20]
   2222       all alignace    smith   R       0:03      1 operon11
   2223       all fasta.sh    black   R       0:04      1 operon08
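The columnar output of squeue is easy to post-process with standard tools. As a sketch, the function below sums the NODES column per user with awk. To stay self-contained it reads a copy of the sample listing above instead of calling squeue; the function name nodes_per_user is illustrative.

```shell
#!/bin/bash
# Sum the NODES column (field 7) per USER (field 4), skipping the header.
nodes_per_user() {
    awk 'NR > 1 { nodes[$4] += $7 }
         END    { for (u in nodes) print u, nodes[u] }' | sort
}

# On the cluster:  squeue | nodes_per_user
# Here, the sample listing stands in for live output.
nodes_per_user <<'EOF'
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2064       all    go.sh    jones   R 1-01:53:01      7 operon[13-18,20]
   2222       all alignace    smith   R       0:03      1 operon11
   2223       all fasta.sh    black   R       0:04      1 operon08
EOF
```

For the sample listing this prints one line per user: black 1, jones 7 and smith 1.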

sinfo

The sinfo command displays information about nodes and partitions.

codon.psc.edu> sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
all*         up   infinite     1  down* operon10
all*         up   infinite     7  alloc operon[13-18,20]
all*         up   infinite    10   idle operon[01-08,11-12]
devel        up   infinite     2   idle bioinformatics,codon
papi         up   infinite     1   down operon19
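To find out how many nodes are free before submitting a job, the sinfo listing can be filtered by partition and state. The sketch below does this with awk on a copy of the sample output above; the function name idle_nodes is illustrative, and the trailing '*' on a partition name is assumed to mark the default partition.

```shell
#!/bin/bash
# Print the number of idle nodes in the partition named by $1.
# The partition column may carry a trailing '*', so match both forms.
idle_nodes() {
    awk -v part="$1" '
        NR > 1 && ($1 == part || $1 == (part "*")) && $5 == "idle" {
            total += $4
        }
        END { print total + 0 }'
}

# On the cluster:  sinfo | idle_nodes all
# Here, the sample listing stands in for live output.
idle_nodes all <<'EOF'
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
all*         up   infinite     1  down* operon10
all*         up   infinite     7  alloc operon[13-18,20]
all*         up   infinite    10   idle operon[01-08,11-12]
devel        up   infinite     2   idle bioinformatics,codon
papi         up   infinite     1   down operon19
EOF
```

For the sample listing this prints 10, the number of idle nodes in the all partition.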

scancel

The scancel command is used to signal or cancel jobs.

codon> squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2064       all    go.sh      smp   R 1-02:23:14      7 operon[13-18,20]
   2224       all  seqanal    nigra   R       0:12      1 operon11
   2225       all  seqsort    nigra   R       0:06      1 operon12
   2226       all    amber    nigra   R       0:02      1 operon01
codon> scancel 2225
codon> squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2064       all    go.sh      smp   R 1-02:23:28      7 operon[13-18,20]
   2224       all seqanal.    nigra   R       0:26      1 operon11
   2226       all amber.sh    nigra   R       0:16      1 operon01
codon> scancel -u nigra
codon> squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   2064       all    go.sh      smp   R 1-02:23:36      7 operon[13-18,20]

scontrol

The scontrol command is used to view and alter configurations for queued and running jobs.

codon> scontrol show nodes
NodeName=operon01 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon02 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon03 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon04 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon05 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon06 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon07 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon08 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon09 State=DOWN* CPUs=2 RealMemory=3016 TmpDisk=15748
   Weight=1 Features=(null) Reason=<-----Maintenance----->
NodeName=operon10 State=DOWN* CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=Not responding [slurm@06/29-12:36:58]
NodeName=operon11 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon12 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon13 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon14 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon15 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon16 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon17 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon18 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon19 State=DOWN CPUs=2 RealMemory=2007 TmpDisk=15748
   Weight=1 Features=(null) Reason=Low RealMemory [slurm@06/27-00:10:15]
NodeName=operon20 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=codon State=IDLE CPUs=2 RealMemory=6961 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=bioinformatics State=IDLE CPUs=2 RealMemory=6960 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)

See the man pages for these commands for additional information.
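The output of scontrol show nodes can be condensed to just the nodes that are down and the reason given for each. The following sketch assumes the two-line-per-node layout shown above and reads an excerpt of that listing; the function name down_nodes is illustrative.

```shell
#!/bin/bash
# Print "node: reason" for every node whose State starts with DOWN.
down_nodes() {
    awk '
        /^NodeName=/ {
            # First line of a node record: remember its name and state.
            name = $1;  sub(/^NodeName=/, "", name)
            state = $2; sub(/^State=/, "", state)
            next
        }
        state ~ /^DOWN/ && /Reason=/ {
            # Second line: keep everything after "Reason=".
            reason = $0; sub(/.*Reason=/, "", reason)
            print name ": " reason
        }'
}

# On the cluster:  scontrol show nodes | down_nodes
# Here, an excerpt of the sample listing stands in for live output.
down_nodes <<'EOF'
NodeName=operon08 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=(null)
NodeName=operon09 State=DOWN* CPUs=2 RealMemory=3016 TmpDisk=15748
   Weight=1 Features=(null) Reason=<-----Maintenance----->
NodeName=operon10 State=DOWN* CPUs=2 RealMemory=3017 TmpDisk=15748
   Weight=1 Features=(null) Reason=Not responding [slurm@06/29-12:36:58]
EOF
```

For this excerpt the function reports operon09 with the maintenance marker and operon10 with its "Not responding" reason, and prints nothing for operon08.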

SLURM documentation

SLURM documentation is available online at http://www.llnl.gov/linux/slurm/.