The Opteron Cluster
TABLE OF CONTENTS
- System Architecture and Configuration
- Stay Informed
- Access to the Cluster
- Accounting
- File Systems
- Transferring Files to the Cluster
- Module software
- Compilers and Other Programming Tools
- Compiling a serial program
- Compiling a parallel program
- Compiling 32- vs. 64-bit binaries
- Third-party software
- Running a Job on the Cluster
System Architecture and Configuration
PSC's Opteron cluster consists of three functional divisions: the cluster front ends which provide infrastructure and management services for the cluster, the computational nodes, and RAID storage nodes.
The cluster's two front ends are called codon.psc.edu and bioinformatics.psc.edu. Both nodes have dual 1.6 GHz AMD Opteron processors that share 8 Gbytes of memory. All compilations must be done on a front end.
Twenty compute nodes provide the computational power of the cluster. Each compute node has two 1.4 GHz AMD Opteron processors that share 4 Gbytes of memory. All program execution takes place on the compute nodes.
The two RAID servers provide 4 Tbytes of user and scratch space.
All nodes are connected via a Quadrics network. Intranode communications use the shared memory, while communications between nodes use Quadrics.
The compute nodes are controlled by the SLURM (Simple Linux Utility for Resource Management) scheduler, for both serial and parallel jobs.
Stay Informed
As a user of the Opteron cluster, it is imperative that you stay informed of changes to the machine's environment. Important information is posted to the Center's general bboards, which can be read through various facilities on most PSC systems.
Access to the Cluster
Getting an allocation
There are two types of grants available: starter grants and production grants. Starter grants are appropriate as precursors to large requests. Production grants are large awards for users with extensive computational requirements.
See http://www.psc.edu/grants/ for further information on submitting a proposal for a grant.
Connecting to the cluster
To access the opteron cluster, use ssh to connect to either bioinformatics.psc.edu or codon.psc.edu. You must use the ssh2 protocol.
Cluster passwords
Your cluster password is your PSC Kerberos password. It is the same on all the cluster nodes and on all PSC production systems. To change it, use the kpasswd command on a front end node; do not use the passwd command. A password changed with kpasswd on one system changes on all PSC production systems.
Accounting
Checking your usage
Accounting information for grants is also available at the PSC Grant Management System on the Web at https://grants.psc.edu/arms.
You will need your PSC Kerberos password to access this system. This system can provide more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.
File Systems
There are four file areas on the cluster.
- User home directories - /usr/users/n/username
- This is your home directory on the cluster, where n is a single digit. All the opteron cluster nodes and bioinformatics.psc.edu share this NFS file system. The total space for all users is 4 Tbytes. Quotas are not currently enforced; however, we ask you to police your own file usage on this system and we reserve the right to enforce quotas if the need arises.
- /scratch
- Each user has his own scratch space available. This space is shared among every node in the cluster. Scratch space is intended for temporary storage only and is not backed up.
- /local
- Temporary storage is available in /local on every compute node. Each node has its own /local system; they are not shared between nodes. However, on a given node, this directory is shared by all users. You can make a personal directory under /local, but be aware that /local is intended for temporary storage only and is not backed up.
- HSM
- The PSC Hierarchical Storage Manager is used for archival storage and is available from the Opteron cluster and also from BigBen, pople and salk.
The archiver runs on golem.psc.edu. The far (File ARchiver) interface includes the commands to store and retrieve files. More information on golem.psc.edu is available at http://www.psc.edu/general/filesys/far/.
We recommend that you store all of your important files in the HSM. Furthermore, far should only be used interactively, so that PEs are not tied up in a batch job waiting for file storage or retrieval.
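The advice on /local above can be sketched as a small helper for job scripts. This is only an illustration: the function name make_local_dir and the LOCAL_ROOT variable are assumptions introduced here (the latter so the helper can be tried away from the cluster); only the /local path itself comes from this document.

```shell
# Create a private working directory under /local and print its path.
# LOCAL_ROOT is a hypothetical override; on a compute node it defaults to /local.
make_local_dir() {
    root="${LOCAL_ROOT:-/local}"
    user="${USER:-$(id -un)}"
    dir="$root/$user"
    # /local is shared by all users on a node, so restrict access to the owner.
    mkdir -p "$dir" && chmod 700 "$dir" && printf '%s\n' "$dir"
}
```

In a job script, `workdir=$(make_local_dir)` followed by `cd "$workdir"` would place temporary files there. Because /local is not backed up, copy any results you need elsewhere before the job ends.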
Transferring Files to the Cluster
Either the secure copy program, scp, or far can be used to transfer files in and out of the cluster file space.
scp
When using scp to copy files into the cluster, connect to bioinformatics.psc.edu or codon.psc.edu.
The format for scp is:
scp source-filename target-filename
where the filename on the remote system (whether it is the target or the source) must be specified as:
username@system:filename
For example, to copy a file to the cluster when you are logged on to a different machine, type:
scp filename username@codon.psc.edu:filename
This command copies a file into your home directory.
To copy a file to the cluster from another system while logged into codon.psc.edu, type:
scp username@remote-system:filename filename
The first time you transfer files to/from another system, you will receive a message similar to:
Host key not found from list of known hosts. Are you sure you want to continue connecting (yes/no)?
Answer 'yes' to make the connection. The next time you connect to that host, you will not receive that message.
You will be prompted next for your password on the remote system. For more information on the scp command, see the scp man page.
scp is part of the ssh distribution.
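As a minimal sketch of the username@system:filename format described above, the following helpers build and print an scp command without running it. The function names remote_spec and scp_to_cluster are hypothetical and exist only for this illustration.

```shell
# Build the remote file specification username@system:filename.
remote_spec() {
    # $1 = username, $2 = system, $3 = filename
    printf '%s@%s:%s\n' "$1" "$2" "$3"
}

# Print (but do not run) the scp command that copies a local file
# to the same file name in your home directory on the cluster.
scp_to_cluster() {
    # $1 = local file, $2 = username; codon.psc.edu is one of the two front ends.
    printf 'scp %s %s\n' "$1" "$(remote_spec "$2" codon.psc.edu "$1")"
}
```

For instance, `scp_to_cluster data.txt jones` prints `scp data.txt jones@codon.psc.edu:data.txt`, which can be checked before it is typed or pasted.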
far
You can also move files between the PSC HSM and the cluster. See the section on file systems for details.
Module software
The Module package provides for the dynamic modification of a user's environment via module files. Module can be used:
- to manage multiple versions of applications, tools and libraries
- to manage software where complex changes to the environment are necessary
- to manage software where name conflicts with other software would cause problems
Loading a module for a particular piece of software often adds the path to the executable to $PATH, the path to the library to $LD_LIBRARY_PATH, and so on. Loading a module, then, relieves the user of having to remember or look up and type long path names.
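To make the effect concrete, here is roughly what loading a module might do by hand. The package name foo and the prefix /opt/foo are assumptions for illustration only, not real PSC paths, and an actual module file may change other variables as well.

```shell
# Hypothetical install prefix for a package called foo (not a real PSC path).
FOO_HOME=/opt/foo

# Put the package's directories first in the search paths.
# The ${VAR:+:$VAR} form avoids a stray colon when the variable is empty.
PATH="$FOO_HOME/bin${PATH:+:$PATH}"
LD_LIBRARY_PATH="$FOO_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH LD_LIBRARY_PATH
```

Unloading a module reverses changes of this kind, which is tedious and error-prone to do manually.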
Some useful module commands are:
- module avail
- lists available modules
- module list
- lists currently loaded modules
- module help foo
- help on module foo
- module whatis foo
- brief description of module foo
- module display foo
- displays the changes that are made to the environment by loading module foo without actually loading it.
- module load foo
- load module foo
- module unload foo
- unloads module foo and removes all changes that it made in the environment.
- module clear
- unloads all modules
Compilers and Other Programming Tools
GNU, Intel and Portland compilers are installed on the Opteron front ends.
Compiling a serial program
Compilation commands and modules to use for serial programs are as follows:
| Compiler | Module | Command |
|---|---|---|
| Portland C | pgi32 (32-bit binaries) or pgi64 (64-bit binaries) | pgcc prog.c |
| Portland C++ | pgi32 (32-bit binaries) or pgi64 (64-bit binaries) | pgCC prog.C |
| Portland f77 | pgi32 (32-bit binaries) or pgi64 (64-bit binaries) | pgf77 prog.f |
| Portland f90 | pgi32 (32-bit binaries) or pgi64 (64-bit binaries) | pgf90 prog.f90 |
| GNU C | None needed | gcc prog.c |
| GNU C++ | None needed | g++ prog.C |
| GNU f77 | None needed | f77 prog.f |
| Intel C | intelcc/8.1 or intelcce/9.1 | icc prog.c |
| Intel C++ | intelcc/8.1 or intelcce/9.1 | icpc prog.C |
| Intel Fortran | intelfc/8.1 or intelfce/9.1 | ifort prog.f (f77) or ifort prog.f90 (f90) |
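As a sketch of how the Portland rows of the table map source files to commands, the helper below prints the compile command for a given file. The function name pgi_compile_cmd is hypothetical, and the sketch assumes the pgi32 or pgi64 module has already been loaded.

```shell
# Print the Portland Group compile command for a source file, following the
# table above. Assumes the pgi32 or pgi64 module has already been loaded.
pgi_compile_cmd() {
    case "$1" in
        *.c)   printf 'pgcc %s\n' "$1" ;;
        *.C)   printf 'pgCC %s\n' "$1" ;;
        *.f)   printf 'pgf77 %s\n' "$1" ;;
        *.f90) printf 'pgf90 %s\n' "$1" ;;
        *)     printf 'unknown source type: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

Note that the match is case-sensitive, so prog.c is treated as C and prog.C as C++, as in the table.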
Compiling parallel programs with MPICH MPI
To compile a parallel program using MPICH MPI, choose the compiler you wish to use and load the modules for the compiler and the corresponding MPICH library.
The table shows which module to load, and which MPICH library works with each compiler.
| Compiler | Compiler module | MPICH version | MPICH module | Elan enabled? |
|---|---|---|---|---|
| Portland Group | pgi64 (64-bit binaries) | 1 | mpich-1.2.7p1/pgi | No |
| Portland Group | pgi64 (64-bit binaries) | 2 | mpich2-1.0.3/pgi | No |
| GNU | None needed | 1 | mpich-elan/1.24-47-gnu | Yes |
| Intel | intelcce/9.1, intelfce/9.1 | 2 | mpich2-1.0.3/intel | No |
| Intel | intelcce/9.1, intelfce/9.1 | 1 | mpich-elan/1.24-47-intel | Yes |
The commands to compile an MPICH program are the same regardless of the compiler type (GNU, Intel, Portland Group) used. Once the compiler and MPICH modules are loaded, use:
| Compiler | Command |
|---|---|
| C | mpicc |
| f77 | mpif77 |
| f90 | mpif90 |
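Putting the two tables together, one possible build sequence for a C program with the Portland Group compiler and MPICH2 might look like the output of the helper below. The function name mpi_build_steps is hypothetical; the module names are taken from the table above, and the helper only prints the steps so they can be reviewed first.

```shell
# Print the steps to build an MPI C program with the Portland Group compiler
# and MPICH2, using module names from the table above. Nothing is executed.
mpi_build_steps() {
    # $1 = C source file, $2 = name of the executable
    printf 'module load pgi64\n'
    printf 'module load mpich2-1.0.3/pgi\n'
    printf 'mpicc -o %s %s\n' "$2" "$1"
}
```

The same pattern should apply to the other compiler and MPICH combinations by substituting the corresponding module names.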
Compiling 32- vs. 64-bit binaries
By default, compiling a program on codon produces 64-bit binaries. This can be a problem if the code is not "64-bit clean". Things that cause a code to not be "64-bit clean" include:
- casting a pointer to an integer and then back to a pointer
- incrementing pointers by an integer offset.
If your code is not "64-bit clean", recompile it to produce 32-bit binaries.
- Portland compilers
- Load the pgi32 module.
- GNU compilers
- Use the -m32 compilation flag. (Using -m64 produces 64-bit binaries.)
gcc -m32 -o executable myprog.c
- Intel compilers
- The Intel compilers produce only 32-bit binaries.
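After recompiling, it can be useful to confirm which kind of binary was produced. The sketch below reads the class byte of the ELF header (offset 4, where 1 means 32-bit and 2 means 64-bit). The function name elf_class is an assumption for illustration, and the od options used are the standard POSIX ones.

```shell
# Report whether a file is a 32-bit or 64-bit ELF binary by reading
# the EI_CLASS byte (offset 4) of its header: 1 means 32-bit, 2 means 64-bit.
elf_class() {
    # The first four bytes of an ELF file are 0x7f followed by "ELF".
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    if [ "$magic" != "7f454c46" ]; then
        printf 'not an ELF file\n'
        return 1
    fi
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) printf '32-bit\n' ;;
        2) printf '64-bit\n' ;;
        *) printf 'unknown\n'; return 1 ;;
    esac
}
```

Running `elf_class ./executable` on the output of the gcc -m32 example above should report 32-bit.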
Third-party Software
For a complete list of installed third-party software, see the list of installed software by platform.
If you have a question about a software program that is not documented, contact us.
Running a Job on the Cluster
Batch access
The Simple Linux Utility for Resource Management (SLURM) scheduler controls all access to the cluster's compute nodes. Only one 'partition', or queue, exists.
SLURM commands are used to submit jobs, run executables and scripts, and manage compute jobs. A table showing PBS commands and their corresponding SLURM commands is given in the upgrade document, http://www.psc.edu/machines/opteron/OpteronUpgrade.html.
The SLURM commands are:
- srun, queue and execute jobs
- scancel, signal or cancel jobs
- sinfo, view information about nodes or partitions
- smap, graphically view information about jobs and partitions and set configuration parameters
- squeue, view jobs in queue
- scontrol, view and modify configuration and state
Man pages for all SLURM commands are available on the opteron cluster also.
Scheduling policies
Jobs are executed in FIFO order. If a job cannot run because of insufficient resources, however, other jobs submitted subsequently can execute if there are sufficient resources for the later jobs.
There is no checkpointing for any job.
A running job does not share processors with other jobs.
The srun command
Use the srun command to submit a job script to SLURM. A job script consists of SLURM directives, comments, and executable statements. A discussion of job scripts for both serial and parallel jobs follows the description of the srun command.
The simplest version of the srun command is:
srun -b script.sh
Common options for the srun command are:
- -t minutes or --time=minutes
- This specifies the cpu time limit for the job in minutes.
- -N num_nodes or --nodes=num_nodes
- This specifies the minimum number of nodes the job will use. The job may be launched on more than num_nodes. A maximum number of nodes to use can be given as --nodes=min_nodes-max_nodes.
- -n processes or --ntasks=processes
- Requests that srun allocate the given number of processes. The default is one process per node, but the -c flag can change that.
- -o and -e
- specifies pathnames for standard output and error, respectively. If only -o is given, -e defaults to the same.
- -c ncpus or --cpus-per-task=ncpus
- Requests ncpus per process. The default is one.
- --mail-type=option
- specifies if and when mail is sent about job execution. If the option is:
- FAIL, mail is sent when the job is aborted by the system
- BEGIN, mail is sent when the job begins execution
- END, mail is sent when the job terminates
- ALL, mail is sent when any of the above events occurs.
- --mail-user=username
- specifies the user to whom mail is sent about the job. If username is omitted, mail is sent to the job owner.
- -P job-id
- specifies that the current job may not begin until job job-id ends, with or without errors. No other types of dependencies can be specified.
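As a sketch of how several of these options combine, the helper below assembles and prints a batch submission line. The function name build_srun is hypothetical, and every value is supplied by the caller; nothing is submitted.

```shell
# Assemble (and print, not run) an srun batch submission from the options
# described above. All argument values are supplied by the caller.
build_srun() {
    # $1 = time limit in minutes, $2 = nodes, $3 = processes,
    # $4 = file for standard output, $5 = job script
    printf 'srun -b -t %s -N %s -n %s -o %s %s\n' "$1" "$2" "$3" "$4" "$5"
}
```

For example, `build_srun 30 2 4 job.out myjob.sh` prints `srun -b -t 30 -N 2 -n 4 -o job.out myjob.sh`. Because only -o is given, standard error would go to the same file.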
Running a job
Follow these steps to run a job:
- Get your source code and data files to one of the cluster's file systems with far or scp or create them there.
- Log in to the front end node with ssh.
- Compile your program
- Create a script that executes your program and performs any other operations that your executable needs to run successfully.
- Make this script executable with the chmod command.
chmod 755 yourscript
- Submit the script for execution with the srun command.
srun -b yourscript
Batch output--your job's standard output and standard error output--is returned to the directory from which you issued the srun command when your job finishes.
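The script-creation and chmod steps above can be sketched as a single helper. The function name make_job_script is hypothetical, and the generated script uses /bin/sh as an assumption; any shell available on the cluster could be used instead.

```shell
# Write a minimal job script for a program and make it executable,
# mirroring the "create a script" and chmod steps above.
make_job_script() {
    # $1 = path of the script to create, $2 = full path to the executable
    {
        printf '#!/bin/sh\n'
        printf '# Launch the executable on the allocated processors.\n'
        printf 'srun %s\n' "$2"
    } > "$1"
    chmod 755 "$1"
}
```

The resulting file could then be submitted with srun -b as in the last step.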
Sample job script
A sample script for a job is:
#!/usr/bin/csh
#SLURM -t 5
#SLURM -N 1
#SLURM -n 2
# No need to copy over files
# All input data and the executable are in this directory
srun ./yourprog
The first line in the script designates the shell to use. The next 3 lines in the script are SLURM directives. The first states that this job has a maximum CPU time limit of 5 minutes. The second SLURM directive indicates that you want to use one node, and the third asks for two processors.
The other 2 lines that begin with '#' are comments. The '#' for comments and SLURM directives must be in column one of your script file.
The remaining line in the script uses the srun command to run your executable, which you should have previously compiled on bioinformatics or codon. You must use the full path to the executable. You could have other commands if you need to change directories, copy in files to the working directory, store output data to a different directory, etc.
You can also specify your SLURM directives as command options to srun. Thus, you could omit the SLURM directives in the sample script above and submit the script with
srun -t 5 -N 1 -n 2 -b yourscript
Using both CPUs with Non-threaded Serial Jobs
Each node on PSC's Opteron cluster has 2 CPUs with 4 Gbytes of shared memory. When a non-threaded serial job runs, only a single CPU is used, and half of the node's computational capacity is wasted. Fortunately, if a single running instance of an executable consumes less than half of the memory available per node (i.e., 2 Gbytes), there is a straightforward way to use both CPUs in a node and hence effectively double throughput.
Assume a serial job is submitted using the script run_1.sh with the command line:
[me@codon] srun -N1 -b -p nolimit run_1.sh
To run two executables per node, simply pick two serial jobs with their corresponding scripts, run_1.sh and run_2.sh, say, and then generate a new script called, e.g., run_both_jobs.sh, with the following content:
#!/bin/bash
# Use '&' to move the first job to the background
./run_1.sh &
./run_2.sh
# Use 'wait' as a barrier to collect both executables when they are done.
wait
The scripts could, of course, be in different directories (just change the executable name to include the path, e.g., ./run_1.sh to /path/to/run_1.sh). They both also need to be made executable with, e.g., chmod u+x run_1.sh. Submit the new queue script via:
[me@codon] srun -N1 -b -p nolimit run_both_jobs.sh
Both run_1.sh and run_2.sh will now happily run on a single node, thereby maximizing throughput. Of course, once one of the two jobs is finished, the node will again only run the remaining single job until it is finished as well. Hence, it is useful to collect pairs of jobs with approximately equal runtimes, if possible.
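One possible refinement of this pattern, sketched below, runs both jobs in the background and records whether each one succeeded, so that a failure in either is not silently lost. The function name run_pair is hypothetical.

```shell
# Run two commands at the same time and wait for both.
# Returns 0 only if both commands succeeded.
run_pair() {
    "$1" &
    pid1=$!
    "$2" &
    pid2=$!
    status=0
    # "wait PID" returns the exit status of that particular background job.
    wait "$pid1" || status=1
    wait "$pid2" || status=1
    return "$status"
}
```

In a script like run_both_jobs.sh, the two job lines and the final wait could be replaced by `run_pair ./run_1.sh ./run_2.sh`.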
Sequence Analysis Jobs
Sequence analysis users will find it convenient to use the /biomed/lib/examples/.login.fixed login file. It adds to your default path directories where many sequence analysis packages are stored, sets up the environment for EMBOSS, and defines some environment variables for sequence analysis packages.
The .login.fixed file also contains other commands not specific to sequence analysis that set the prompt and define the terminal type.
Debugging
The GDB debugger is available on the opteron cluster.
To use GDB, set SHELL to be /bin/bash or another standard shell, since the /usr/psc/shells are incompatible with GDB. For instance, if your login shell is /usr/psc/shells/bash, add the line
export SHELL=/bin/bash
to both your .bash_profile and .bashrc files.
To generate code that GDB can read, compile with the GNU compiler and the -g option.
gcc -g prog.c
Compiling with optimization on can complicate debugging; often, the values of local variables are lost. To avoid this, turn off optimization when compiling for debugging purposes.
GDB reads a core file. By default, a core file is NOT written when a program ends unsuccessfully on the opteron cluster.
To have a core file created, use the ulimit (sh) or unlimit (csh) command:
ulimit -c unlimited
unlimit coredumpsize
This command can be part of the job script, or invoked in the shell before the job script is submitted.
Once the core file is created, invoke GDB with:
gdb executable corefile
Typing help once gdb starts gives extensive information on gdb usage and commands. A man page is also available.
Debugging a parallel program
To debug a parallel MPI program using gdb, first enable the cluster nodes to access your X server with:
xhost +operon01
xhost +operon02
...
on the workstation for each of the cluster nodes that the parallel program could potentially run on.
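Rather than typing one xhost line per node, the commands can be generated with a loop. The sketch below assumes the compute nodes are named operon01 through operon20, as in the sinfo and scontrol listings in this document; the function name xhost_commands is hypothetical, and it only prints the commands.

```shell
# Print an xhost command for every compute node, operon01 through operon20.
# The output can be reviewed and then run on the workstation.
xhost_commands() {
    i=1
    while [ "$i" -le 20 ]; do
        printf 'xhost +operon%02d\n' "$i"
        i=$((i + 1))
    done
}
```

On the workstation, `xhost_commands | sh` would run all twenty commands in turn.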
Then start the program, with each debugged process having its own window, using:
srun -b -nprocesses /bin/env DISPLAY=display_name xterm -e gdb prog
For example,
srun -b -n4 /bin/env DISPLAY=workstation.psc.edu:0 xterm -e gdb ./myprog
will create 4 windows on workstation.psc.edu's display, each containing a gdb debugger in which the gdb "run" command will start up myprog.
If a program takes many arguments, or there are several initial commands to give to gdb, avoid typing them in each window by using gdb's -x option to read them out of a file:
srun -b -nprocesses /bin/env DISPLAY=display_name xterm -e gdb -x cmds yourprog
where the file cmds might contain something like:
break 19
break 430
break 1027
run arg1 arg2 arg3
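A command file of this form can also be generated rather than typed. The helper below is a sketch under that assumption; the name write_gdb_cmds is hypothetical, and it writes only the two kinds of gdb commands shown above (break and run).

```shell
# Write a gdb command file: one "break" line per line number, then "run".
write_gdb_cmds() {
    # $1 = output file, $2 = program arguments (one string), rest = line numbers
    out=$1
    args=$2
    shift 2
    : > "$out"
    for line in "$@"; do
        printf 'break %s\n' "$line" >> "$out"
    done
    printf 'run %s\n' "$args" >> "$out"
}
```

For example, `write_gdb_cmds cmds "arg1 arg2 arg3" 19 430 1027` recreates the file shown above.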
Other SLURM commands
squeue
The squeue command is used to display the status of queued jobs. Use the -f option for a more extensive status listing. The -u username option displays the status of jobs user username is running.
codon.psc.edu> squeue
JOBID PARTITION     NAME   USER ST       TIME NODES NODELIST(REASON)
 2064       all    go.sh  jones  R 1-01:53:01     7 operon[13-18,20]
 2222       all alignace  smith  R       0:03     1 operon11
 2223       all fasta.sh  black  R       0:04     1 operon08
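Output in this form is easy to summarize with standard tools. As a sketch, the helper below counts jobs per user; it assumes the column layout shown above, with USER as the fourth field, and the function name jobs_per_user is hypothetical.

```shell
# Count jobs per user from squeue output read on standard input.
# Assumes the column layout shown above: USER is the fourth field.
jobs_per_user() {
    # Skip the header line, tally field 4, and sort for stable output.
    awk 'NR > 1 { count[$4]++ } END { for (u in count) print u, count[u] }' | sort
}
```

It would be used as `squeue | jobs_per_user`.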
sinfo
The sinfo command displays information about nodes and partitions.
codon.psc.edu> sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all*      up    infinite      1 down* operon10
all*      up    infinite      7 alloc operon[13-18,20]
all*      up    infinite     10 idle  operon[01-08,11-12]
devel     up    infinite      2 idle  bioinformatics,codon
papi      up    infinite      1 down  operon19
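Before submitting a job, it may help to know how many nodes are free. The sketch below sums the idle nodes in one partition from sinfo output. It assumes the layout shown above (PARTITION in field 1, NODES in field 4, STATE in field 5); the function name idle_nodes is hypothetical.

```shell
# Sum the idle nodes in one partition from sinfo output on standard input.
# Assumes the layout shown above: PARTITION is field 1, NODES field 4, STATE field 5.
idle_nodes() {
    # $1 = partition name without the trailing "*" that marks the default
    awk -v part="$1" '
        NR > 1 {
            name = $1
            sub(/\*$/, "", name)
            if (name == part && $5 == "idle") total += $4
        }
        END { print total + 0 }
    '
}
```

With the listing above as input, `sinfo | idle_nodes all` would print 10.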
scancel
The scancel command is used to signal or cancel jobs.
codon> squeue
JOBID PARTITION     NAME   USER ST       TIME NODES NODELIST(REASON)
 2064       all    go.sh    smp  R 1-02:23:14     7 operon[13-18,20]
 2224       all  seqanal  nigra  R       0:12     1 operon11
 2225       all  seqsort  nigra  R       0:06     1 operon12
 2226       all    amber  nigra  R       0:02     1 operon01
codon> scancel 2225
codon> squeue
JOBID PARTITION     NAME   USER ST       TIME NODES NODELIST(REASON)
 2064       all    go.sh    smp  R 1-02:23:28     7 operon[13-18,20]
 2224       all seqanal.  nigra  R       0:26     1 operon11
 2226       all amber.sh  nigra  R       0:16     1 operon01
codon> scancel -u nigra
codon> squeue
JOBID PARTITION     NAME   USER ST       TIME NODES NODELIST(REASON)
 2064       all    go.sh    smp  R 1-02:23:36     7 operon[13-18,20]
scontrol
The scontrol command is used to view and alter configurations for queued and running jobs.
codon> scontrol show nodes
NodeName=operon01 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon02 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon03 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon04 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon05 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon06 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon07 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon08 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon09 State=DOWN* CPUs=2 RealMemory=3016 TmpDisk=15748 Weight=1 Features=(null) Reason=<-----Maintenance----->
NodeName=operon10 State=DOWN* CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=Not responding [slurm@06/29-12:36:58]
NodeName=operon11 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon12 State=IDLE CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon13 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon14 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon15 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon16 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon17 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon18 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=operon19 State=DOWN CPUs=2 RealMemory=2007 TmpDisk=15748 Weight=1 Features=(null) Reason=Low RealMemory [slurm@06/27-00:10:15]
NodeName=operon20 State=ALLOCATED CPUs=2 RealMemory=3017 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=codon State=IDLE CPUs=2 RealMemory=6961 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
NodeName=bioinformatics State=IDLE CPUs=2 RealMemory=6960 TmpDisk=15748 Weight=1 Features=(null) Reason=(null)
See the man pages for these commands for additional information.
SLURM documentation
SLURM documentation is available online at http://www.llnl.gov/linux/slurm/.