- System Configuration
- Access to Axon
- Storing Files
- Transferring files
- Creating Programs
- Running Jobs
- Monitoring and Killing Jobs
- Software Packages
- The Module Command
- Stay Informed
- Reporting a Problem
System Configuration
Axon is a 32-node cluster. Each node is a Supermicro 6015W-INFB SuperServer and has 2 Intel Xeon E5420 2.5 GHz quad-core processors, for a total of 256 cores on the machine. The 8 cores on a node share 8 Gbytes of memory, although 4 of the nodes have 16 Gbytes. The nodes are interconnected by an SDR InfiniBand communications link. Axon runs a version of the CentOS Linux operating system.
There are multiple frontend nodes, which are also Supermicro 6015W-INFB SuperServers with Intel Xeon E5420 processors and which run the same version of CentOS Linux as the compute nodes. You login to one of these frontend nodes, not to the compute nodes.
GNU and Intel C, C++ and Fortran compilers are installed on axon, as are the facilities to enable you to run MPI programs.
Access to Axon
Getting an account on axon
The axon cluster is for internal use by PSC's Biomedical Applications Group and their collaborators.
Connecting to axon
To connect to axon you must ssh to axon.psc.edu. When you are prompted for a password enter your PSC Kerberos password.
Changing your password
There are two ways to change or reset your PSC Kerberos password:
- Use the web-based PSC password change utility
- Use the kpasswd command to change your PSC Kerberos password. Do not use the passwd command.
You have the same password on all PSC production platforms. When you change your password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.
PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least 3 of the character classes:
- lower-case letters
- upper-case letters
- digits
- special characters, excluding ' and "
Finally, they must not be the same as any of your previous passwords.
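The length and character-class rules above can be checked before you type a new password into kpasswd. The following is a minimal sketch, not PSC's actual validator: the function name check_psc_password is hypothetical, digits are assumed to form the fourth character class, and ' and " are assumed simply not to count as special characters. It cannot check the rule about previous passwords.

```shell
# Sketch of the documented password rules; not PSC's actual validator.
# Assumption: the character classes are lower-case letters, upper-case
# letters, digits, and special characters other than ' and ".
check_psc_password() {
    pw=$1
    # Rule 1: at least 8 characters.
    [ "${#pw}" -ge 8 ] || return 1

    classes=0
    # Count how many character classes appear at least once.
    if printf '%s' "$pw" | LC_ALL=C grep -q '[a-z]'; then classes=$((classes + 1)); fi
    if printf '%s' "$pw" | LC_ALL=C grep -q '[A-Z]'; then classes=$((classes + 1)); fi
    if printf '%s' "$pw" | LC_ALL=C grep -q '[0-9]'; then classes=$((classes + 1)); fi
    # Special characters: whatever remains after deleting letters, digits, ' and ".
    rest=$(printf '%s' "$pw" | LC_ALL=C tr -d 'A-Za-z0-9' | tr -d "'\"")
    if [ -n "$rest" ]; then classes=$((classes + 1)); fi

    # Rule 2: characters from at least 3 of the classes.
    [ "$classes" -ge 3 ]
}
```

For example, check_psc_password 'Abcdefg1' succeeds (three classes, 8 characters), while check_psc_password 'abcdefgh' fails because only one class is present.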
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Storing Files
File systems are file storage spaces directly connected to a system. There are currently four such areas available to you on axon.
$HOME is your home directory. It is visible to all of axon's compute and frontend nodes. $HOME is backed up daily, although it is still a good idea to copy your important $HOME files to the Data Supercell (patent pending). The Data Supercell, PSC's file archival system, is discussed below.
There are three large, fast, parallel file storage spaces on axon, $WORK, $SCRATCH and $BESSEMER, which you can use as working space and scratch space. Together, $HOME, $WORK and $SCRATCH can store 30 Tbytes of data, while $BESSEMER can store 140 Tbytes of data. Access to $WORK and $SCRATCH will be faster than access to $BESSEMER. All three spaces are visible to all of axon's frontend and compute nodes. You should use the variable names to refer to these file spaces, because the underlying paths may change.
None of these areas should be considered permanent file storage space. None of the three areas is backed up. In addition, files can remain on $SCRATCH and $BESSEMER for at most 7 days before they are deleted. Thus, you should copy the files you create in these areas to your local site or to the Data Supercell as soon as you can. The Data Supercell, PSC's file archival system, is discussed below.
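Because files on $SCRATCH and $BESSEMER are deleted after 7 days, it can help to list the files that are getting old so you can copy them elsewhere in time. The sketch below assumes that deletion is based on modification time, which this guide does not state; list_aging_files is a hypothetical helper name.

```shell
# List regular files under a directory that have not been modified recently.
# Assumption: the 7-day limit is measured from modification time.
# find rounds a file's age down to whole days, so "-mtime +4" matches
# files that are at least 5 days old.
list_aging_files() {
    dir=$1
    days=${2:-4}
    find "$dir" -type f -mtime +"$days" -print
}
```

A typical call would be list_aging_files "$SCRATCH" 4, whose output you could then copy to your local site or to the Data Supercell.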
If your program reads or writes large files you should use $WORK, $SCRATCH or $BESSEMER. The $WORK, $SCRATCH and $BESSEMER file spaces are implemented using the Lustre parallel file system. A program that uses $WORK, $SCRATCH or $BESSEMER can perform parallel IO and thus can significantly improve its performance. File striping can be used to tune your parallel IO performance and is particularly effective for files that are 1 Gbyte or larger.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. This is how you can use Lustre as a parallel file system. $WORK and $SCRATCH have 4 OSTs, while $BESSEMER has 24 OSTs.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, striping has other disadvantages besides possible IO contention, and you must consider them when making your striping decisions. Many interactive file commands such as ls -l or unlink will take longer for striped files. Also, striped files are more at risk for data loss due to hardware failure. If a file is spread across several OSTs, a hardware failure of any of them will result in the loss of part of the data in that file. You may prefer to lose a small number of whole files rather than parts of a large number of files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is
lfs setstripe filename stripe-size OST-start stripe-count
We recommend that you always set the stripe size parameter to 0 and the starting OST parameter to -1. This will result in the default stripe size of 1 Mbyte and assign your starting OST in a round-robin fashion. A value of -1 for the stripe count means the file should be spread across all the available OSTs.
For example, the command
lfs setstripe bigfile.out 0 -1 -1
sets the stripe count for bigfile.out to be all available OSTs. The command
lfs setstripe manyfiles.out 0 -1 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application. However, we do recommend that you use a stripe count that is less than 8.
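The two extreme cases can be captured in a small helper that prints, rather than runs, the lfs setstripe command in the positional form given above. This is a sketch only: stripe_cmd is a hypothetical name, and the file names are the ones used in the examples above. Newer Lustre releases take option flags for the stripe count, size and starting OST instead of positional arguments, so check man lfs on the system before relying on this form.

```shell
# Print, rather than run, an lfs setstripe command in the positional form
# described above: lfs setstripe <target> <stripe-size> <OST-start> <stripe-count>
stripe_cmd() {
    target=$1
    count=${2:--1}    # default -1: spread the file across all available OSTs
    # Stripe size 0 selects the default (1 Mbyte); OST start -1 is round-robin.
    printf 'lfs setstripe %s 0 -1 %s\n' "$target" "$count"
}

# One core does all the IO to one large file: stripe across every OST.
stripe_cmd bigfile.out -1
# Each core writes its own file: keep every file on a single OST.
stripe_cmd manyfiles.out 1
```

Printing the command first lets you inspect it; remember that the striping must be set before the file is created.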
File repositories are file storage spaces which are not directly connected to a frontend or compute processor. You cannot, for example, open a file that resides in a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on axon: the Data Supercell (patent pending), PSC's disk-based archival system.
Creating Programs
GNU and Intel C, C++ and Fortran compilers are installed on axon and they can be used to create MPI and serial programs. The commands you should use to create your programs are shown in the table below.
|Compiler||MPI program||Serial program|
|GNU Fortran||mpif90 mympi.f90||gfortran myserial.f90|
|GNU C||mpicc mympi.c||gcc myserial.c|
|GNU C++||mpicxx mympi.C||g++ myserial.C|
|Intel Fortran||mpif90 mympi.f90||ifort myserial.f90|
|Intel C||mpicc mympi.c||icc myserial.c|
|Intel C++||mpicxx mympi.C||icpc myserial.C|
To use the Intel C and C++ compilers for either MPI or serial programs you must first issue the command
module load intel/11.081c
To use the Intel Fortran compiler for either MPI or serial programs you must first issue the command
module load intel/11.081fort
The flavor of MPI on axon is OpenMPI. This version will use the InfiniBand interconnect for inter-process communication.
To use the Intel C compiler to create MPI programs you must first set the variable MPICH_CC to the value "icc". To use the Intel C++ compiler to create MPI programs you must first set the variable MPICH_CXX to "icpc". To use the Intel Fortran compiler to create MPI programs you must first set the variable MPICH_F90 to "ifort".
Man pages for the GNU compilers are available with the commands man gfortran, man gcc and man g++. Once you load the appropriate module, the man pages for ifort, icc, and icpc are available.
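The compiler table, the module commands and the MPICH_* variables described above can be combined into one lookup. The sketch below only prints the commands for a given compiler family, language and program type; compile_steps is a hypothetical helper, and the export lines use Bourne-shell syntax (C-shell users would use setenv instead).

```shell
# Print the setup and compile commands for one row of the compiler table.
# Usage: compile_steps <gnu|intel> <c|c++|fortran> <mpi|serial> <source-file>
compile_steps() {
    family=$1; lang=$2; mode=$3; src=$4
    case "$family:$lang" in
        gnu:c)         serial=gcc;      wrapper=mpicc;  module=;                 var= ;;
        gnu:c++)       serial=g++;      wrapper=mpicxx; module=;                 var= ;;
        gnu:fortran)   serial=gfortran; wrapper=mpif90; module=;                 var= ;;
        intel:c)       serial=icc;      wrapper=mpicc;  module=intel/11.081c;    var=MPICH_CC ;;
        intel:c++)     serial=icpc;     wrapper=mpicxx; module=intel/11.081c;    var=MPICH_CXX ;;
        intel:fortran) serial=ifort;    wrapper=mpif90; module=intel/11.081fort; var=MPICH_F90 ;;
        *) echo "unknown compiler: $family $lang" >&2; return 1 ;;
    esac
    # The Intel compilers need their module loaded first.
    if [ -n "$module" ]; then echo "module load $module"; fi
    if [ "$mode" = mpi ]; then
        # For Intel MPI builds, point the MPI wrapper at the Intel compiler.
        if [ -n "$var" ]; then echo "export $var=$serial"; fi
        echo "$wrapper $src"
    else
        echo "$serial $src"
    fi
}
```

For instance, compile_steps intel c mpi mympi.c prints the module load line, the MPICH_CC setting and the mpicc command, in that order.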
Running Jobs
The Portable Batch System (PBS) controls all access to axon's compute nodes, for batch and interactive jobs, both parallel and serial.
The queue structure has been updated to reflect the resources needed for a job. Queues are defined either by physical memory capacity or CUDA compute capabilities.
Jobs will default to the gb16 queue if no queue is explicitly specified. The former queue names (batch, bigmem and gpu) are deprecated and will eventually be retired.
|Queue||Memory per node||Cores per node||CPU||CUDA capability||CUDA cores per node||Max. nodes per job|
|gb16||16GB||8||E5420 (2.5 GHz)||none||-||28|
|gb32||32GB||8||E5420 (2.5 GHz)||none||-||4|
|gb64||64GB||12||E5-2630v2 (2.6 GHz)||none||-||5*|
|cc20||64GB||12||E5-2630v2 (2.6 GHz)||2.0||1792 (4 x Tesla S2070)||2|
|cc30||64GB||12||E5-2630v2 (2.6 GHz)||3.0||1536 (8 x Quadro K600)||2|
* At times only 2-4 of these nodes may be available.
Interactive jobs run in the batch queue and the method for doing so is discussed below.
Sample MPI batch job
To run a batch job on axon you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an MPI program is
#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q gb16
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun ./mympi
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command in your batch script, then you must include the -l option to your shell command.
The four #PBS lines are PBS directives.
- #PBS -l nodes=2:ppn=8
This directive, along with the -np option to the mpirun command, determines how your processes are allocated across your nodes. The value of nodes indicates the total number of nodes to allocate to your job. The value of nodes must be between 1 and 32. The value of ppn indicates the number of processes to allocate on a node before moving on to the allocation of processes on your next node. The value of ppn must be between 1 and 8. The value of the -np option to mpirun is the total number of processes to allocate. The default value for -np is your value for nodes times your value for ppn. You will probably often use the default value for -np.
For example, suppose you want to allocate 16 processes on 2 nodes in a block manner, which means your first 8 processes are allocated to your first node and your second 8 processes are allocated to your second node. Then you would use the nodes and ppn values given in the sample script and omit the -np option to mpirun.
However, if you want to allocate these 16 processes in a cyclic manner then you would use the PBS specification
#PBS -l nodes=2:ppn=1
and you would give the -np option to mpirun a value of 16. This would allocate your first process to your first node, your second process to your second node, your third process to your first node, your fourth process to your second node, and so on, until all 16 processes are allocated. You must use the -np option to mpirun or the system will think you only want to allocate 2 processes.
You may want to allocate fewer than 8 processes per node so you have fewer processes dividing up the 8 or 16 Gbytes of memory available on a node. For example, the PBS specification
#PBS -l nodes=2:ppn=4
would allocate only 4 processes for each of your two nodes, if you do not use the -np option to mpirun. Since jobs do not share nodes, you will still pay for the entire node even though you are not using all 8 cores on the node, but you do have access to the entire memory on the node.
- #PBS -l walltime=5:00
The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
- #PBS -q gb16
The final PBS directive requests that your job be run in the gb16 queue.
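The block and cyclic layouts described for the nodes and ppn values can be modeled with a few lines of shell. This is only a sketch of the allocation rule as the text states it (place ppn processes on a node, then move to the next node, wrapping around after the last one), not a description of how mpirun is implemented. The function name placement is hypothetical, and nodes are numbered from 0.

```shell
# Print the node index (0-based) that each process rank lands on.
# Usage: placement <nodes> <ppn> [np]   (np defaults to nodes * ppn)
placement() {
    nodes=$1; ppn=$2; np=${3:-$((nodes * ppn))}
    out=""
    rank=0
    while [ "$rank" -lt "$np" ]; do
        # ppn consecutive ranks share a node; wrap after the last node.
        out="$out$(( (rank / ppn) % nodes )) "
        rank=$((rank + 1))
    done
    echo "${out% }"
}

placement 2 8      # block: first 8 ranks on node 0, next 8 on node 1
placement 2 1 16   # cyclic: ranks alternate between node 0 and node 1
placement 2 4      # 4 processes on each of the two nodes
```

The second call corresponds to nodes=2:ppn=1 with -np 16; without the third argument the model, like the system, would place only 2 processes.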
The remaining lines in the script are comments and command lines.
- set echo
This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants use set -x instead.
- Comment lines
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- mpirun ./mympi
This command launches your executable on axon's compute nodes. You must use mpirun to run your MPI executable or it will run on a frontend node and degrade overall system performance.
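For users of the Bourne shell family, the sample MPI script changes in two places: the shell line needs the -l option, and set -x takes the place of set echo as the closest analogue. The sketch below is a hypothetical generator, make_mpi_job, that writes such a script to standard output from the values discussed above; it assumes bash is installed as /bin/bash on axon, which this guide does not confirm.

```shell
# Write a bash version of the sample MPI job script to standard output.
# Usage: make_mpi_job <nodes> <ppn> <walltime> <queue> <executable>
make_mpi_job() {
    cat <<EOF
#!/bin/bash -l
#PBS -l nodes=$1:ppn=$2
#PBS -l walltime=$3
#PBS -j oe
#PBS -q $4
# trace each command in the batch output (Bourne-shell analogue of "set echo")
set -x
# move to my \$SCRATCH directory
cd \$SCRATCH
# copy executable to \$SCRATCH
cp \$HOME/$5 .
# run my executable
mpirun ./$5
EOF
}
```

You could save the result with make_mpi_job 2 8 5:00 gb16 mympi > myscript.job and then submit myscript.job with qsub. The here-document ends with a newline, which satisfies the requirement on the last line of a batch script.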
Sample serial batch job
A sample script to run a serial job is
#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q gb16
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myserial .
#run my executable
./myserial
After you create your batch script you submit it to PBS with the qsub command.
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample script and submit the script with the command
qsub -l nodes=2:ppn=8 -l walltime=5:00 -j oe -q gb16 myscript.job
Command-line directives override directives in your scripts.
A form of interactive access is available on axon by using the -I option to qsub. For example, the command
qsub -I -q gb16 -l nodes=2:ppn=8 -l walltime=5:00
requests interactive access to 16 cores for 5 minutes in the gb16 queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on axon.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge_id you want a job to be charged. If you only have one grant on axon you do not need to use this option; otherwise, you should charge each job to the appropriate grant.
You can see your valid charge_ids by typing
groups
at the axon prompt. Typical output will look like
sy2be6n ec3l53p eb3267p jb3l60q
Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify -W group_list for your job, this is the grant that will be charged.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after: this job can be scheduled after job jobid begins execution.
afterok: this job can be scheduled after job jobid finishes successfully.
afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
afterany: this job can be scheduled after job jobid finishes in any state.
before: this job must begin execution before job jobid can be scheduled.
beforeok: this job must finish successfully before job jobid begins.
beforenotok: this job must finish unsuccessfully before job jobid begins.
beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
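A common use of the depend option is to chain two jobs so that the second starts only if the first succeeds. The sketch below relies on qsub printing the id of the newly submitted job on standard output; submit_chain and the script names in the usage note are hypothetical.

```shell
# Submit two job scripts so that the second runs only if the first
# finishes successfully.
# Usage: submit_chain <first-script> <second-script>
submit_chain() {
    # qsub prints the id of the job it has just queued.
    first_id=$(qsub "$1") || return 1
    # afterok: schedule the second job only after the first ends successfully.
    qsub -W depend=afterok:"$first_id" "$2"
}
```

For example, submit_chain prepare.job analyze.job queues both scripts at once and prints the id of the second job. Using afterany instead of afterok would run the second job regardless of how the first one ends.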
Monitoring and Killing Jobs
The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.
The qdel command is used to kill queued and running jobs. An example is the command
The Module Command
To run many software packages, paths and other variables must often first be set up. To change versions of a package, these definitions must often be modified. The module command makes these processes easier. For information on the module command, including its use in batch jobs, see the module documentation.
Stay Informed
As a user of axon, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently.
You will also periodically receive email from PSC with information about axon. To ensure that you receive this email, make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
Reporting a Problem
You have two options for reporting problems on axon.
You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.