- System Configuration
- Access to Warhol
- Storing Files
- Transferring files
- Creating Programs
- Running Jobs
- Monitoring and Killing Jobs
- Software Packages
- The Module Command
- Stay Informed
- Acknowledgement in Publications
- Reporting a Problem
Warhol is an 8-node Hewlett-Packard BladeSystem c3000. Each node has 2 Intel E5440 quad-core 2.83 GHz processors, for a total of 64 cores on the machine. The 8 cores on a node share 16 Gbytes of memory. The nodes are interconnected by an InfiniBand communications link. Warhol runs a version of the CentOS Linux operating system.
There are multiple frontend nodes, which are also Intel E5440 processors and which run the same version of CentOS Linux as the compute nodes. You login to one of these frontend nodes, not to the compute nodes.
GNU and Intel C, C++ and Fortran compilers are installed on warhol, as are the facilities to enable you to run MPI and OpenMP programs.
Access to Warhol
Connecting to warhol
To connect to warhol you must ssh to warhol.psc.edu. When you are prompted for a password enter your PSC Kerberos password.
Changing your password
There are two ways to change or reset your PSC Kerberos password:
- Use the web-based PSC password change utility
- Use the kpasswd command to change your PSC Kerberos password. Do not use the passwd command.
You have the same password on all PSC production platforms. When you change your password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.
PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least 3 of the following character classes:
- lower-case letters
- upper-case letters
- digits
- special characters, excluding ' and "
Finally, they must not be the same as any of your previous passwords.
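A quick way to sanity-check a candidate password against the class rule is a small shell sketch like the one below. It is only an illustration, not PSC's actual validator, and it assumes the usual four classes (lower-case letters, upper-case letters, digits and special characters).

```shell
# Count how many character classes a candidate password draws from.
# Illustration only; this is not PSC's actual password validator.
count_classes() {
  pw=$1
  n=0
  case "$pw" in *[a-z]*) n=$((n+1));; esac         # lower-case letters
  case "$pw" in *[A-Z]*) n=$((n+1));; esac         # upper-case letters
  case "$pw" in *[0-9]*) n=$((n+1));; esac         # digits
  case "$pw" in *[!a-zA-Z0-9]*) n=$((n+1));; esac  # special characters
  echo $n
}
count_classes 'Example#42'   # prints 4
```

A password passes the class rule when the count is at least 3 and the password is at least 8 characters long.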
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on warhol
One core-hour on warhol is one SU.
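Because jobs do not share nodes (see Running Jobs below), a job is charged for every core on the nodes it holds, whether or not it uses them all. A minimal sketch of the charge arithmetic:

```shell
# SUs charged = nodes x 8 cores per node x hours of walltime,
# since a job is charged for whole nodes whether or not it uses every core.
nodes=2; hours=3
sus=$((nodes * 8 * hours))
echo "$sus SUs"   # 48 SUs
```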
User accounting data is available with the xbanner command. It displays account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant.
Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to grants PIs. The system has extensive internal documentation.
File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on warhol.
$HOME
This is your home directory. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all of warhol's compute and frontend nodes. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to the Data Supercell. The Data Supercell, PSC's file archival system, is discussed below.
$SCRATCH
This is warhol's scratch area to be used as a working space for your running jobs. This area has 4 Tbytes of space. $SCRATCH is visible to all of warhol's compute and frontend nodes. You should use the name $SCRATCH to refer to your scratch area since we may change its implementation.
$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 7 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to the Data Supercell as soon as you can after you create them. The Data Supercell, PSC's file archival system, is discussed below.
File repositories are file storage spaces which are not directly connected to a frontend or compute processor. You cannot, for example, open a file that resides in a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on warhol: the Data Supercell, PSC's file archival system.
- The Data Supercell (patent pending)
The Data Supercell is a complex disk-based archival system.
GNU and Intel C, C++ and Fortran compilers are installed on warhol and they can be used to create MPI and OpenMP programs. The commands you should use to create your programs are shown in the table below.
|Compiler|MPI|OpenMP|Hybrid MPI and OpenMP|Serial|
|GNU Fortran|mpif90 mympi.f90|gfortran -fopenmp myopenmp.f90|mpif90 -fopenmp myhybrid.f90|gfortran myserial.f90|
|GNU C|mpicc mympi.c|gcc -fopenmp myopenmp.c|mpicc -fopenmp myhybrid.c|gcc myserial.c|
|GNU C++|mpiCC mympi.C|g++ -fopenmp myopenmp.C|mpiCC -fopenmp myhybrid.C|g++ myserial.C|
|Intel Fortran|mpif90 mympi.f90|ifort -openmp myopenmp.f90|mpif90 -openmp myhybrid.f90|ifort myserial.f90|
|Intel C|mpicc mympi.c|icc -openmp myopenmp.c|mpicc -openmp myhybrid.c|icc myserial.c|
|Intel C++|mpiCC mympi.C|icpc -openmp myopenmp.C|mpiCC -openmp myhybrid.C|icpc myserial.C|
Three flavors of MPI are available on warhol: OpenMPI, MVAPICH and MVAPICH2. Which flavor you use is determined by which MPI module you have loaded. The default module is the openmpi_gcc module, which is for creating OpenMPI programs using the GNU compilers. If you want to use OpenMPI and the Intel compilers you should issue the command
module swap openmpi_gcc openmpi_intel
before you build your executable. If you want to use MVAPICH you should issue the command
module swap openmpi_gcc mvapich_gcc
module swap openmpi_gcc mvapich_intel
depending on whether you want to use the GNU or Intel compilers. If you want to use MVAPICH2 you should issue the command
module swap openmpi_gcc mvapich2_gcc
module swap openmpi_gcc mvapich2_intel
depending on whether you want to use the GNU or Intel compilers.
We have found that MVAPICH and MVAPICH2 perform better than OpenMPI for some applications. You should try all three flavors to see which performs best for you. MVAPICH2 supports the MPI-2 additions to the MPI standard.
The commands to create MPI programs are wrapper commands. You do not execute the compilers directly. To run the Intel Fortran compiler directly you must first issue the command
module load ifort
To run the Intel C or C++ compilers directly you must first issue the command
module load icc
We have found that many programs run more efficiently if compiled with the Intel compilers. We recommend that you try both types of compilers and see which produces faster code for your application. The Intel compilers can only be run on warhol's login nodes.
Man pages for the GNU compilers are available with the commands man gfortran, man gcc and man g++. Once you load the appropriate module, the man pages for ifort, icc and icpc are available.
The Portable Batch System (PBS) controls all access to warhol's compute nodes, for both batch and interactive jobs. Currently warhol has two queues: the batch queue and the debug queue. The debug queue is always on. Interactive jobs can run in the debug and batch queues and the method for doing so is discussed below.
The batch queue is currently a strictly FIFO queue. The maximum time limit is 48 hours and the maximum number of cores you can request is 56. The debug queue is also a strictly FIFO queue. Its maximum time limit is 30 minutes and the maximum number of cores you can request is 8.
Sample MPI batch job
To run a batch job on warhol you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an MPI program is
#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun ./mympi
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command in your batch script, then you must include the -l option to your shell command.
The four #PBS lines are PBS directives.
- #PBS -l nodes=2:ppn=8
This directive, along with the -np option to the mpirun command, determines how your processes are allocated across your nodes. The value of nodes indicates the total number of nodes to allocate to your job. The value of nodes must be between 1 and 8. The value of ppn indicates the number of processes to allocate on a node before moving on to the allocation of processes on your next node. The value of ppn must be between 1 and 8. The value of the -np option to mpirun is the total number of processes to allocate. The default value for -np is your value for nodes times your value for ppn. You will probably often use the default value for -np.
For example, suppose you want to allocate 16 processes on 2 nodes in a block manner, which means your first 8 processes are allocated to your first node and your second 8 processes are allocated to your second node. Then you would use the nodes and ppn values given in the sample script and omit the -np option to mpirun.
However, if you want to allocate these 16 processes in a cyclic manner then you would use the PBS specification
#PBS -l nodes=2:ppn=1
and you would give the -np option to mpirun a value of 16. This would allocate your first process to your first node, your second process to your second node, your third process to your first node, your fourth process to your second node, and so on, until all 16 processes are allocated. You must use the -np option to mpirun or the system will think you only want to allocate 2 processes.
You may want to allocate fewer than 8 processes per node so you have fewer processes dividing up the 16 Gbytes of memory available on a node. For example, the PBS specification
#PBS -l nodes=2:ppn=4
would allocate only 4 processes for each of your two nodes, if you do not use the -np option to mpirun. Since jobs do not share nodes, you will still pay for the entire node even though you are not using all 8 cores on the node, but you do have access to the entire memory on the node.
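To see why nodes and ppn interact this way, it helps to know that PBS expands your request into $PBS_NODEFILE, with one line per process slot, grouped by node. The sketch below mimics that expansion for nodes=2:ppn=8; the hostnames are hypothetical.

```shell
# Mimic PBS's expansion of nodes=2:ppn=8 into a nodefile:
# ppn consecutive lines for the first node, then ppn for the next, and so on.
nodes=2; ppn=8
> nodefile
n=1
while [ $n -le $nodes ]; do
  i=1
  while [ $i -le $ppn ]; do
    echo "node$n" >> nodefile    # hypothetical hostname
    i=$((i+1))
  done
  n=$((n+1))
done
wc -l < nodefile   # 16 slots: nodes x ppn
```

mpirun hands out ranks in the order the slots appear in this file, which is why ppn=8 gives a block allocation while ppn=1 with -np 16 gives a cyclic one.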
- #PBS -l walltime=5:00
The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
- #PBS -q batch
The final PBS directive requests that your job be run in the batch queue. To request the debug queue you would replace 'batch' by 'debug'.
The remaining lines in the script are comments and command lines.
- set echo
This command causes your batch output to display each command next to its corresponding output, which makes your job easier to debug. If you are using the Bourne shell or one of its descendants, use set -x instead.
- Comment lines
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- mpirun ./mympi
This command launches your executable on warhol's compute nodes. You must use mpirun to run your MPI executable or it will run on a frontend node and degrade overall system performance.
Sample MVAPICH2 MPI batch job
If you are using the MVAPICH2 flavor of MPI you must make some modifications to the above script, whether you are using the GNU or Intel compilers to create your MPI executable. A sample revised script is
#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#start the MPI daemon
module swap openmpi_gcc mvapich2_intel
mpdboot -n 2 -f $PBS_NODEFILE
mpdtrace
#run my executable
mpirun ./mympi
#shut down the MPI daemon
mpdallexit
The extra commands before and after your mpirun command are necessary to start and shut down the MVAPICH2 MPI daemon. The value of the -n option to the mpdboot command must be the number of nodes you requested with your PBS nodes specification.
In addition, before you submit your job you must create a file named .mpd.conf in your home directory. The contents of the file should be
secretword=XXXXXXXX
For 'XXXXXXXX' you can substitute any string.
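For example, you could create the file as follows. Note that mpd refuses to start if the file is readable by anyone but you, and that the exact keyword (secretword versus MPD_SECRETWORD) has varied between MPICH2/MVAPICH2 versions, so check your installation's documentation if mpdboot complains.

```shell
# Create the mpd configuration file with owner-only permissions.
# The keyword shown is an assumption; some MPD versions use MPD_SECRETWORD.
echo "secretword=XXXXXXXX" > ~/.mpd.conf
chmod 600 ~/.mpd.conf
```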
Sample OpenMP batch job
A sample job script to run an OpenMP program is
#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myopenmp .
#set number of OpenMP threads
setenv OMP_NUM_THREADS 8
#run my executable
./myopenmp
You can only run an OpenMP program on warhol on one node.
Sample hybrid batch job
A sample script to run a hybrid MPI and OpenMP program is
#!/bin/csh
#PBS -l nodes=4:ppn=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myhybrid .
#set number of OpenMP threads
setenv OMP_NUM_THREADS 8
#run my executable
mpirun ./myhybrid
This job assumes you have an MPI program with 4 ranks. The job will distribute one rank per node. Each rank will generate 8 OpenMP threads. The OpenMP threads cannot communicate across nodes, but the MPI ranks can.
If you want to run on a different number of nodes or have a different number of ranks per node or generate a different number of OpenMP threads per node, you must adjust the values of nodes, ppn and OMP_NUM_THREADS. For example, a specification of
#PBS -l nodes=2:ppn=2
combined with setting OMP_NUM_THREADS as follows
setenv OMP_NUM_THREADS 4
is suitable for a program that runs 2 MPI ranks on each of 2 nodes, with 4 OpenMP threads being generated on each node.
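A quick way to check such a layout is to multiply the pieces out and confirm that the threads on a node do not exceed its 8 cores:

```shell
# Hybrid layout check: ranks = nodes x ppn,
# threads per node = ppn x OMP_NUM_THREADS.
nodes=2; ppn=2; omp_threads=4
ranks=$((nodes * ppn))
threads_per_node=$((ppn * omp_threads))
echo "$ranks MPI ranks, $threads_per_node threads per node"
```

Here the 8 threads on each node exactly fill that node's 8 cores.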
After you create your batch script you submit it to PBS with the qsub command.
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample script and submit the script with the command
qsub -l nodes=2:ppn=8 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
A form of interactive access is available on warhol by using the -I option to qsub. For example, the command
qsub -I -q debug -l nodes=1:ppn=8 -l walltime=5:00
requests interactive access to 8 cores for 5 minutes in the debug queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on warhol.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates the grant to which this job should be charged. This option is useful if you have more than one grant.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
- after: this job can be scheduled after job jobid begins execution.
- afterok: this job can be scheduled after job jobid finishes successfully.
- afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
- afterany: this job can be scheduled after job jobid finishes in any state.
- before: this job must begin execution before job jobid can be scheduled.
- beforeok: this job must finish successfully before job jobid begins.
- beforenotok: this job must finish unsuccessfully before job jobid begins.
- beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
Sample serial job and serial job packing
A sample script to run a serial job is
#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myserial .
#run my executable
./myserial datafile1
A possible problem with this script is that you are running only one execution on your node, but you are paying for the entire node. If this one execution does not need all the memory on the node, you may want to pack more executions onto it. The following script
#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myserial .
#run my executables
numactl --physcpubind=0 ./myserial input1 output1&
numactl --physcpubind=1 ./myserial input2 output2&
numactl --physcpubind=2 ./myserial input3 output3&
numactl --physcpubind=3 ./myserial input4 output4&
numactl --physcpubind=4 ./myserial input5 output5&
numactl --physcpubind=5 ./myserial input6 output6&
numactl --physcpubind=6 ./myserial input7 output7&
numactl --physcpubind=7 ./myserial input8 output8&
wait
will run 8 executions on your single node. The numactl command is necessary to ensure that each execution runs on its own core.
Packing MPI jobs
You may also want to pack multiple MPI runs in a single job, either so you do not waste cores or for convenience. For example, suppose you have 4 MPI runs that each use 2 cores. You could run each on its own node, but then you would waste 6 cores per node--and you have to pay for those cores whether you use them or not--or you could execute all 4 runs on a single node.
The following script will do this.
#!/bin/csh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch
# set the number of cores each run will use
# this number times your number of MPI runs must equal
# the total cores requested
set nc = 2
# create a machinefile for each MPI run
# nodes_00 will be the first machinefile, nodes_01 will be the second and so on
split -l $nc -d $PBS_NODEFILE nodes_
#run your MPI programs with mpirun
mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
mpirun -np $nc -machinefile nodes_02 affinity ./a.out input3 output3 &
mpirun -np $nc -machinefile nodes_03 affinity ./a.out input4 output4 &
wait
A machinefile lists all the cores assigned to an execution. The split command takes the machinefile for your entire job, which is created by the system, and creates a separate machinefile for each of your MPI runs. The affinity command ensures that each MPI run stays fixed throughout its execution to the cores to which it is initially assigned. Each of your runs will then run on its own set of cores. This example, and the examples below, assume that each of your runs uses the same number of cores. You will have to modify this script appropriately if that is not the case in your situation.
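You can see what split does outside of a job by feeding it a mock machinefile; the real $PBS_NODEFILE is created by PBS, and the hostname below is hypothetical.

```shell
# Build a mock machinefile like the one PBS creates for nodes=1:ppn=8.
i=0
> mock_nodefile
while [ $i -lt 8 ]; do
  echo "n1" >> mock_nodefile   # hypothetical hostname
  i=$((i+1))
done
# Split it into 2-line machinefiles, one per 2-core MPI run.
nc=2
split -l $nc -d mock_nodefile nodes_
ls nodes_0*        # lists nodes_00 through nodes_03
wc -l < nodes_00   # each machinefile holds nc lines
```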
This method will also work if you are spreading your MPI runs across more than one node. In this next example you have 4 4-core runs and you want to spread them across 2 nodes.
#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch
# set the number of cores each run will use
# this number times your number of MPI runs must equal
# the total cores requested
set nc = 4
# create a machinefile for each MPI run
# nodes_00 will be the first machinefile, nodes_01 will be the second and so on
split -l $nc -d $PBS_NODEFILE nodes_
#run your MPI programs with mpirun
mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
mpirun -np $nc -machinefile nodes_02 affinity ./a.out input3 output3 &
mpirun -np $nc -machinefile nodes_03 affinity ./a.out input4 output4 &
wait
Like the previous job, this job has 4 MPI runs, but they are spread across 2 nodes. In neither case do you waste cores.
In the final example, you have 2 16-core runs and you want to pack them in a single job for convenience.
#!/bin/csh
#PBS -l nodes=4:ppn=8
#PBS -l walltime=30:00
#PBS -j oe
#PBS -q batch
# set the number of cores each run will use
# this number times your number of MPI runs must equal
# the total cores requested
set nc = 16
# create a machinefile for each MPI run
# nodes_00 will be the first machinefile, nodes_01 will be the second
split -l $nc -d $PBS_NODEFILE nodes_
#run your MPI programs with mpirun
mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
wait
Each run will run on 2 nodes, but you need to submit only 1 job.
Monitoring and Killing Jobs
The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of nodes and cores requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.
The qdel command is used to kill queued and running jobs. An example is the command qdel 54321, where 54321 is the jobid of the job you want to kill, as shown in the output of qstat -a.
The Module Command
To run many software packages, paths and other environment variables must often first be set. To change versions of a package, these definitions must often be modified. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module document.
As a user of warhol, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently.
You will also periodically receive email from PSC with information about warhol. To ensure that you receive this email, make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
Acknowledgement in Publications
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on warhol be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.
Reporting a Problem
Contact PSC User Services to report any problems on Warhol.