Warhol

System Configuration

Hardware

Warhol is an 8-node Hewlett-Packard BladeSystem c3000. Each node has 2 Intel E5440 quad-core 2.83 GHz processors, for a total of 64 cores on the machine. The 8 cores on a node share 16 Gbytes of memory. The nodes are interconnected by an InfiniBand communications link. Warhol runs a version of the CentOS Linux operating system.

There are multiple frontend nodes, which also have Intel E5440 processors and which run the same version of CentOS Linux as the compute nodes. You log in to one of these frontend nodes, not to the compute nodes.

Software

GNU and Intel C, C++ and Fortran compilers are installed on warhol, as are the facilities to enable you to run MPI and OpenMP programs.

Access to Warhol

Getting an account on warhol

Warhol is available to academic researchers in Pennsylvania, as well as private sector and government researchers, even those not in Pennsylvania. If you are affiliated with an academic institution located in Pennsylvania, information about applying for a warhol grant is available online. If you are a government or private sector researcher, send email to corp-relations@psc.edu to inquire about getting an account on warhol.

Connecting to warhol

To connect to warhol you must ssh to warhol.psc.edu. When you are prompted for a password enter your PSC Kerberos password.

Changing your password

Use the kpasswd command to change your PSC Kerberos password, not the passwd command. You have the same password on all PSC production platforms. If you change your password on one PSC system using kpasswd you change it on all other PSC systems.

PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least 3 of the character classes:

  1. lower-case letters
  2. upper-case letters
  3. digits
  4. special characters, excluding ' and "

Finally, they must not be the same as any of your previous passwords.
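The length and character-class rules above can be checked locally before you run kpasswd. The following is only a sketch for a Bourne-style shell: the function name check_psc_password is hypothetical, it assumes that ' and " may appear in a password but do not count as special characters, and it cannot check the rule about previous passwords.

```shell
# check_psc_password CANDIDATE
# Exit status 0 if CANDIDATE meets the length and character-class rules,
# 1 otherwise. Password history cannot be checked locally.
check_psc_password() {
    pw=$1
    classes=0

    # Rule: at least 8 characters.
    [ "${#pw}" -ge 8 ] || return 1

    # Assumption: ' and " do not count as special characters, so they are
    # removed before looking for the special-character class.
    stripped=$(printf '%s' "$pw" | tr -d "\"'")

    # Count how many of the four character classes are present.
    case $pw in *[[:lower:]]*) classes=$((classes + 1)) ;; esac
    case $pw in *[[:upper:]]*) classes=$((classes + 1)) ;; esac
    case $pw in *[[:digit:]]*) classes=$((classes + 1)) ;; esac
    case $stripped in *[![:alnum:]]*) classes=$((classes + 1)) ;; esac

    # Rule: characters from at least 3 of the 4 classes.
    [ "$classes" -ge 3 ]
}
```

For example, check_psc_password 'abcDEF12' succeeds (lower-case, upper-case and digits), while check_psc_password 'abcdefgh' fails because only one class is present.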

You must change your warhol password within 30 days of the date on your initial password form or your password will be disabled. We will also disable your password if you do not change it at least once a year. We will send you an email notice warning you that your password is about to be disabled in the latter case. See the PSC password policies for more information. If your password is disabled send email to remarks@psc.edu to have it reset.

Changing your login shell

You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.

Accounting on warhol

One core-hour on warhol is one SU.

If you have more than one account, use the qsub option -W group_list to indicate to which account you want a job to be charged. The use of this option is discussed in the "Other qsub options" subsection of this document. To change your default account you must send email to remarks@psc.edu with this request.
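The "Other qsub options" subsection describes how to find your charge-ids by looking at /etc/passwd and /etc/group. A minimal sketch of that lookup for a Bourne-style shell follows. The function names are hypothetical, and the optional file arguments exist only so the functions can be tried on sample files.

```shell
# default_charge_id USER [PASSWD_FILE] [GROUP_FILE]
# Prints the group name that matches the group-id in USER's passwd entry.
default_charge_id() {
    user=$1
    passwd_file=${2:-/etc/passwd}
    group_file=${3:-/etc/group}

    # The fourth colon-separated field of the passwd entry is the group-id.
    gid=$(awk -F: -v u="$user" '$1 == u { print $4; exit }' "$passwd_file")
    [ -n "$gid" ] || return 1

    # The group entry whose third field is that group-id names the charge-id.
    awk -F: -v g="$gid" '$3 == g { print $1; exit }' "$group_file"
}

# valid_charge_ids USER [GROUP_FILE]
# Prints every group whose member list (fourth field) contains USER.
valid_charge_ids() {
    user=$1
    group_file=${2:-/etc/group}
    awk -F: -v u="$user" '{
        n = split($4, members, ",")
        for (i = 1; i <= n; i++)
            if (members[i] == u) { print $1; break }
    }' "$group_file"
}
```

A job could then be charged to one of the listed ids with a command such as qsub -W group_list=abc123 myscript.job, where abc123 is a made-up charge-id.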

User accounting data is available with the xbanner command. Account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant are displayed.

Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to grants PIs. The system has extensive internal documentation.

Storing Files

File Systems

File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on warhol.

$HOME

This is your home directory. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all of warhol's compute and frontend nodes. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to golem. Golem, PSC's file archival system, is discussed below.

$SCRATCH

This is warhol's scratch area to be used as a working space for your running jobs. This area has 4 Tbytes of space. $SCRATCH is visible to all of warhol's compute and frontend nodes. You should use the name $SCRATCH to refer to your scratch area since we may change its implementation.

$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 7 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to golem as soon as you can after you create them. Golem, PSC's file archival system, is discussed below.
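Because $SCRATCH files are deleted after 7 days, it can help to list the files that are getting close to that age so you can copy them elsewhere first. The sketch below is for a Bourne-style shell. The function name and the 5-day default are hypothetical, and it assumes file age is measured by modification time, which this document does not specify.

```shell
# list_old_scratch_files DIR [DAYS]
# Prints regular files under DIR last modified more than DAYS days ago.
list_old_scratch_files() {
    dir=$1
    days=${2:-5}
    find "$dir" -type f -mtime +"$days" -print
}

# Typical use on warhol (not run here):
#   list_old_scratch_files "$SCRATCH" 5
```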

File Repositories

File repositories are file storage spaces which are not directly connected to a frontend or compute processor. You cannot, for example, open a file that resides in a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on warhol: golem, PSC's file archival system.

golem

Golem is a combination tape-and-disk archival system. The far program should be used to transfer files between golem and warhol. You should transfer files between golem and warhol outside of your batch jobs. Otherwise your jobs will be holding compute processors while your files are being transferred. You can use scp or kftp to transfer files between golem and your remote machine. If you need to store a file to golem that is 2 Tbytes or larger, or if you are going to store more than 500 Gbytes of data in a day, send email to remarks@psc.edu so that special arrangements can be made to store your files.

Transferring Files

You can use either the scp or the kftp program to transfer files between your remote machine and warhol and between your remote machine and golem. Which method will perform better varies based on location. Therefore you should try both approaches and see which performs better for you. If you want assistance in improving the performance of your file transfers send email to remarks@psc.edu.
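One simple way to compare the two approaches is to time the same transfer with each. The following Bourne-style shell sketch wraps any command and reports its elapsed wall-clock time. The function name time_transfer, the file name and the user name are hypothetical. No kftp invocation is shown because its options are not described in this document.

```shell
# time_transfer COMMAND [ARGS...]
# Runs the command, then prints the elapsed seconds and the exit status.
time_transfer() {
    start=$(date +%s)
    if "$@"; then
        status=0
    else
        status=$?
    fi
    end=$(date +%s)
    echo "$((end - start)) seconds (exit status $status): $*"
    return "$status"
}

# Example use from your remote machine (not run here):
#   time_transfer scp bigfile.dat username@warhol.psc.edu:
```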

Creating Programs

GNU and Intel C, C++ and Fortran compilers are installed on warhol and they can be used to create MPI and OpenMP programs. The commands you should use to create your programs are shown in the table below.

Compiler       MPI               OpenMP                          Hybrid                        Serial
GNU Fortran    mpif90 mympi.f90  gfortran -fopenmp myopenmp.f90  mpif90 -fopenmp myhybrid.f90  gfortran myserial.f90
GNU C          mpicc mympi.c     gcc -fopenmp myopenmp.c         mpicc -fopenmp myhybrid.c     gcc myserial.c
GNU C++        mpiCC mympi.C     g++ -fopenmp myopenmp.C         mpiCC -fopenmp myhybrid.C     g++ myserial.C
Intel Fortran  mpif90 mympi.f90  ifort -openmp myopenmp.f90      mpif90 -openmp myhybrid.f90   ifort myserial.f90
Intel C        mpicc mympi.c     icc -openmp myopenmp.c          mpicc -openmp myhybrid.c      icc myserial.c
Intel C++      mpiCC mympi.C     icpc -openmp myopenmp.C         mpiCC -openmp myhybrid.C      icpc myserial.C

Three flavors of MPI are available on warhol: OpenMPI, MVAPICH and MVAPICH2. Which flavor you use is determined by which MPI module you have loaded. The default module is the openmpi_gcc module, which is for creating OpenMPI programs using the GNU compilers. If you want to use OpenMPI and the Intel compilers you should issue the command

    module swap openmpi_gcc openmpi_intel

before you build your executable. If you want to use MVAPICH you should issue the command

    module swap openmpi_gcc mvapich_gcc

or

    module swap openmpi_gcc mvapich_intel

depending on whether you want to use the GNU or Intel compilers. If you want to use MVAPICH2 you should issue the command

    module swap openmpi_gcc mvapich2_gcc

or

    module swap openmpi_gcc mvapich2_intel

depending on whether you want to use the GNU or Intel compilers.
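The six module names above follow a flavor_compiler pattern, which the sketch below uses to print the matching swap command for a Bourne-style shell. The function names are hypothetical and the swap command is printed rather than executed, so the sketch can be tried on any machine.

```shell
# mpi_module FLAVOR COMPILER
# FLAVOR is openmpi, mvapich or mvapich2; COMPILER is gnu (or gcc) or intel.
mpi_module() {
    case $1 in
        openmpi|mvapich|mvapich2) flavor=$1 ;;
        *) echo "unknown MPI flavor: $1" >&2; return 1 ;;
    esac
    case $2 in
        gnu|gcc) suffix=gcc ;;
        intel)   suffix=intel ;;
        *) echo "unknown compiler: $2" >&2; return 1 ;;
    esac
    echo "${flavor}_${suffix}"
}

# mpi_swap_command FLAVOR COMPILER
# Prints the module swap command that replaces the default openmpi_gcc module.
mpi_swap_command() {
    target=$(mpi_module "$1" "$2") || return 1
    if [ "$target" = "openmpi_gcc" ]; then
        echo "# openmpi_gcc is the default module; no swap is needed"
    else
        echo "module swap openmpi_gcc $target"
    fi
}
```

For example, mpi_swap_command mvapich2 intel prints module swap openmpi_gcc mvapich2_intel.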

We have found that MVAPICH and MVAPICH2 perform better than OpenMPI for some applications. You should try all three flavors to see which performs best for you. MVAPICH2 supports the MPI-2 additions to the MPI standard.

The commands to create MPI programs are wrapper commands. You do not execute the compilers directly. To run the Intel Fortran compiler directly you must first issue the command

    module load ifort

To run the Intel C or C++ compilers directly you must first issue the command

    module load icc

We have found that many programs run more efficiently if compiled with the Intel compilers. We recommend that you try both types of compilers and see which produces faster code for your application. The Intel compilers can only be run on warhol's login nodes.

Man pages for the GNU compilers are available with the commands man gfortran, man gcc and man g++. Once you load the appropriate module the man pages for ifort, icc, and icpc are available.

Running Jobs

Queue structure

The Portable Batch System (PBS) controls all access to warhol's compute nodes, for both batch and interactive jobs. Currently warhol has two queues: the batch queue and the debug queue. The debug queue is always on. Interactive jobs can run in the debug and batch queues and the method for doing so is discussed below.

If you would like to make a suggestion about warhol's queue structure send email to remarks@psc.edu.

Scheduling policies

The batch queue is currently a strictly FIFO queue. The maximum time limit is 48 hours and the maximum number of cores you can request is 56. The debug queue is also a strictly FIFO queue. Its maximum time limit is 30 minutes and the maximum number of cores you can request is 8.

We will modify warhol's scheduling policies to meet user needs. If you have suggestions or comments about the scheduling policies on warhol or find that they do not meet your needs send email to remarks@psc.edu.

Sample MPI batch job

To run a batch job on warhol you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.

A sample job script to run an MPI program is

#!/bin/csh
#PBS -l nodes=2:ppn=8
#PBS -l walltime=5:00              
#PBS -j oe
#PBS -q batch

set echo

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/mympi .

#run my executable
mpirun ./mympi

The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command in your batch script, then you must include the -l option to your shell command.

The four #PBS lines are PBS directives.

#PBS -l nodes=2:ppn=8

This directive, along with the -np option to the mpirun command, determines how your processes are allocated across your nodes. The value of nodes indicates the total number of nodes to allocate to your job. The value of nodes must be between 1 and 8. The value of ppn indicates the number of processes to allocate on a node before moving on to the allocation of processes on your next node. The value of ppn must be between 1 and 8. The value of the -np option to mpirun is the total number of processes to allocate. The default value for -np is your value for nodes times your value for ppn. You will probably often use the default value for -np.

For example, suppose you want to allocate 16 processes on 2 nodes in a block manner, which means your first 8 processes are allocated to your first node and your second 8 processes are allocated to your second node. Then you would use the nodes and ppn values given in the sample script and omit the -np option to mpirun.

However, if you want to allocate these 16 processes in a cyclic manner then you would use the PBS specification

    #PBS -l nodes=2:ppn=1

and you would give the -np option to mpirun a value of 16. This would allocate your first process to your first node, your second process to your second node, your third process to your first node, your fourth process to your second node, and so on, until all 16 processes are allocated. You must use the -np option to mpirun or the system will think you only want to allocate 2 processes.
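The block and cyclic layouts described above can be modeled with a little arithmetic: processes fill a node ppn at a time and then move to the next node, wrapping around when the last node is reached. The Bourne-style shell sketch below only simulates that description and does not query PBS or mpirun. The function names are hypothetical, and ranks and nodes are numbered from 0.

```shell
# node_for_rank RANK NODES PPN
# Prints the index of the node that process RANK lands on.
node_for_rank() {
    echo $(( ($1 / $3) % $2 ))
}

# placement NODES PPN [NP]
# Prints one line per process; NP defaults to NODES times PPN.
placement() {
    nodes=$1
    ppn=$2
    np=${3:-$((nodes * ppn))}
    rank=0
    while [ "$rank" -lt "$np" ]; do
        echo "rank $rank -> node $(node_for_rank "$rank" "$nodes" "$ppn")"
        rank=$((rank + 1))
    done
}

# Block layout:  nodes=2:ppn=8 with the default -np
#   placement 2 8
# Cyclic layout: nodes=2:ppn=1 with mpirun -np 16
#   placement 2 1 16
```

With placement 2 8, ranks 0 through 7 go to node 0 and ranks 8 through 15 go to node 1. With placement 2 1 16, the ranks alternate between node 0 and node 1.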

You may want to allocate fewer than 8 processes per node so you have fewer processes dividing up the 16 Gbytes of memory available on a node. For example, the PBS specification

    #PBS -l nodes=2:ppn=4

would allocate only 4 processes for each of your two nodes, if you do not use the -np option to mpirun. Since jobs do not share nodes, you will still pay for the entire node even though you are not using all 8 cores on the node, but you do have access to the entire memory on the node.
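Since one core-hour is one SU and a job is charged for every core on the nodes it holds, the cost of a job can be estimated from its node count and run time alone. The sketch below is for a Bourne-style shell. The function name is hypothetical, and the result is only an estimate because this document does not state how charges are rounded.

```shell
CORES_PER_NODE=8   # from the hardware description: 8 cores per node

# job_su_cost NODES MINUTES
# Prints the estimated SUs for holding NODES whole nodes for MINUTES minutes.
job_su_cost() {
    awk -v n="$1" -v m="$2" -v c="$CORES_PER_NODE" \
        'BEGIN { printf "%.2f\n", n * c * m / 60 }'
}
```

For example, job_su_cost 2 60 prints 16.00: two nodes for one hour cost 16 SUs whether ppn is 4 or 8.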

#PBS -l walltime=5:00

The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
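A small helper can build a walltime string from a number of seconds. This Bourne-style shell sketch assumes that the rule about leading zeroes applies to the first field only, as in the 5:00 and 30:00 values used in this document. The function name is hypothetical.

```shell
# walltime_from_seconds SECONDS
# Prints H:MM:SS when there is an hours part, otherwise M:SS.
walltime_from_seconds() {
    total=$1
    h=$((total / 3600))
    m=$(( (total % 3600) / 60 ))
    s=$((total % 60))
    if [ "$h" -gt 0 ]; then
        printf '%d:%02d:%02d\n' "$h" "$m" "$s"
    else
        printf '%d:%02d\n' "$m" "$s"
    fi
}
```

For example, walltime_from_seconds 300 prints 5:00 and walltime_from_seconds 5400 prints 1:30:00.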

#PBS -j oe

The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.

#PBS -q batch

The final PBS directive requests that your job be run in the batch queue. To request the debug queue you would replace 'batch' by 'debug'.

The remaining lines in the script are comments and command lines.

set echo

This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants use

set -x

instead.

Comment lines

The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.

mpirun ./mympi

This command launches your executable on warhol's compute nodes. You must use mpirun to run your MPI executable or it will run on a frontend node and degrade overall system performance.

Sample OpenMP batch job

A sample job script to run an OpenMP program is

#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00              
#PBS -j oe
#PBS -q batch

set echo

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/myopenmp .

#set number of OpenMP threads
setenv OMP_NUM_THREADS 8

#run my executable
./myopenmp

You can only run an OpenMP program on warhol on one node.

Sample hybrid batch job

A sample script to run a hybrid MPI and OpenMP program is

#!/bin/csh
#PBS -l nodes=4:ppn=1
#PBS -l walltime=5:00              
#PBS -j oe
#PBS -q batch

set echo

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/myhybrid .

#set number of OpenMP threads
setenv OMP_NUM_THREADS 8

#run my executable
mpirun ./myhybrid

This job assumes you have an MPI program with 4 ranks. The job will distribute one rank per node. Each rank will generate 8 OpenMP threads. The OpenMP threads cannot communicate across nodes, but the MPI ranks can.

If you want to run on a different number of nodes or have a different number of ranks per node or generate a different number of OpenMP threads per rank, you must adjust the values of nodes, ppn and OMP_NUM_THREADS. For example, a specification of

    #PBS -l nodes=2:ppn=2

combined with setting OMP_NUM_THREADS as follows

    setenv OMP_NUM_THREADS 4

is suitable for a program that runs 2 MPI ranks on each of 2 nodes, with 4 OpenMP threads being generated by each rank.
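Because OMP_NUM_THREADS applies to each process, the number of threads on a node is ppn times OMP_NUM_THREADS. The Bourne-style shell sketch below prints that product and checks it against the 8 cores on a node. The function name is hypothetical, and the check is a suggested sanity test rather than a documented PBS limit.

```shell
CORES_PER_NODE=8   # from the hardware description: 8 cores per node

# hybrid_layout NODES PPN THREADS
# Prints the rank and thread counts. The exit status is 0 only if the
# threads on a node do not exceed the cores on a node.
hybrid_layout() {
    nodes=$1
    ppn=$2
    threads=$3
    per_node=$((ppn * threads))
    echo "$((nodes * ppn)) ranks, $per_node threads per node"
    [ "$per_node" -le "$CORES_PER_NODE" ]
}
```

Both hybrid_layout 4 1 8 and hybrid_layout 2 2 4 report 4 ranks and 8 threads per node. hybrid_layout 2 2 8 returns a failing status because it would place 16 threads on 8 cores.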

Qsub command

After you create your batch script you submit it to PBS with the qsub command.

    qsub myscript.job

Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.

You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample script and submit the script with the command

    qsub -l nodes=2:ppn=8  -l walltime=5:00 -j oe -q batch myscript.job

Command-line directives override directives in your scripts.

Interactive access

A form of interactive access is available on warhol by using the -I option to qsub. For example, the command

    qsub -I -q debug -l nodes=1:ppn=8 -l walltime=5:00

requests interactive access to 8 cores for 5 minutes in the debug queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.

When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI program you must use the mpirun command just as you would in a batch script.

When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.

Other qsub options

Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.

-m a|b|e|n
Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
-M userlist
Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
-v variable_list
This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on warhol.
-r y|n
Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
-W group_list=charge_id
Indicates to which charge-id you want a job to be charged. You can see your valid charge-ids by grepping for your entry in the /etc/group file. You replace 'charge_id' in the above option by the charge-id you want your job to be charged to. Your default charge-id is indicated by the group field in your entry in the /etc/passwd file. The fourth field in your entry in the /etc/passwd file is your group-id. If you grep for this number in the /etc/group file the first field of the output is your default charge-id. If you want to switch your default charge-id send email to remarks@psc.edu. If you only have one grant on warhol you do not need to use this option. This option can only be specified as a command-line option.
-W depend=dependency:jobid
Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after        this job can be scheduled after job jobid begins execution.
afterok      this job can be scheduled after job jobid finishes successfully.
afternotok   this job can be scheduled after job jobid finishes unsuccessfully.
afterany     this job can be scheduled after job jobid finishes in any state.
before       this job must begin execution before job jobid can be scheduled.
beforeok     this job must finish successfully before job jobid begins.
beforenotok  this job must finish unsuccessfully before job jobid begins.
beforeany    this job must finish in any state before job jobid begins.

Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
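A common use of the depend option is to submit a second job that runs only if a first job finishes successfully. The Bourne-style shell sketch below relies on qsub printing the jobid of the submitted job on standard output. The function name, the QSUB variable and the script names are hypothetical; QSUB exists only so the function can be tried with a stand-in command on a machine without PBS.

```shell
# Command used to submit jobs; override QSUB to test without PBS.
QSUB=${QSUB:-qsub}

# submit_chain FIRST_SCRIPT SECOND_SCRIPT
# Submits FIRST_SCRIPT, then submits SECOND_SCRIPT so that it can be
# scheduled only after the first job finishes successfully.
submit_chain() {
    first_id=$("$QSUB" "$1") || return 1
    "$QSUB" -W depend=afterok:"$first_id" "$2"
}

# Example use on warhol (not run here):
#   submit_chain prepare.job analyze.job
```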

Sample serial job and serial job packing

A sample script to run a serial job is

#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00              
#PBS -j oe
#PBS -q batch

set echo

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/myserial .

#run my executable
./myserial datafile1

A possible problem with this script is that you are only running one execution on your node, but you are going to pay for the entire node. If you do not need all the memory on the node for this one execution, you may want to pack more executions on this one node in your job. For example, the script

#!/bin/csh
#PBS -l nodes=1
#PBS -l walltime=5:00              
#PBS -j oe
#PBS -q batch

set echo

#move to my $SCRATCH directory
cd $SCRATCH

#copy executable to $SCRATCH
cp $HOME/myserial .

#run my executables
numactl --physcpubind=0 ./myserial input1 output1&
numactl --physcpubind=1 ./myserial input2 output2&
numactl --physcpubind=2 ./myserial input3 output3&
numactl --physcpubind=3 ./myserial input4 output4&
numactl --physcpubind=4 ./myserial input5 output5&
numactl --physcpubind=5 ./myserial input6 output6&
numactl --physcpubind=6 ./myserial input7 output7&
numactl --physcpubind=7 ./myserial input8 output8&
wait

will run 8 executions on your single node. The numactl command is necessary to ensure that each execution runs on its own core.
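If you pack many executions, the numactl lines can be generated rather than typed by hand. The sketch below is written for a Bourne-style shell, not the C-shell used in the sample scripts, and it only prints the command lines so that you can inspect them before using them. The function name is hypothetical, and the input and output file names follow the numbering pattern of the sample script.

```shell
# print_packed_commands PROGRAM [COUNT]
# Prints one numactl line per core, numbered from core 0, then a wait line.
print_packed_commands() {
    prog=$1
    count=${2:-8}
    core=0
    while [ "$core" -lt "$count" ]; do
        n=$((core + 1))
        echo "numactl --physcpubind=$core $prog input$n output$n &"
        core=$((core + 1))
    done
    echo "wait"
}
```

print_packed_commands ./myserial 8 prints the same eight numactl lines and the final wait used in the sample script above.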

Packing MPI jobs

You may also want to pack multiple MPI runs in a single job, either so you do not waste cores or for convenience. For example, suppose you have 4 MPI runs that each use 2 cores. You could run each on its own node, but then you would waste 6 cores per node--and you have to pay for those cores whether you use them or not--or you could execute all 4 runs on a single node.

The following script will do this.

  #!/bin/csh 
  #PBS -l nodes=1:ppn=8
  #PBS -l walltime=30:00
  #PBS -j oe
  #PBS -q batch

  # set the number of cores each run will use
  # this number times your number of MPI runs must equal
  # the total cores requested
  set nc = 2

  # create a machinefile for each MPI run
  # nodes_00 will be the first machinefile, nodes_01 will be the second and so on
  split -l $nc -d $PBS_NODEFILE nodes_

  #run your MPI programs with mpirun
  mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
  mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
  mpirun -np $nc -machinefile nodes_02 affinity ./a.out input3 output3 &
  mpirun -np $nc -machinefile nodes_03 affinity ./a.out input4 output4 &
  wait

A machinefile lists all the cores assigned to an execution. The split command creates from the machinefile for your entire job, which is created by the system, a machinefile for each of your MPI runs. The affinity command ensures that each MPI run is fixed throughout its execution to the cores to which it is initially assigned. Then each of your runs will run on its own set of cores. This example, and the examples below, assume that each of your runs uses the same number of cores. You will have to modify this script appropriately if that is not the case in your situation.
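To see what the split step produces, you can try it on a mock machinefile outside of a batch job. The Bourne-style shell sketch below assumes that the system machinefile has one line per core. The host name node01 and the file name mock_nodefile are made up, and the -d option requires a split that supports numeric suffixes, as the job scripts in this section already do.

```shell
# Build a mock machinefile for one 8-core node in a temporary directory.
workdir=$(mktemp -d)
nodefile=$workdir/mock_nodefile
i=0
while [ "$i" -lt 8 ]; do
    echo "node01"      # hypothetical host name, one line per core
    i=$((i + 1))
done > "$nodefile"

# Split it into machinefiles of nc lines each: nodes_00, nodes_01, ...
nc=2
split -l "$nc" -d "$nodefile" "$workdir/nodes_"

# Show each per-run machinefile and the cores it contains.
for f in "$workdir"/nodes_*; do
    echo "$(basename "$f"): $(tr '\n' ' ' < "$f")"
done
```

With 8 lines and nc set to 2, split creates four files, nodes_00 through nodes_03, one for each of the four mpirun commands in the script above.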

This method will also work if you are spreading your MPI runs across more than one node. In this next example you have 4 4-core runs and you want to spread them across 2 nodes.

  #!/bin/csh 
  #PBS -l nodes=2:ppn=8
  #PBS -l walltime=30:00
  #PBS -j oe
  #PBS -q batch

  # set the number of cores each run will use
  # this number times your number of MPI runs must equal
  # the total cores requested
  set nc = 4

  # create a machinefile for each MPI run
  # nodes_00 will be the first machinefile, nodes_01 will be the second and so on
  split -l $nc -d $PBS_NODEFILE nodes_

  #run your MPI programs with mpirun
  mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
  mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
  mpirun -np $nc -machinefile nodes_02 affinity ./a.out input3 output3 &
  mpirun -np $nc -machinefile nodes_03 affinity ./a.out input4 output4 &
  wait

Like the previous job, this job has 4 MPI runs, but they are spread across 2 nodes. In neither case do you waste cores.

In the final example, you have 2 16-core runs and you want to pack them in a single job for convenience.

  #!/bin/csh 
  #PBS -l nodes=4:ppn=8
  #PBS -l walltime=30:00
  #PBS -j oe
  #PBS -q batch

  # set the number of cores each run will use
  # this number times your number of MPI runs must equal
  # the total cores requested
  set nc = 16

  # create a machinefile for each MPI run
  # nodes_00 will be the first machinefile, nodes_01 will be the second
  split -l $nc -d $PBS_NODEFILE nodes_

  #run your MPI programs with mpirun
  mpirun -np $nc -machinefile nodes_00 affinity ./a.out input1 output1 &
  mpirun -np $nc -machinefile nodes_01 affinity ./a.out input2 output2 &
  wait

Each run will run on 2 nodes, but you need submit only 1 job.

Monitoring and Killing Jobs

The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.

The qdel command is used to kill queued and running jobs. An example is the command

    qdel 54

The argument to qdel is the jobid of the job you want to kill. The jobid is displayed when you submit your job, and you can also get it with the qstat command. If you cannot kill a job you want to kill send email to remarks@psc.edu.

Software Packages

A list of software packages available on warhol is available online. If you would like us to install a package that is not in this list send email to remarks@psc.edu.

The Module Command

Before many software packages can be run, paths and other variables must first be set. To change versions of a package, these definitions must often be modified. The module command makes this process easier. For use of the module command, including its use in batch jobs, see

    http://www.psc.edu/general/software/packages/module/

Stay Informed

As a user of warhol, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently. In addition, important system information is posted to the PSC's Web page of bboard posts.

You will also periodically receive email from PSC with information about warhol. To ensure that you receive this email, you should make sure your email forwarding is set properly by following the instructions for setting your email forwarding.

Acknowledgement in Publications

PSC requests that a copy of any publication (preprint or reprint) resulting from research done on warhol be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.

Reporting a Problem

You have two options for reporting problems on warhol.

  • You can call the User Services Hotline at 1-800-221-1641 from 9:00 a.m. until 8:00 p.m., Eastern time, on weekdays, and from 9:00 a.m. until 4:00 p.m., Eastern time, on Saturdays.

  • You can send email to remarks@psc.edu.