Pople
- System Configuration
- Access to Pople
- Getting an account on pople
- Connecting to pople
- Changing your PSC Kerberos password
- Changing your login shell
- Accounting on pople
- Storing Files
- Transferring files
- Improving your file transfer performance
- Creating Programs
- Running Jobs
- Queue structure
- Scheduling policies
- Sample batch jobs
- Qsub command
- Flexible walltime requests
- How to improve your turnaround
- Interactive access
- Other qsub options
- Using the module command in a batch script
- Using the module command in an interactive job
- Monitoring and Killing Jobs
- Debugging
- Improving Performance
- Calculating Mflops
- The perfmon utility
- Cache performance
- Collecting timing data
- IO optimization
- Placing processes for an MPI program
- Placing processes for an OpenMP program
- Third-party software
- Performance monitoring tools
- Assistance with improving performance
- Software Packages
- The Module Command
- Pople and the TeraGrid
- Stay Informed
- Acknowledgement in Publications
- Reporting a Problem
System Configuration
Hardware
Pople is an SGI Altix 4700 shared-memory NUMA system comprising 192 blades. Each blade holds 2 Itanium2 Montvale 9130M dual-core processors.
The four cores on each blade share 8 Gbytes of local memory. The processors are connected by a NUMAlink interconnect. Through this interconnect the local memory on each processor is accessible to all the other processors on the system. Pople runs an enhanced version of the SuSE Linux operating system.
There are multiple frontend processors, which are also Itanium2 processors and which run the same version of SuSE Linux as the compute processors. You login to one of these frontend processors, not to the compute processors.
Software
The Intel C, C++ and Fortran compilers and the Gnu C and C++ compilers are installed on pople, as are the facilities to enable you to run OpenMP, MPI and hybrid OpenMP and MPI programs.
Access to Pople
Getting an account on pople
There are three types of grants available on pople: startup, research and education grants. Startup grants are available as precursors to large requests or for work which will exploit the unique architectural capabilities of pople. Startup grants can ask for at most 30,000 Service Units (SUs). Research grants are large awards for users with extensive computational requirements. Education grants are for coursework on pople. To apply for any of these types of grants on pople you must fill out the online POPS proposal form.
Connecting to pople
If you have a TeraGrid grant the recommended method for connecting to pople is by using your TeraGrid-wide password after you have downloaded the necessary software components to your local machine. This method allows a single-signon for all TeraGrid resources. Thus, it is especially useful if you have allocations on more than one TeraGrid resource.
If you are not a TeraGrid user you can connect to pople by using ssh to connect to tg-login.pople.psc.teragrid.org or to pople.psc.edu. When you are prompted for a password by ssh enter your PSC Kerberos password.
Changing your PSC Kerberos password
Use the kpasswd command to change your PSC Kerberos password, not the passwd command. You have the same password on all PSC production platforms. If you change your password on one PSC system using kpasswd you change it on all other PSC systems.
PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least 3 of the character classes:
- lower-case letters
- upper-case letters
- digits
- special characters, excluding ' and "
Finally, they must not be the same as any of your previous passwords.
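The length and character-class rules above can be sketched as a small shell check. This is only an illustration: check_password is a hypothetical helper, not a PSC tool, and it assumes that ' and " simply do not count toward the special-character class. It cannot check the rule about previous passwords.

```shell
#!/bin/sh
# check_password is a hypothetical helper, not a PSC tool.
# Usage: check_password CANDIDATE
check_password() {
    pw=$1
    # Rule 1: at least 8 characters
    [ "${#pw}" -ge 8 ] || { echo "too short"; return 1; }

    # Rule 2: count how many character classes are present
    classes=0
    case $pw in *[[:lower:]]*) classes=$((classes + 1)) ;; esac
    case $pw in *[[:upper:]]*) classes=$((classes + 1)) ;; esac
    case $pw in *[[:digit:]]*) classes=$((classes + 1)) ;; esac
    # Assumption: ' and " do not count as special characters,
    # so they are removed before looking for one
    rest=$(printf '%s' "$pw" | tr -d "'\"")
    case $rest in *[![:alnum:]]*) classes=$((classes + 1)) ;; esac

    [ "$classes" -ge 3 ] || { echo "needs 3 character classes"; return 1; }
    echo "ok"
}
```

Avoid typing a real password as a command-line argument on a shared system, since arguments can be visible to other users and saved in your shell history.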
You must change your pople password within 30 days of the date on your initial password form or your password will be disabled. We will also disable your password if you do not change it at least once a year. We will send you an email notice warning you that your password is about to be disabled in the latter case. See the PSC password policies for more information. If your password is disabled send email to remarks@psc.edu to have it reset.
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on pople
One core-hour on pople is one SU. Because resources are allocated by blades and not by cores--jobs do not share blades--your SU charges will always be based on core usage that is a multiple of four. A one blade job that runs for one hour costs 4 SUs.
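As a rough illustration of the charging rule, the arithmetic can be written as a short shell function. The name su_charge is hypothetical, and whole hours are assumed to keep the integer arithmetic simple.

```shell
#!/bin/sh
# su_charge is a hypothetical helper illustrating the charging rule.
# Usage: su_charge NCPUS HOURS   (whole hours, for simplicity)
su_charge() {
    ncpus=$1
    hours=$2
    # Jobs are allocated whole blades of 4 cores each. Rounding up here
    # only shows why charges are always a multiple of four.
    blades=$(( (ncpus + 3) / 4 ))
    echo $(( blades * 4 * hours ))
}

su_charge 4 1     # one blade for one hour
su_charge 16 12   # four blades for twelve hours
```

The first call corresponds to the one-blade, one-hour job described above.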
If you have more than one account, use the qsub option -W group_list to indicate to which account you want a job to be charged. The use of this option is discussed in the "Other qsub options" subsection of this document. To change your default account you must send email to remarks@psc.edu with this request.
User accounting data is available with the xbanner command. Account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant are displayed.
Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to grants PIs. The system has extensive internal documentation.
Storing Files
File Systems
File systems are file storage spaces directly connected to a system. There are currently three such areas available to you on pople.
- $HOME
-
This is your home directory. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all of pople's compute and frontend processors. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to golem. Golem, PSC's file archival system, is discussed below.
If you have loaded the teragrid module you can also refer to your home directory as $TG_CLUSTER_HOME.
- $SCRATCH
-
This is pople's scratch area to be used as a working space for your running jobs. $SCRATCH is visible to all of pople's compute and frontend processors. You should use the name $SCRATCH to refer to your scratch area since we may change its implementation. $SCRATCH is a parallel file system.
$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 7 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to golem as soon as you can after you create them. Golem, PSC's file archival system, is discussed below. For information on improving your $SCRATCH IO performance see the section on IO optimization below.
If you have loaded the teragrid module you can also refer to your scratch directory as $TG_CLUSTER_SCRATCH.
- DC-WAN
-
This is a TeraGrid-wide filesystem hosted at Indiana University. Information about its use is available online.
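Because $SCRATCH files are deleted after 7 days, it can help to list the files that are getting old before they are removed. The sketch below uses the standard find command. The function name old_scratch_files is hypothetical, and it assumes that file age is judged by modification time, which this document does not state.

```shell
#!/bin/sh
# old_scratch_files is a hypothetical helper.
# Usage: old_scratch_files DIR DAYS
# Lists regular files under DIR last modified more than DAYS days ago.
# find counts whole 24-hour periods, so "+4" matches files that are
# at least 5 days old.
old_scratch_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example (only meaningful on a system where $SCRATCH is set):
#   old_scratch_files "$SCRATCH" 4
```

Files reported by such a check are candidates for copying to golem or to your local site.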
File Repositories
File repositories are file storage spaces which are not directly connected to a frontend or compute processor. You cannot, for example, open a file that resides in a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on pople: golem, PSC's file archival system.
- golem
-
Golem is a combination tape-and-disk archival system. The far program should be used to transfer files between golem and pople. You should transfer files between golem and pople outside of your batch jobs. Otherwise your jobs will be holding compute processors while your files are being transferred. You can use scp or kftp to transfer files between golem and your remote machine. If you need to store a file to golem that is 2 Tbytes or larger or if you are going to store more than 500 Gbytes of data in a day send email to remarks@psc.edu so that special arrangements can be made to store your files.
Transferring Files
To transfer files between your local machine and pople you can use either the scp or kftp programs or, if your local machine is on the TeraGrid, the globus-url-copy command.
You can also use DMOVER to transfer your files. DMOVER is a set of programs used to schedule and execute bulk, parallel transfers of files. Extensive information about how to use DMOVER is available online. If you have questions about the use of DMOVER send email to remarks@psc.edu.
Improving Your File Transfer Performance
Before you select a file transfer method you should perform two steps. First, there are certain TCP tuning operations that will improve your file transfer performance no matter which of the three file transfer methods described above you choose. You will probably need the assistance of a network administrator at your site to perform these tuning operations.
Second, you should determine the receive buffer size for your machine. The receive buffer size for pople is 16 Mbytes. You should always perform large file transfers in the direction of the machine with the largest receive buffer size. You will probably need the assistance of a network administrator at your site to determine the size of your machine's receive buffers and to increase it, if possible.
Once you have performed these two steps, you must select a file transfer method. If your machine is on the TeraGrid, to get the best file transfer performance you should use the globus-url-copy command. If you are not on the TeraGrid but do have Kerberos authorization available on your machine you should use kftp. Otherwise you should use the scp program. If you use scp you should install the hpn-ssh patches. You will probably need the assistance of a network administrator at your site to install these patches.
Choosing a file transfer method is the final step you can take to improve your file transfer performance. There are no options to any of the above three commands which will impact your file transfer performance. If you have questions about the recommendations in this section or believe that your performance after following these recommendations is still substandard compared to your prior results on pople or other machines send email to remarks@psc.edu. Sometimes file transfer performance can be improved by routing changes made by network administrators, either at your site or at PSC.
Creating Programs
The Intel C, C++ and Fortran compilers and the GNU C and C++ compilers are installed on pople and they can be used to create OpenMP, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.
| | OpenMP | MPI | Hybrid | Serial |
| Intel Fortran | ifort -openmp myopenmp.f | ifort mympi.f -lmpi | ifort -openmp myhybrid.f -lmpi | |
| Intel C | icc -openmp myopenmp.c | icc mympi.c -lmpi | icc -openmp myhybrid.c -lmpi | |
| Intel C++ | | | | |
| GNU C | gcc -fopenmp myopenmp.c | gcc mympi.c -lmpi | gcc -fopenmp myhybrid.c -lmpi | |
| GNU C++ | | | | |
Man pages are available for ifort, icc and icpc and for gcc and g++.
You should use the system-supplied version of MPI. If that version is inadequate for your needs send email to remarks@psc.edu.
The UPC compiler is installed on pople. Online instructions for its use are available.
Several versions of Java are available on pople, including the jrockit version. To see which versions are available issue the command
module avail
and search for the java or jrockit modules. We recommend that you try all the versions to see which one performs the best for your program.
Running Jobs
Queue structure
Torque, an open source version of the Portable Batch System (PBS), controls all access to pople's compute processors, for both batch and interactive jobs. Currently pople has two queues: the batch queue and the debug queue. Interactive jobs can run in the debug queue and batch queue and the method for doing so is discussed below.
Jobs submitted to the batch queue are actually split by the system into two subqueues based on their walltime request: the batch_r and the batch_l queue. You do not submit jobs directly into the batch_r or batch_l queues.
Jobs in the batch_r or regular queue have requested walltimes that range up to and including 48 hours. Jobs in the batch_r queue can request up to and including 604 cores.
Jobs in the batch_l or long queue have requested walltimes that range above 48 hours up to and including 168 hours. Jobs in the batch_l queue can ask for at most 36 cores. The batch_l queue is to be used for long jobs that cannot be checkpointed.
The maximum walltime for the debug queue is 1 hour and the maximum number of cores you can request is 16. In total there are 32 cores allocated to the debug queue. The debug queue is not to be used for short production jobs.
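The limits for the two batch subqueues can be summarized in a small shell function. This is only a model of the rules stated above, not how the scheduler is implemented. The name batch_subqueue is hypothetical, and walltime is given in whole hours for simplicity.

```shell
#!/bin/sh
# batch_subqueue is a hypothetical helper, not a PBS command.
# Usage: batch_subqueue NCPUS WALLTIME_HOURS
batch_subqueue() {
    ncpus=$1
    hours=$2
    if [ $(( ncpus % 4 )) -ne 0 ]; then
        # Cores are allocated by blade, four at a time
        echo "rejected: ncpus must be a multiple of 4"
    elif [ "$hours" -le 48 ] && [ "$ncpus" -le 604 ]; then
        echo "batch_r"
    elif [ "$hours" -gt 48 ] && [ "$hours" -le 168 ] && [ "$ncpus" -le 36 ]; then
        echo "batch_l"
    else
        echo "rejected: exceeds queue limits"
    fi
}

batch_subqueue 64 48    # regular queue
batch_subqueue 36 168   # long queue
batch_subqueue 64 72    # too many cores for the long queue
```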
Scheduling policies
The batch and debug queues are basically FIFO queues. However, there are mechanisms in place to prevent a single user from dominating either queue and to prevent idle time on the machine. The result is some deviation from a strictly FIFO scheme.
There are suggestions below on how to improve your job turnaround. We will modify the scheduling policies on pople to meet user needs. If you have comments about the scheduling policies on pople or find that they do not meet your needs send email to remarks@psc.edu.
Sample batch jobs
To run a batch job on pople you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an OpenMP program is
#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myopenmp .
#run my executable
setenv OMP_NUM_THREADS 4
./myopenmp
ja -chlst
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command in your batch script, then you must include the -l option to your shell command.
The four #PBS lines are PBS directives.
- #PBS -l ncpus=4
-
This directive specifies the number of cores to allocate for the job. For performance reasons the actual allocation of resources is done by blades, with each blade containing four cores. You must request cores in multiples of four. Jobs do not share blades.
The value of ncpus is the number of cores requested. Here we request 4 cores. The number of cores must be a multiple of four, or the job submission will fail. Within your batch script the environment variable PBS_NCPUS is set to the number of cores you requested.
Each blade has 8 Gbytes of physical memory. If your job exceeds the amount of physical memory available to it--a job requesting 16 cores will run on 4 blades and thus have 32 Gbytes of memory available to it--it will be killed by the system with a message similar to
PBS: Job killed: cpuset memory_pressure X reached/exceeded limit 1
written to its stderr. A cpuset is the set of blades--cores and associated memory--assigned to your job. Memory pressure is a metric that indicates whether the processes on a blade are attempting to free up in-use memory on the blade to satisfy additional memory requests. Since this use of memory would result in significantly lower performance, a job that attempts to do this is killed by the system. For more information about cpusets and memory pressure see the man page man 4 cpuset.
If this happens to your job you should resubmit it and ask for more cores. The output from the ja command, which is discussed below, can help you determine how many blades your job needs.
- #PBS -l walltime=5:00
-
The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
-
The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
Your stdout and stderr files are each limited to 20 Mbytes. If your job exceeds either of these limits it will be killed by the system. If you have a program that you think will exceed either of these limits you should redirect your stdout output, your stderr output or both to a $SCRATCH file. Another option is to run your job from $SCRATCH.
- #PBS -q batch
-
The final PBS directive requests that your job be run in the batch queue. The system will route your job to the batch_r or batch_l queue based on your resource requests.
The remaining lines in the script are comments and command lines.
- set echo
-
This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants use
set -x
instead.
- ja
-
The ja command turns on job accounting for your job. This allows you to obtain information on the elapsed time and the memory and IO usage of your program, plus other data.
You must pair the command with another ja command at the end of your job. The option -t to this second ja command turns off job accounting and writes your accounting data to stdout. The other options to the second example ja command determine what output you will receive from ja. We recommend these options because we think they will provide detailed but useful information about your job's processes. However, you can look at the man page for ja to see what reporting options you want to use.
There is no overhead to using ja. We strongly recommend that you use ja so you can understand the resource usage of your jobs, which you can use when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.
- Comment lines
-
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- setenv OMP_NUM_THREADS 4
-
This command sets the number of threads for your OpenMP program to use. It is set to 1 by default. You should set this value to the number of cores you requested with your PBS ncpus directive so each of your threads will run on its own core.
- ./myopenmp
-
This command runs your executable.
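The blade arithmetic from the discussion of the ncpus directive above (4 cores and 8 Gbytes per blade) can be sketched as a shell function that turns a memory requirement into an ncpus value. The name ncpus_for_memory is hypothetical, and the sketch assumes the full 8 Gbytes on each blade is usable by your job, which may be optimistic.

```shell
#!/bin/sh
# ncpus_for_memory is a hypothetical helper.
# Usage: ncpus_for_memory GBYTES
# Prints the smallest ncpus value, a multiple of 4, whose blades
# together hold at least GBYTES of physical memory.
ncpus_for_memory() {
    gbytes=$1
    blades=$(( (gbytes + 7) / 8 ))   # 8 Gbytes per blade, rounded up
    [ "$blades" -ge 1 ] || blades=1  # always request at least one blade
    echo $(( blades * 4 ))           # 4 cores per blade
}

ncpus_for_memory 32   # fits on 4 blades
ncpus_for_memory 33   # needs a fifth blade
```

The memory figures reported by ja for an earlier run are a reasonable input to a calculation like this.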
A sample job to run an MPI program is
#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun -np 4 ./mympi
ja -chlst
This script is identical to the OpenMP script except when you run your executable. You do not have to set the variable OMP_NUM_THREADS, but you have to use the mpirun command to launch your executable on pople's compute processors. The value for the -np option is the number of cores you want your program to run on. You should set -np to the number of cores you requested with your PBS ncpus directive. You must use mpirun to run your MPI executable or it will run on a frontend and degrade overall system performance.
A sample job to run a hybrid OpenMP and MPI program is
#!/bin/csh
#PBS -l ncpus=64
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myhybrid .
#run my executable
setenv OMP_NUM_THREADS 4
mpirun -np 16 omplace -nt 4 ./myhybrid
ja -chlst
This script is identical to the above two scripts except when you run your executable. You use a combination of the mpirun and omplace commands to run your hybrid program. The value of the -np option to the mpirun command is the number of your MPI tasks. The value of the -nt option to the omplace command is the number of your OpenMP threads per MPI task. The value of the -nt option and the value you set OMP_NUM_THREADS to must be the same. The product of these two values should be the total number of cores you requested with your PBS ncpus specification.
The omplace command ensures that each of your OpenMP threads runs on its own core. You must use mpirun to run your hybrid executable or it will run on a frontend and degrade overall system performance.
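The relationship between ncpus, the number of MPI tasks and the number of OpenMP threads per task can be checked with a few lines of shell. The function below is a hypothetical helper that only prints the launch line; it does not run mpirun or omplace.

```shell
#!/bin/sh
# hybrid_command is a hypothetical helper that only prints a command line.
# Usage: hybrid_command NCPUS THREADS_PER_TASK EXECUTABLE
hybrid_command() {
    ncpus=$1
    nt=$2
    exe=$3
    # The product np * nt should equal the ncpus request,
    # so ncpus must divide evenly by the thread count
    if [ "$nt" -lt 1 ] || [ $(( ncpus % nt )) -ne 0 ]; then
        echo "error: ncpus must be divisible by the thread count" >&2
        return 1
    fi
    np=$(( ncpus / nt ))
    echo "mpirun -np $np omplace -nt $nt $exe"
}

hybrid_command 64 4 ./myhybrid
```

With the values from the sample script (64 cores, 4 threads per task) this prints the same mpirun line used there. OMP_NUM_THREADS would still need to be set to the same thread count.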
Qsub command
After you create your batch script you submit it to PBS with the qsub command.
qsub myscript.job
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample scripts and submit the scripts with the command
qsub -l ncpus=4 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
Flexible walltime requests
Two other qsub options are available for specifying your job's walltime request.
-l walltime_min=HH:MM:SS -l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.
For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.
As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
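The example above can be modeled in a few lines of shell. This is not the scheduler's actual logic, which is not described in this document. The function name backfill_walltime is hypothetical, and the model assumes the job is given the longest walltime within its range that fits the available slot.

```shell
#!/bin/sh
# backfill_walltime is a hypothetical model, not the scheduler's code.
# Usage: backfill_walltime MIN_HOURS MAX_HOURS SLOT_HOURS
backfill_walltime() {
    min=$1
    max=$2
    slot=$3
    if [ "$slot" -lt "$min" ]; then
        # Even the minimum request is longer than the slot
        echo "does not fit"
    elif [ "$slot" -lt "$max" ]; then
        # Assumption: the job receives the whole slot
        echo "$slot"
    else
        echo "$max"
    fi
}

backfill_walltime 2 4 3   # flexible request of 2 to 4 hours, 3-hour slot
backfill_walltime 4 4 3   # fixed 4-hour request, same slot
```

A fixed request is modeled here as a range whose minimum and maximum are equal.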
Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.
If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the Resource_List.walltime field of the output of the qstat -f command. The command
qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
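One way to capture the field is to filter the qstat -f output with sed. The sketch below assumes the field appears on a line of the form "Resource_List.walltime = HH:MM:SS"; check the output on pople before relying on it. The function name assigned_walltime is hypothetical.

```shell
#!/bin/sh
# assigned_walltime is a hypothetical helper.
# Reads "qstat -f" output on standard input and prints the value of
# the Resource_List.walltime field.
# Assumption: the line looks like "Resource_List.walltime = 03:00:00".
assigned_walltime() {
    sed -n 's/^[[:space:]]*Resource_List\.walltime[[:space:]]*=[[:space:]]*//p'
}

# Inside a batch job this could be used as:
#   qstat -f $PBS_JOBID | assigned_walltime
```

The pattern requires an equals sign directly after the field name, so other fields whose names merely begin with Resource_List.walltime are not matched.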
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit so that the file writing is completed when the limit is reached. The functions mpi_wtime and omp_get_wtime can be used to track how long your program has been running so that it writes checkpoint files to make sure you save results from your program's processing.
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following
timeout --timeout=$PROGRAM_TIME -- mpirun -np 4 ./mympi
The example assumes that your script has retrieved your job's walltime, subtracted 10 minutes from it and assigned the value of 3000 to the variable PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
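The subtraction described above can be sketched in Bourne shell as follows. The function names hms_to_seconds and program_time are hypothetical, and the walltime string is assumed to have already been retrieved from qstat -f.

```shell
#!/bin/sh
# hms_to_seconds and program_time are hypothetical helpers.

# Convert SS, MM:SS or HH:MM:SS to seconds.
hms_to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split the argument on ':'
    IFS=$old_ifs
    total=0
    for field in "$@"; do
        # Strip leading zeros so that "08" is not read as octal
        while [ "${#field}" -gt 1 ] && [ "${field#0}" != "$field" ]; do
            field=${field#0}
        done
        total=$(( total * 60 + field ))
    done
    echo "$total"
}

# Usage: program_time WALLTIME RESERVE_SECONDS
# Prints how many seconds the program may run so that RESERVE_SECONDS
# remain at the end of the job for file transfers.
program_time() {
    wall=$(hms_to_seconds "$1")
    reserve=$2
    if [ "$wall" -le "$reserve" ]; then
        echo "error: walltime is not longer than the reserve" >&2
        return 1
    fi
    echo $(( wall - reserve ))
}

PROGRAM_TIME=$(program_time 1:00:00 600)
echo "$PROGRAM_TIME"
```

For a 1-hour walltime and a 10-minute reserve this gives the value 3000 used in the example above, which could then be passed to the timeout command.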
For more information on the timeout command see the timeout man page. If you want assistance on the procedures needed to capture your job's actual walltime or to determine when your job should write checkpoint files send email to remarks@psc.edu.
How to improve your turnaround
We have several suggestions for how to improve your job turnaround. Firstly, you should try to be as accurate as possible in estimating the walltime request for your job. Asking for more time than your job will actually need will almost certainly result in poorer turnaround for your job.
Thus, routinely asking for the maximum walltime allowed for a job will almost always result in poorer turnaround. For example, if you have a job that asks for more than 36 cores--and will therefore be slotted in the batch_r queue--and you also ask for 48 hours of walltime, your turnaround will undoubtedly be very poor. A similar conclusion applies to jobs that ask for 36 or fewer cores and 168 hours of walltime.
Our second recommendation is that you always use flexible walltime requests if possible. This is especially helpful if your minimum walltime in your pair of walltime values is less than 8 hours.
Interactive access
A form of interactive access is available on pople by using the -I option to qsub. For example, the command
qsub -I -l ncpus=4 -l walltime=5:00 -q debug
requests interactive access to 4 cores for 5 minutes in the debug queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
X-11 connections in interactive use
In order to use any X-11 tool, you must also include -X on the qsub command line:
qsub -X -I -l ncpus=4 -l walltime=5:00 -q debug
This assumes that the DISPLAY variable is set. Two ways in which DISPLAY is automatically set for you are:
- Connecting to pople with ssh -X pople.psc.edu
- Enabling X-11 tunneling in your Windows ssh tool
Totalview, Fluent and TAU are among the packages which require X-11 connections.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on pople.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge-id you want a job to be charged. You can see your valid charge-ids by grepping for your entry in the /etc/group file. You replace 'charge_id' in the above option by the charge-id you want your job to be charged to. Your default charge-id is indicated by the group field in your entry in the /etc/passwd file. The fourth field in your entry in the /etc/passwd file is your group-id. If you grep for this number in the /etc/group file the first field of the output is your default charge-id. If you want to switch your default charge-id send email to remarks@psc.edu. If you only have one grant on pople you do not need to use this option. This option can only be specified as a command-line option.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after: this job can be scheduled after job jobid begins execution.
afterok: this job can be scheduled after job jobid finishes successfully.
afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
afterany: this job can be scheduled after job jobid finishes in any state.
before: this job must begin execution before job jobid can be scheduled.
beforeok: this job must finish successfully before job jobid begins.
beforenotok: this job must finish unsuccessfully before job jobid begins.
beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
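The lookup of your default charge-id described for the -W group_list option above can be sketched with awk. The function name default_charge_id is hypothetical. The file arguments are optional and default to /etc/passwd and /etc/group, the files named above.

```shell
#!/bin/sh
# default_charge_id is a hypothetical helper.
# Usage: default_charge_id USER [PASSWD_FILE] [GROUP_FILE]
default_charge_id() {
    user=$1
    passwd_file=${2:-/etc/passwd}
    group_file=${3:-/etc/group}
    # The fourth field of the passwd entry is the numeric group-id
    gid=$(awk -F: -v u="$user" '$1 == u { print $4; exit }' "$passwd_file")
    [ -n "$gid" ] || return 1
    # The first field of the group entry with that id is the charge-id
    awk -F: -v g="$gid" '$3 == g { print $1; exit }' "$group_file"
}

# Example:
#   default_charge_id "$USER"
```

Unlike a plain grep for the number, this compares the group-id against the third field only, so it does not match the same digits appearing elsewhere in a line.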
Using the module command in a batch script
The function of the module command is described below. Depending on your login shell and the shell you use in your batch script you may have to make changes to your batch script if you want to use the module command in your batch script.
If your login shell is csh and your batch script uses csh as its shell then if you need to use the module command in your batch script you must include the commands
source /usr/share/modules/init/csh
source /etc/csh.cshrc.psc
in your batch script after your PBS specifications. If you use tcsh as your login shell and as your batch shell you must include the commands
source /usr/share/modules/init/tcsh
source /etc/csh.cshrc.psc
in your script.
If your login shell is csh or tcsh and you use sh or bash in your batch script you must start your job with the line
#!/bin/sh -l
or
#!/bin/bash -l
depending on whether you want to use sh or bash in your batch script.
If your login shell is sh or bash and you use csh or tcsh as your batch shell you must include the two source commands in your batch script that were described above in the first case.
If your login shell is sh or bash and you use either sh or bash as your batch shell you do not need to make any changes to your batch scripts.
Using the module command in an interactive job
You do not need to issue any special commands if you want to use the module command in an interactive session, but you should not switch your shell from your login shell during your interactive session.
Monitoring and Killing Jobs
The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.
The qdel command is used to kill queued and running jobs. An example is the command
qdel 54
The argument to qdel is the jobid of the job you want to kill. The jobid is displayed when you submit your job, and you can also get it with the qstat command. If you cannot kill a job you want to kill send email to remarks@psc.edu.
Debugging
Debugging strategy
Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to solve code problems if you are using fewer processors. This strategy should be followed even if you are porting a working code from another system.
You should use the debug queue for your debugging runs. Do not run a debugging run on any of pople's front ends. You should always run a pople program with qsub.
The debug queue is intended to be used in the classic debugging cycle in which you run a debugging job, check its output and then submit another debugging job. You should not flood the debug queue with jobs, nor should you chain your jobs through the debug queue by having a debug job submit its successor.
The debug queue should not be used for production runs that use only a few processors.
TotalView
The TotalView debugger is installed on pople. Online instructions for its use are available.
Compiler options
Several compiler options can be useful to you when you are debugging your program. If you use the -g option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the -g option or your program will run slower.
The -check bounds option to the ifort compiler will cause your program to tell you if it exceeds an array bound while running.
Variables on pople are not automatically initialized. This can cause your program to fail if it relies on variables being initialized. The -check uninit and -ftrapuv options to the ifort compiler will catch certain cases of uninitialized variables, as will the -Wall and -O options to the GNU compilers.
There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.
Core files
The key to making core files on pople is to allow them to be written by increasing the maximum file size allowable for core files. The default size is 0 bytes. If you are using sh-type shells you do this by issuing the command
ulimit -c unlimited
For csh-type shells you issue the command
limit coredumpsize unlimited
Core files are created in directory ~/tmp. For more information about core files issue the command
man 5 core
Little endian versus big endian
The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Pople is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command
echo 'puts $tcl_platform(byteOrder)' | tclsh
You can read a big endian file on pople if you are using the Intel ifort compiler. Before you run your program issue the command
setenv FORT_CONVERTn BIG_ENDIAN
for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.
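If Python is available, the byte order discussion above can be illustrated with the standard struct module. This is a minimal sketch, not a replacement for the ifort conversion facility: the function name read_big_endian_doubles is hypothetical, and it assumes the input is a raw stream of 8-byte floating point numbers with no Fortran record markers.

```python
import struct
import sys

# Byte order of the machine running this script: "little" or "big".
print(sys.byteorder)

value = 1.5
little = struct.pack("<d", value)  # low-order byte at the lowest address
big = struct.pack(">d", value)     # high-order byte at the lowest address

# The two encodings hold the same bytes in reverse order.
print(little == big[::-1])


def read_big_endian_doubles(raw):
    """Decode a bytes object of consecutive big endian 8-byte floats."""
    count = len(raw) // 8
    return list(struct.unpack(">%dd" % count, raw[: count * 8]))
```

For example, read_big_endian_doubles(struct.pack(">2d", 1.5, -2.0)) returns the two original values regardless of the byte order of the machine on which it runs.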
Improving Performance
Calculating Mflops
You can calculate your code's Mflops rate using the TAU utility. The TAU examples show how to determine timing data and floating point operation counts for your program, from which you can calculate your Mflops rate.
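The arithmetic is simple once you have the two measurements. The following Python sketch shows the calculation; the function name mflops is hypothetical, and the sample numbers are invented for illustration rather than measured on pople.

```python
def mflops(flop_count, seconds):
    """Return millions of floating point operations per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return flop_count / seconds / 1.0e6


# A run that performs 3.0e9 floating point operations in 2.5 seconds.
print(mflops(3.0e9, 2.5))
```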
The perfmon utility
The perfmon utility, which is available on many platforms, provides access to a processor's hardware performance counters. You can use the perfmon utility to collect Mflops and other performance data for your program. Perfmon reports top-level data for your program and also sub-level data for each of your program's sub-processes. We have created a perfmon wrapper, which you can call at the start of your batch job if you want to collect this performance data and view it after your job runs. The wrapper automatically starts the pfmon command to monitor all your job's processes.
To enable pfmon, add the option -l other=enable_pfmon_collection to the qsub command when you submit your job. Pfmon will collect performance data while your program is running. Once your job finishes you must run the program pfmon_report_tool on a login node to examine your performance data output. The format of the command is
pfmon_report_tool jobid
For 'jobid' substitute the jobid of your job. The output from this command is sent to standard out. The ave_mflops result listed at the top of the output is the average Mflop value for your overall program. The output also includes performance data, including an Mflops value, for each of your program's sub-processes. You can use the latter information to determine if there is a performance problem with one of these sub-processes.
Pfmon runs on the blades that run your program. Thus, it can degrade the performance of your program by taking memory and program cycles from your program. We have found this effect to be usually negligible. The maximum degradation we have observed is 10%. If you are concerned about this effect you can compare the execution time of your program with and without using pfmon. Even if pfmon is degrading the performance of your program you can use it to judge directional performance changes when you modify your program to improve its performance.
For more information on perfmon see the pfmon man page and the Web page
http://perfmon2.sourceforge.net/
If you have questions about using perfmon send email to remarks@psc.edu.
Cache performance
Cache performance can have a significant impact on the performance of your program. Each pople core has three levels of cache. The primary data and instruction caches are 4-way set associative and 16 Kbytes each. The L2 data cache is 256 Kbytes, while the L2 instruction cache is 1 Mbyte. Both L2 caches are 8-way set associative. The L3 cache is 8 Mbytes.
You can measure your program's cache miss rate for each of the available caches by setting the appropriate counters when using the TAU utility. If you need assistance in measuring or improving your cache performance send email to remarks@psc.edu.
Collecting timing data
Collecting timing data is essential for measuring and improving program performance. We recommend five approaches for collecting timing data. The ja and /usr/bin/time utilities can be used to collect data at the program level. They report results to the hundredths of seconds. The TAU utility and the omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds, while the precision for MPI_Wtime is nanoseconds.
IO optimization
File striping
If your program reads or writes large files you should use $SCRATCH. Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and thus can significantly improve its performance. File striping can be used to tune your parallel IO performance and is particularly effective for files that are 1 Gbyte or larger.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. This is how you can use Lustre as a parallel file system. Pople currently has 24 OSTs.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, striping has other disadvantages besides possible IO contention, and these must be considered when making your striping decisions. Many interactive file commands such as ls -l or unlink will take longer for striped files. Also, striped files are more at risk for data loss due to hardware failure. If a file is spread across several OSTs a hardware failure of any of them will result in the loss of part of the data in that file. You may prefer to risk losing a small number of files entirely rather than losing parts of a large number of your files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is
lfs setstripe filename stripe-size OST-start stripe-count
We recommend that you always set the stripe size parameter to 0 and the starting OST parameter to -1. This will result in the default stripe size of 1 Mbyte and assign your starting OST in a round-robin fashion. A value of -1 for the stripe count means the file should be spread across all the available OSTs.
For example, the command
lfs setstripe bigfile.out 0 -1 -1
sets the stripe count for bigfile.out to be all available OSTs.
The command
lfs setstripe manyfiles.out 0 -1 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
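To see what a stripe count means in practice, the following Python sketch models a striped file as fixed-size pieces dealt out in round-robin order. It is a simplified model under stated assumptions: the function names are hypothetical, the default stripe size of 1 Mbyte and the total of 24 OSTs are taken from this section, and the sketch reports positions within the file's set of OSTs, not the OST numbers that Lustre actually chooses.

```python
MIB = 1 << 20  # the default stripe size of 1 Mbyte


def effective_stripe_count(stripe_count, total_osts=24):
    """Translate a stripe count of -1 into 'all available OSTs'."""
    if stripe_count == -1:
        return total_osts
    if not 1 <= stripe_count <= total_osts:
        raise ValueError("stripe count out of range")
    return stripe_count


def stripe_object(offset, stripe_count, stripe_size=MIB, total_osts=24):
    """Return which of the file's OSTs (0-based) holds the byte at offset."""
    count = effective_stripe_count(stripe_count, total_osts)
    return (offset // stripe_size) % count


def osts_used(file_size, stripe_count, stripe_size=MIB, total_osts=24):
    """Return how many OSTs hold at least one piece of the file."""
    count = effective_stripe_count(stripe_count, total_osts)
    pieces = -(-file_size // stripe_size)  # ceiling division
    return min(count, pieces)


# A 100 Mbyte file striped across all OSTs versus a stripe count of 1.
print(osts_used(100 * MIB, -1), osts_used(100 * MIB, 1))
```

In this model a 3 Mbyte file touches only 3 OSTs even with a stripe count of -1, which is one reason striping helps mainly for large files.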
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application. However, we do recommend that you use a stripe count that is less than 8.
There is a man page for lfs on pople. Online documentation for Lustre is also available. If you want assistance with what striping strategy to follow send email to remarks@psc.edu.
Placing processes for an MPI program
The cores on pople are arranged in a multi-level hierarchical system. Pople is divided into units called racks. Each rack has 4 IRUs (Individual Rack Units). Each IRU has 8 blades, and each blade has 4 cores.
Inter-rack, inter-IRU and inter-blade accesses all degrade program performance. The pople scheduler places jobs on the machine in a manner designed to reduce inter-rack and inter-IRU accesses. You can control how the processes in your MPI program are placed on the blades assigned to your jobs in order to reduce your inter-blade accesses.
If you follow the procedure discussed below your processes will be pinned to the cores to which you assign them. Otherwise the operating system can move your processes to different cores while your job is running. This could result in your processes running on different blades from those on which the memory allocated for those processes is located. The memory for a process is allocated on the blade on which it is first referenced and is not moved. In addition, if your processes are assigned to fixed locations, you can create a process allocation structure that best matches the communication topology of your program and thus results in fewer inter-blade accesses.
The first piece of information you need to allocate your MPI processes is the physical topology of pople, which is recorded in pople's topology file.
A typical entry in this file looks like
54 001c11^8#0c . . .
The interpretation of the entry is that physical core 54 is in rack 001, in IRU c11 within this rack, and in blade 8 within this IRU. Within blade 8 it is core 0c.
The entire set of entries for this blade is
52 001c11^8#0a . . .
53 001c11^8#0b . . .
54 001c11^8#0c . . .
55 001c11^8#0d . . .
This blade, like each blade, has 4 cores. Within a blade the core numbers vary from 0a to 0d. In this blade the physical core numbers vary from 52 to 55.
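If you want to process these entries with a script, the fields can be separated with a regular expression. The following Python sketch is based only on the sample entries shown above: the function name parse_entry is hypothetical, and it assumes that the rack is always three digits, that the IRU is the letter c followed by digits, and that the rest of each line can be ignored.

```python
import re

# physical core, rack, IRU, blade and core-on-blade, as in "54 001c11^8#0c"
ENTRY = re.compile(r"^\s*(\d+)\s+(\d{3})(c\d+)\^(\d+)#(\w+)")


def parse_entry(line):
    """Split one topology entry into its named parts."""
    match = ENTRY.match(line)
    if match is None:
        raise ValueError("unrecognized topology entry: %r" % line)
    core, rack, iru, blade, local = match.groups()
    return {
        "physical_core": int(core),
        "rack": rack,
        "iru": iru,
        "blade": int(blade),
        "core_on_blade": local,
    }


print(parse_entry("54 001c11^8#0c . . ."))
```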
You use environment variables to control how your MPI processes are distributed across your cores. If you set the environment variable MPI_DSM_VERBOSE to 1 before you run your program, the distribution of your processes across your cores will be printed on standard out. For each rank of your MPI program you will see to which physical core it has been assigned. You can then use the information in the topology file to see how your processes are distributed across your blades.
A common type of process distribution is block distribution. As an example of this type of distribution, assume you have a job that requests 16 cores. It will be assigned to 4 blades. In a block distribution, your first 4 processes will be assigned to your first blade, your second 4 to your second blade, and so on. To get this distribution of processes you set the environment variable MPI_DSM_CPULIST as follows before you run your program with mpirun
setenv MPI_DSM_CPULIST 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
The interpretation of this command is that your first process is assigned to your 0th core (the counting of cores here is of logical cores and is zero-based), your second process is assigned to your first core, which will be the second core on your first blade, and so on. In general, your ith process is assigned to the core that is the ith entry in the list you assign to MPI_DSM_CPULIST. In this case you can abbreviate the value of MPI_DSM_CPULIST as 0-15.
Block distribution is the default distribution on pople. However, if you do not set MPI_DSM_CPULIST, you do not pin your processes to cores. They can then be moved between cores while your job is running, which may result in an increase in inter-blade accesses.
If you want a different distribution for performance reasons you must give a different value to MPI_DSM_CPULIST. For example, if you are still using 4 blades, but want a cyclic distribution of processes to cores, you would set MPI_DSM_CPULIST as follows
setenv MPI_DSM_CPULIST 0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15
The interpretation of this command is that your first process is assigned to your 0th core, your second process is assigned to your 4th core, which will be the first core on your second blade, and so on.
The result in this example will be a cyclic distribution of your processes. Your first process will be assigned to your first core on your first blade, your second process is assigned to the first core on your second blade, your third process is assigned to the first core on your third blade, and your fourth process is assigned to the first core on your fourth blade. Then the allocation process wraps around to your first blade. Your fifth process is assigned to the second core on your first blade, your sixth process is assigned to the second core on your second blade, and so on, until all 16 of your processes have been allocated.
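For larger jobs it is easier to generate these lists than to type them. The following Python sketch builds the block and cyclic lists for a given number of blades; the function names are hypothetical, and the default of 4 cores per blade is taken from the description of pople above.

```python
def block_cpulist(blades, cores_per_blade=4):
    """Logical cores in order: fill each blade before moving to the next."""
    return list(range(blades * cores_per_blade))


def cyclic_cpulist(blades, cores_per_blade=4):
    """Logical cores dealt one per blade, wrapping around after the last blade."""
    return [
        blade * cores_per_blade + core
        for core in range(cores_per_blade)
        for blade in range(blades)
    ]


def as_env_value(cpus):
    """Format a list of cores as a value for MPI_DSM_CPULIST."""
    return ",".join(str(cpu) for cpu in cpus)


# The 16-core, 4-blade example used in this section.
print(as_env_value(block_cpulist(4)))
print(as_env_value(cyclic_cpulist(4)))
```

For 4 blades the second line printed is 0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15, which matches the cyclic setting shown above.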
For more information on the distribution of processes for MPI programs see the mpi man page.
Placing processes for an OpenMP program
The cores on pople are arranged in a multi-level hierarchical system. Pople is divided into units called racks. Each rack has 4 IRUs (Individual Rack Units). Each IRU has 8 blades, and each blade has 4 cores.
Inter-rack, inter-IRU and inter-blade accesses all degrade program performance. The pople scheduler places jobs on the machine in a manner designed to reduce inter-rack and inter-IRU accesses. You can control how the processes in your OpenMP program are placed on the blades assigned to your jobs in order to reduce your inter-blade accesses.
If you follow the procedure discussed below your processes will be pinned to the cores to which you assign them. Otherwise the operating system can move your processes to different cores while your job is running. This could result in your processes running on different blades from those on which the memory allocated for those processes is located. The memory for a process is allocated on the blade on which it is first referenced and is not moved. In addition, if your processes are assigned to fixed locations, you can create a process allocation structure that best matches the communication topology of your program and thus results in fewer inter-blade accesses.
You use the dplace command to distribute your OpenMP processes across your cores. A sample dplace command is
dplace -p placementfile ./myopenmp
In this command "myopenmp" is the name of your program and "placementfile" is the name of a file that contains your distribution scheme. You use the dplace command to run your OpenMP program instead of just having the command consist of the name of your executable.
A common type of process distribution is block distribution. As an example of this type of distribution, assume you have a job that requests 16 cores. It will be assigned to 4 blades. In a block distribution, your first 4 processes will be assigned to your first blade, your second 4 to your second blade, and so on.
To get this distribution the contents of your placement file should be
thread cpu=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 exact
The interpretation of this specification is that your first process is assigned to your 0th core (the counting of cores here is of logical cores and is zero-based), your second process is assigned to your first core, which will be the second core on your first blade, and so on. In general, your ith process is assigned to the core that is the ith entry in the list you assign to the cpu parameter. In this case you can abbreviate the value of the cpu parameter as 0-15. The exact parameter is what pins your processes to the cores to which they are assigned. When you set MPI_DSM_CPULIST for an MPI program you are actually causing a dplace command to be run using the exact parameter.
Block distribution is the default distribution on pople. However, if you do not use the dplace command with the exact parameter, you do not pin your processes to cores. They can then be moved between cores while your job is running, which may result in an increase in inter-blade accesses.
If you want a different distribution for performance reasons your placement file must have a different content. For example, if you are still using 4 blades, but want a cyclic distribution of processes to cores, your placement file should look like
thread cpu=0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15 exact
The interpretation of this specification is that your first process is assigned to your 0th core, your second process is assigned to your 4th core, which will be the first core on your second blade, and so on.
The result in this example will be a cyclic distribution of your processes. Your first process will be assigned to your first core on your first blade, your second process is assigned to the first core on your second blade, your third process is assigned to the first core on your third blade, and your fourth process is assigned to the first core on your fourth blade. Then the allocation process wraps around to your first blade. Your fifth process is assigned to the second core on your first blade, your sixth process is assigned to the second core on your second blade, and so on, until all 16 of your processes have been allocated.
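Before you run a job you can check a cpu value by working out which blade each process will land on. The following Python sketch does this for comma-separated values and for ranges such as 0-15; the function names are hypothetical, it assumes 4 cores per blade as described above, and it handles only the two forms of the cpu value that appear in this section.

```python
def expand_cpu_value(value):
    """Expand a cpu value such as "0-15" or "0,4,8,12" into a list of cores."""
    cpus = []
    for part in value.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def blade_of_each_process(cpus, cores_per_blade=4):
    """Return the 0-based blade on which each process in turn is placed."""
    return [cpu // cores_per_blade for cpu in cpus]


cyclic = expand_cpu_value("0,4,8,12,1,5,9,13,2,6,10,14,3,7,11,15")
print(blade_of_each_process(cyclic))
```

For the cyclic value the blades repeat as 0, 1, 2, 3, while for the block value 0-15 the first four processes all share blade 0.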
The -o option to the dplace command can be used to generate a log file that shows how your processes are distributed across your cores. For more information on the distribution of processes for OpenMP programs see the dplace man page.
Third-party software
Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.
For example, we recommend the FFTW library for FFTs. For linear algebra routines we recommend the MKL library.
Performance monitoring tools
We have installed several performance monitoring tools on pople. The TAU utility is a performance profiling and tracing tool. The PAPI utility can be used to access the hardware performance counters on pople. We intend to install more performance tools on pople. If you want assistance in using any of these tools or have a utility you would like us to install send email to remarks@psc.edu.
Assistance with improving performance
If you would like to improve the performance of your code, you can get optimization assistance from PSC. This assistance includes consulting assistance from PSC, special queue handling if necessary, and service unit discounts, all of which are designed to enable you to scale up your code as quickly as possible. Send email to remarks@psc.edu if you would like performance improvement assistance with your program.
PSC also offers workshops on program optimization, which you can attend. The material from these workshops is available online.
Software Packages
A list of software packages installed on pople is available. If you would like us to install a package that is not in this list send email to remarks@psc.edu.
The Module Command
To run many software packages paths and other environment variables must first be set. To change versions of a package these definitions often have to be modified. The module command makes this process easier. You can use the module command to load a modulefile for a package that sets the necessary paths and variables to run the package.
The command
module avail
displays all the available modules, while the command
module list
displays your currently loaded modules.
The module load command loads a specific module. For example, the command
module load icc/10.1.015
sets the proper definitions for you to use version 10.1.015 of the icc compiler. When you are done with a module you can unload it and undo its effect.
module unload icc/10.1.015
Or you can swap it with another module if you want to use another version of the compiler.
module swap icc/10.1.015 icc/10.1.017
After you issue this command, when you run the icc compiler you will be using version 10.1.017 of the compiler, not version 10.1.015.
The module help command displays information about a module. For example, the command
module help tau2/tau2
displays information about this module for the TAU package. The module display command shows what changes a module would make to your environment if you loaded it, without actually making those changes. Thus, the command
module display tau2/tau2
shows what changes this TAU module would make to your environment if you loaded it.
For more information on the module command see the module and modulefile man pages and the Web page
http://www.psc.edu/general/software/packages/module/
Additional information is also available at
http://modules.sourceforge.net/
Pople and the TeraGrid
Pople is on the TeraGrid. Thus, you have additional methods of connecting to pople, of transferring files to and from pople and of running jobs on pople. For information on using the TeraGrid see the general online documentation for the TeraGrid and the PSC-specific online TeraGrid documentation.
Stay Informed
As a user of pople, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently. In addition, important system information is posted to the PSC's Web page of bboard posts.
You will also periodically receive email from PSC with information about pople. To ensure that you receive this email, you should make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
Acknowledgement in Publications
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on pople be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.
Reporting a Problem
You have several options for reporting problems on pople.
- If you are a TeraGrid user you can send email to help@teragrid.org, mentioning PSC in the subject line. You will get an acknowledgement from the TeraGrid Operations Center, and then you will be contacted by PSC staff.
- If you are a non-TeraGrid user you can send email to remarks@psc.edu.
- You can call the User Services Hotline at 1-800-221-1641 from 9:00 a.m. until 8:00 p.m., Eastern time, on weekdays, and from 9:00 a.m. until 4:00 p.m., Eastern time, on Saturdays.