- System Configuration
- Access to Blacklight
- File Spaces
- Transferring files
- Improving Your File Transfer Performance
- Creating Programs
- Running Jobs
- Monitoring and Killing Jobs
- Improving Performance
- Software Packages
- The Module Command
- Blacklight and XSEDE
- Stay Informed
- Reporting a Problem
Blacklight is an SGI UV 1000 cc-NUMA shared-memory system comprising 256 blades. Each blade holds 2 Intel Xeon X7560 (Nehalem) eight-core processors, for a total of 4096 cores across the whole machine. Each core has a clock rate of 2.27 GHz, supports two hardware threads and can perform 9 Gflops. Thus, the total floating point capability of the machine is 37 Tflops.
The sixteen cores on each blade share 128 Gbytes of local memory. Thus, each core has 8 Gbytes of memory and the total capacity of the machine is 32 Tbytes. This 32 Tbytes is divided into two partitions of 16 Tbytes of hardware-enabled shared coherent memory. Thus, users can run shared memory jobs that ask for as much as 16 Tbytes of memory. Hybrid jobs using MPI and threads and UPC jobs that need the full 32 Tbytes of memory can be accommodated on request.
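The machine totals quoted above follow directly from the per-blade figures. As a minimal sketch, the arithmetic can be checked in Python; the constant and function names are illustrative and not part of any PSC tool.

```python
# Per-blade figures as stated in the system description above.
BLADES = 256
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 8
GFLOPS_PER_CORE = 9
GBYTES_PER_BLADE = 128


def machine_totals():
    """Derive whole-machine totals from the per-blade figures."""
    cores_per_blade = SOCKETS_PER_BLADE * CORES_PER_SOCKET
    total_cores = BLADES * cores_per_blade
    return {
        "cores_per_blade": cores_per_blade,
        "total_cores": total_cores,
        # 1 Tflops = 1000 Gflops
        "peak_tflops": total_cores * GFLOPS_PER_CORE / 1000,
        "gbytes_per_core": GBYTES_PER_BLADE // cores_per_blade,
        # 1 Tbyte = 1024 Gbytes
        "total_tbytes": BLADES * GBYTES_PER_BLADE // 1024,
    }


if __name__ == "__main__":
    for name, value in machine_totals().items():
        print(f"{name}: {value}")
```

The peak figure works out to just under 37 Tflops, which the text rounds to 37.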
Blacklight runs an enhanced version of the SuSE Linux operating system.
Blacklight also has frontend processors. You log in to one of these frontends, not to blacklight's compute processors. You will usually prepare your job files and compile your programs on a frontend, but you should not run your programs there because the frontends have limited resources. Instead, run your programs through blacklight's batch system. Frontend computations that consume too much processor time or memory will be killed, and a message will be sent to your display indicating which resource limit you violated. How to run a job on blacklight's compute processors with the qsub command is discussed below.
The Intel C, C++ and Fortran compilers and the GNU Fortran, C and C++ compilers are installed on blacklight, as are the facilities to enable you to run threaded, MPI and hybrid threaded and MPI programs. OpenMP programs are a common type of threaded program supported on blacklight. UPC and Java are available on the machine and are briefly documented below. In the future, more programming models that exploit the unique architectural features of blacklight will be available. When those models are available, their use will be described in this document.
Also available are an array of performance tools, debuggers and libraries. A complete list of packages installed on blacklight is available online.
Access to Blacklight
Should you get a blacklight account?
Blacklight is a machine with a very large shared memory and many cores. Thus, if your application needs large shared memory, blacklight may be a suitable platform for you, whether your job needs a few dozen cores or thousands of cores. Blacklight also has a large set of installed software packages that exploit blacklight's unique hardware features. If you need a package that is not installed, send email to firstname.lastname@example.org to ask PSC to consider the installation of that package.
XSEDE provides a set of tools and services to help you to decide if blacklight or any other XSEDE supercomputer is appropriate for your work. XSEDE provides a catalog of available XSEDE resources and their features. XSEDE also provides a list of available science gateways. Gateways are specialized interfaces that can make your work easier. Finally, XSEDE provides a list of Campus Champions. A Campus Champion is a local contact you can use to discuss whether a supercomputer would be useful for your work.
Test runs on your local systems can be used to determine if a supercomputer is necessary for your research. You can also send email to email@example.com or firstname.lastname@example.org for consulting assistance to help you determine whether a supercomputer would be helpful to you.
Once you decide you would like to use an XSEDE resource, such as blacklight, you must apply for a grant through the XSEDE User Portal. The first step in using the Portal is to apply for a Portal account. Then you can determine if you are eligible to become a Principal Investigator for an XSEDE grant. Normally your first XSEDE proposal will be for a Startup grant, but you can also apply directly for a Research grant. A complete description of the allocation policies for applying for grants for using XSEDE resources is available online.
Getting an account on blacklight
PSC's blacklight compute resource allocations are managed by the XSEDE program, which is supported by the National Science Foundation. For more information on applying for an allocation on blacklight see the instructions at
Managing your blacklight account
Once you have a blacklight account, there are several scenarios that can disrupt your project's progress. First, you can experience difficulties getting started. Running on a supercomputer is not a frictionless task. If you find yourself in this position, send email to email@example.com. PSC has several levels of consultants available to help you. This assistance ranges from basic questions of computer usage to scientific support for your particular project. PSC consultants will be able to assist you in getting your project started.
A second issue that can derail your project is running out of SUs. A Service Unit (SU) is defined as the use of one core for one hour. When you receive a grant, you are awarded an amount of computing time on a resource defined in terms of SUs. Each grant also has an expiration date. To avoid running out of SUs and to keep computing without interruption, monitor your SU balance frequently with the blacklight xbanner command. If you are running out of SUs you have several options. If your grant does not expire for at least three months, you can submit a Renewal proposal for the next XRAC meeting and then ask for up to 25% of the SUs in your new award as an advance. If your grant expires in less than three months you can request a Supplement to your current grant. A third option is to transfer SUs from another machine on your grant to blacklight.
These actions of submitting a Renewal request, asking for an advance, asking for a Supplement or transferring SUs must all be done by the grant's PI through the XSEDE POPS system. This system must be accessed through your User Portal account. Detailed descriptions for these operations with your SUs are available online.
Another event that can interrupt your computing is the expiration of your grant. Even though you have distinct PSC and XSEDE Portal userids they are linked as far as expiration is concerned. When one expires the other will expire. To ensure that you are not caught unawares by a grant expiration you should continually monitor your expiration dates with the blacklight xbanner command. The XRAC committee meets every three months. Thus, when your grant approaches three months from expiration the PI for the grant should submit a Renewal request to XRAC. If you have an expiring Startup grant the PI should almost always submit a Research proposal to XRAC. Renewals of Startup grants are rarely given. The data you gathered from using your Startup grant can be used as justification in your Research proposal.
If your grant is going to expire and you estimate that you will have SUs left when it does, you can respond differently. For example, you obtained a Startup grant, but did not use it very much. In this situation, instead of submitting a proposal for a new grant, you can apply for an Extension of up to six months for your existing grant.
These actions of submitting a Renewal request, submitting a Research proposal or asking for a grant Extension must all be done by the grant's PI through the XSEDE POPS system. This system must be accessed through your User Portal account. Detailed descriptions of the policies for these operations are available online.
A fourth impediment to the progress of your project is the need to add a user to a grant. The PI for the grant must add all users to a grant by using the Add User form, which must be accessed through the PI's User Portal account. Each user to be added must already have created a User Portal userid through the User Portal. The policies for adding a user are described online.
Running Gaussian on blacklight
An account on blacklight does not by itself give you access to Gaussian. To use Gaussian at PSC, you must also fill out our online PSC Gaussian User Agreement.
If you have questions about access to Gaussian send email to firstname.lastname@example.org.
Connecting to blacklight
There are three methods you can use to connect to blacklight.
You can connect to blacklight by using XSEDE's single-signon process, either through the XSEDE Portal or by downloading the necessary software components to your local machine. This method is especially useful if you have allocations on more than one XSEDE resource. To use this method you must have an XSEDE grant. To connect through the Portal you use your XSEDE Portal userid and password.
You can also connect to blacklight by using ssh to connect to blacklight.psc.teragrid.org. When you are prompted for a password by ssh enter your PSC Kerberos password. Your PSC Kerberos password is not the same as your XSEDE Portal password nor is your PSC userid necessarily the same as your XSEDE Portal userid. You will need to set your PSC Kerberos password before the first time you connect to blacklight using ssh. This method can be used with any type of blacklight grant.
Finally, you can connect to blacklight using a public-private key pair. The procedure for how to do this is available online. You will need to set your PSC Kerberos password before you can use this procedure. This method can be used with any type of blacklight grant.
Changing your PSC Kerberos password
Your PSC Kerberos password is not the same as your XSEDE Portal password. Resetting one password does not change the value of the other password. You use your PSC password to connect to blacklight and other PSC machines with ssh. You use your XSEDE password when you connect using the XSEDE Portal.
There are two ways to change or reset your PSC Kerberos password:
- Use the web-based PSC password change utility
- Use the kpasswd command to change your PSC Kerberos password. Do not use the passwd command.
You have the same password on all PSC production platforms. When you change your password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.
PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least 3 of the character classes:
- lower-case letters
- upper-case letters
- special characters, excluding ' and "
Finally, they must not be the same as any of your previous passwords.
You must change your blacklight password within 30 days of the date on your initial password form or your password will be disabled. We will also disable your password if you do not change it at least once a year. We will send you an email notice warning you that your password is about to be disabled in the latter case. See the PSC password policies for more information.
If you have a password issue and contact PSC about it through remarks, do not include your password in your email message.
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on blacklight
One core-hour on blacklight is one SU. Because resources are allocated by blades and not by cores--jobs do not share blades--your SU charges will always be based on core usage that is a multiple of sixteen. A one blade job that runs for one hour costs 16 SUs.
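Because charging is done by whole blades, the cost of a job can be estimated before it is submitted. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and it assumes the charge is simply allocated cores multiplied by hours of runtime, as described above.

```python
import math

CORES_PER_BLADE = 16  # jobs are allocated whole blades


def su_charge(cores_needed, wall_hours):
    """Estimate the SU cost of a job.

    cores_needed is rounded up to a whole number of blades, because
    blades are not shared between jobs. One core-hour is one SU.
    """
    if cores_needed <= 0 or wall_hours < 0:
        raise ValueError("cores_needed must be positive and wall_hours non-negative")
    blades = math.ceil(cores_needed / CORES_PER_BLADE)
    return blades * CORES_PER_BLADE * wall_hours


if __name__ == "__main__":
    print(su_charge(16, 1))    # one blade for one hour
    print(su_charge(64, 2.5))  # four blades for two and a half hours
```

A program that only uses 20 cores still occupies two blades, so it is charged for 32 cores.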
If you have more than one account, use the qsub option -W group_list to indicate to which account you want a job to be charged. The use of this option is discussed in the "Other qsub options" subsection of this document. To change your default account you must send email to email@example.com with this request.
User accounting data is available with the xbanner command. Account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant are displayed.
Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to PIs. The system has extensive internal documentation.
File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on blacklight.
This is your home directory. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all of blacklight's compute and frontend processors. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to the Data Supercell*. The Data Supercell, PSC's file archival system, is discussed below.
You can check your home directory space usage using the command
Once you cd to your home directory you can issue this command.
This is blacklight's scratch area to be used as a working space for your running jobs. $SCRATCH is visible to all of blacklight's compute and frontend processors. $SCRATCH is a parallel file system. The current capacity of $SCRATCH is 291 Tbytes.
$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 21 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to the Data Supercell as soon as you can after you create them. The Data Supercell, PSC's file archival system, is discussed below. For information on improving your $SCRATCH IO performance see the section on IO optimization below.
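To see which of your files are approaching the 21-day limit, you can scan a directory tree for files older than a chosen age. The sketch below is only an illustration: the function name is hypothetical, and it assumes file age is judged by modification time, which this document does not specify.

```python
import os
import time


def files_older_than(root, days, now=None):
    """Return a sorted list of files under root not modified in the last `days` days.

    Assumption: age is measured by modification time (st_mtime). The
    actual criterion used when $SCRATCH is cleaned may differ.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    scratch = os.environ.get("SCRATCH", ".")
    # List files older than 14 days, leaving a week to copy them elsewhere.
    for path in files_older_than(scratch, 14):
        print(path)
```

Choosing a threshold shorter than 21 days leaves time to copy the listed files to your local site or to the Data Supercell.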
You should use $SCRATCH as your scratch area. You should not use /tmp for this purpose. You should never need to write data to /tmp.
You can check your scratch directory space usage by using the command
Once you cd to $SCRATCH you can issue this command.
File repositories are discrete file storage spaces. You cannot, for example, open a file that resides in a file repository nor will you run a program on a file repository. You will not login to a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on blacklight: the Data Supercell, PSC's file archival system.
- The Data Supercell (patent pending)
The Data Supercell is a complex disk-based archival system.
A variety of file transfer methods are supported to copy files to and from blacklight filesystems, including PSC's File ARchiver client (far), SSH (sftp or scp), and GridFTP (globus-url-copy or Globus Online).
Home Directory Quota Note: Since your blacklight home directory has a limited amount of space, you will not be able to transfer much data into your home directory. Exceeding your home directory quota will prevent you from writing more data to your home directory, and will adversely impact other operations you might want to perform. We recommend that you use your long-term storage on PSC's archiver, the Data Supercell.
You use far to transfer files between blacklight's filesystems and PSC's archiver system, the Data Supercell. In addition to file transfers, the far program can also be used to obtain listings of your files on the Data Supercell and for file and directory management needs (see far documentation). Globus Online, which is discussed below, can also be used to transfer your data between blacklight and the Data Supercell. Globus Online is our recommended method for transferring data between blacklight and the Data Supercell.
Note: We recommend that you execute far commands outside of your batch compute job scripts so that your jobs do not tie up compute processors and expend your computing allocation while your files are being transferred to/from data.psc.edu.
SSH sftp and scp
You can use the SSH file transfer clients, sftp and scp, to transfer files between your local systems and blacklight filesystems. When using sftp or scp to transfer files to and from blacklight you do not connect directly to blacklight. Instead, you transfer files through a PSC high-speed data conduit named data.psc.xsede.org. If you are not connecting to the data conduit from an XSEDE host you must use the name data.psc.edu for the data conduit. If you have a graphical sftp or scp client application on your local system, you can use it to connect and authenticate to data.psc.xsede.org and transfer files accordingly. Use your PSC userid and password for authentication. If you need to (re)set your PSC password, you can do so via the kpasswd command on any PSC production system, or using the http://apr.psc.edu/ Web form.
You can use the command-line sftp client to transfer files to and from blacklight interactively. When using sftp from the command line, you first connect and authenticate to data.psc.xsede.org, and then issue commands at the sftp> prompt to transfer and manage files:
joeuser is your PSC userid. The first time you connect to data.psc.xsede.org using sftp or scp, you may be prompted to accept the server's host key. Enter yes to accept the host key:
    The authenticity of host 'data.psc.xsede.org (220.127.116.11)' can't be established.
    RSA key fingerprint is d5:77:f2:d9:07:f6:32:b6:c3:eb:0d:d1:29:ed:9b:80.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'data.psc.xsede.org' (RSA) to the list of known hosts.
You will then be prompted to enter your PSC password:
    firstname.lastname@example.org's password:
    Connected to data.psc.xsede.org.
    sftp>
At the sftp> prompt, you can then enter sftp commands (e.g., get) to manage and transfer your files to/from blacklight. Enter a question mark (?) for a list of available sftp commands.
Your default directory when using data.psc.xsede.org is your Data Supercell home directory. To access your blacklight scratch directory you must change your directory to be your blacklight scratch directory or use your entire scratch directory path in your sftp command. Your scratch directory path will be similar to /brashear/joeuser. To access your blacklight home directory you must change your directory to be your blacklight home directory or use your entire home directory path in your sftp command. Your home directory path will be similar to /usr/users/0/joeuser.
In the following examples, joeuser is your PSC userid and the commands you enter follow the sftp> prompt.
- Where am I (on the data.psc.xsede.org server)?
    sftp> pwd
    Remote working directory: /arc/users/joeuser
- Change directories to my blacklight $SCRATCH directory:
    sftp> cd /brashear/joeuser
    Remote working directory: /brashear/joeuser
- Where am I on my local system?
    sftp> lpwd
    Local working directory: /Users/JoeUser/Documents
- Change directories on my local system to /usr/local/projects/example/data:
    sftp> lcd /usr/local/projects/example/data
- Make a new directory called "newdata" under my blacklight $SCRATCH directory:
    sftp> mkdir /brashear/joeuser/newdata
- Copy a file (file1.dat) from my current local directory to the newdata directory under my blacklight $SCRATCH directory:
    sftp> put file1.dat /brashear/joeuser/newdata/file1.dat
    Uploading file1.dat to /brashear/joeuser/newdata/file1.dat
    file1.dat 100% 1016KB 1.0MB/s 00:00
- Copy a file (file1) from my blacklight $HOME directory to /usr/local/projects/example/data/newfile1 on my local system:
    sftp> get /usr/users/0/joeuser/file1 /usr/local/projects/example/data/newfile1
    Fetching /usr/users/0/joeuser/file1 to /usr/local/projects/example/data/newfile1
    /usr/users/0/joeuser/file1 100% 31 0.0KB/s 00:00
- Exit from this sftp session:
For scripted transfers, or transfers that you want to execute directly from your command-line shell, you can use the SSH scp client:
In the following examples, joeuser is your PSC userid, and you enter your PSC password when prompted.
- Copy my local file (/usr/local/projects/example/data/file1.dat) to my blacklight $SCRATCH directory:
    $ scp /usr/local/projects/example/data/file1.dat email@example.com:/brashear/joeuser
    firstname.lastname@example.org's password:
    file1.dat 100% 1016KB 1.0MB/s 00:00
- Copy the contents of the newdata directory under my blacklight $SCRATCH directory to /tmp on my local system (creating /tmp/newdata and copying all the files there):
    $ scp -r email@example.com:/brashear/joeuser/newdata /tmp
    firstname.lastname@example.org's password:
    file2.dat 100% 1016KB 1.0MB/s 00:00
    file3.dat 100% 1016KB 1.0MB/s 00:01
    file1.dat 100% 1016KB 1.0MB/s 00:00
GridFTP globus-url-copy and Globus Online
XSEDE users may use GridFTP to transfer files to and from blacklight filesystems.
To use the command-line globus-url-copy client on a blacklight login node, first ensure that you have a current user proxy certificate for authentication with enough time on it to complete your transfer, e.g.:
    joeuser@tg-login1:~> grid-proxy-info
    subject  : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
    issuer   : /C=US/O=National Center for Supercomputing Applications/OU=Certificate Authorities/CN=MyProxy
    identity : /C=US/O=National Center for Supercomputing Applications/CN=Joe User
    type     : end entity credential
    strength : 2048 bits
    path     : /tmp/x509up_u99999
    timeleft : 11:58:33
If the timeleft is not sufficient, or you get an "ERROR: Couldn't find a valid proxy" message, then use myproxy-logon (or if you have your own long term user certificate, grid-proxy-init) to obtain a new user proxy certificate, e.g.:
    joeuser@tg-login1:~> myproxy-logon -l joexsedeuser -t 24
    Enter MyProxy pass phrase:
    A credential has been received for user joexsedeuser in /tmp/x509up_u99999.
Here joexsedeuser is your XSEDE User Portal login name, -t 24 requests a 24-hour certificate, and the MyProxy pass phrase entered is your XSEDE User Portal password.
You can then use globus-url-copy to transfer files to/from blacklight filesystems using the GridFTP server address gsiftp://gridftp.psc.xsede.org. This transfer will go through the PSC high-speed data conduit data.psc.xsede.org.
gsiftp:// URLs are absolute paths to files. This means that when referring to a file or directory in your blacklight $SCRATCH directory, you must use its full path, such as /brashear/joeuser, where joeuser is your blacklight userid. Likewise, for your blacklight $HOME directory, you must use its full path, which will be similar to /usr/users/0/joeuser.
- List the files in my blacklight $SCRATCH directory:
    joeuser@tg-login1:~> globus-url-copy -list gsiftp://gridftp.psc.xsede.org/brashear/joeuser
    gsiftp://gridftp.psc.xsede.org/brashear/joeuser/
    file1.dat
    file2.dat
    file3.dat
    newdata/
    olddata/
- Transfer a file (testfile) from my scratch space on TACC lonestar to the newdata directory under my PSC blacklight $SCRATCH directory:
    joeuser@tg-login1:~> globus-url-copy -stripe -tcp-bs 32M \
        gsiftp://gridftp.lonestar.tacc.xsede.org/scratch/99999/tg987654/testfile \
        gsiftp://gridftp.psc.xsede.org/brashear/joeuser/newdata/
Here -stripe and -tcp-bs 32M are used to improve transfer performance, and /scratch/99999/tg987654 is your scratch directory on lonestar at TACC.
Globus Online users can access blacklight filesystems at endpoint psc#blacklight. You authenticate to the psc#blacklight endpoint using your XSEDE User Portal username and password. When connecting to the psc#blacklight endpoint on Globus Online, you may be redirected to the XSEDE OAuth page to enter your XSEDE User Portal username and password for authentication, after which you will automatically be returned to the Globus Online site to initiate your transfers.
If you do not enter a path for the psc#blacklight endpoint your destination will be your Data Supercell home directory. Thus you must explicitly enter the path to either your blacklight scratch directory or your blacklight home directory depending on which you want to be the target of your file transfer.
Improving Your File Transfer Performance
File transfer performance between your local systems and Blacklight filesystems can be significantly improved by ensuring that your local systems' networking parameters are optimized. Guidance is available at PSC's Enabling High Performance Data Transfers webpage.
For improved performance when using SSH (sftp or scp), we recommend using an SSH package that includes PSC's High Performance Networking (HPN) patches, e.g., GSI-OpenSSH. For instructions to build OpenSSH with PSC's HPN patches, consult the PSC High Performance SSH/SCP - HPN-SSH webpage.
The Intel C, C++ and Fortran compilers and the GNU Fortran, C and C++ compilers are installed on blacklight and they can be used to create OpenMP, pthreads, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.
| Compiler | OpenMP | pthreads | MPI | Hybrid | Serial |
|---|---|---|---|---|---|
| Intel Fortran | ifort -openmp myopenmp.f90 | ifort -pthread mypthread.f90 | ifort mympi.f90 -lmpi | ifort -openmp myhybrid.f90 -lmpi | ifort myserial.f90 |
| Intel C | icc -openmp myopenmp.c | icc -pthread mypthread.c | icc mympi.c -lmpi | icc -openmp myhybrid.c -lmpi | icc myserial.c |
| Intel C++ | icpc -openmp myopenmp.cc | icpc -pthread mypthread.cc | icpc mympi.cc -lmpi -lmpi++ | icpc -openmp myhybrid.cc -lmpi -lmpi++ | icpc myserial.cc |
| GNU Fortran | gfortran -fopenmp myopenmp.f90 | gfortran -pthread mypthread.f90 | gfortran mympi.f90 -lmpi | gfortran -fopenmp myhybrid.f90 -lmpi | gfortran myserial.f90 |
| GNU C | gcc -fopenmp myopenmp.c | gcc -pthread mypthread.c | gcc mympi.c -lmpi | gcc -fopenmp myhybrid.c -lmpi | gcc myserial.c |
| GNU C++ | g++ -fopenmp myopenmp.cc | g++ -pthread mypthread.cc | g++ mympi.cc -lmpi -lmpi++ | g++ -fopenmp myhybrid.cc -lmpi -lmpi++ | g++ myserial.cc |
Man pages are available for ifort, icc and icpc and for gfortran, gcc and g++.
You should use the system-supplied SGI version of MPI. Thus, you should not use OpenMPI in your programs, even if you install your own version of OpenMPI. If the SGI version of MPI is inadequate for your needs send email to email@example.com.
The UPC compiler is installed on blacklight. Online instructions are available for its use.
Two versions of Java are available on blacklight: an IBM version and a Sun version. You should load the module for the version you want to use. If you issue the command
module load java
you will load the Sun version. Once you load a Java module the Java compiler and interpreter will then be available to use. A java man page is also available.
Torque, an open source version of the Portable Batch System (PBS), controls all access to blacklight's compute processors, for both batch and interactive jobs. Currently blacklight has two queues: the batch queue and the debug queue. Interactive jobs can run in the debug queue and the batch queue and the method for doing so is discussed below.
Batch queue jobs that ask for 256 or fewer cores can ask for a maximum of 96 hours of walltime. Batch queue jobs that ask for more than 256 cores can ask for a maximum walltime of 48 hours.
The maximum walltime for jobs in the debug queue is 30 minutes. You must request 16 cores. The debug queue is not to be used for short production runs.
Jobs submitted to the batch queue are actually sent by the system into subqueues based on their walltime and core requests. You only submit jobs directly into the batch queue.
Jobs that ask for 1440 or fewer cores and 48 or fewer hours are slotted into the batch_r or batch_r1 subqueues. Jobs that ask for more than 48 hours of walltime are slotted into the batch_l or batch_l1 subqueues. Jobs that request more than 1440 cores are slotted to a separate queue where they receive special handling.
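The walltime limits and subqueue routing described above can be summarized as a small decision function. This is a sketch for checking a request before you submit it, not a description of how the scheduler is implemented. The function name and return strings are illustrative, and the multiple-of-sixteen check reflects the blade-based allocation used on blacklight.

```python
def batch_subqueue(ncpus, wall_hours):
    """Return the subqueue group a batch-queue request would be slotted into.

    Raises ValueError if the request breaks the limits stated above:
    96 hours for 256 or fewer cores, 48 hours for more than 256 cores.
    """
    if ncpus <= 0 or ncpus % 16 != 0:
        raise ValueError("ncpus must be a positive multiple of 16")
    if wall_hours <= 0:
        raise ValueError("wall_hours must be positive")

    max_hours = 96 if ncpus <= 256 else 48
    if wall_hours > max_hours:
        raise ValueError(
            f"{ncpus} cores may request at most {max_hours} hours of walltime"
        )

    if ncpus > 1440:
        return "separate queue (special handling)"
    if wall_hours > 48:
        return "batch_l or batch_l1"
    return "batch_r or batch_r1"


if __name__ == "__main__":
    print(batch_subqueue(16, 1))
    print(batch_subqueue(256, 72))
    print(batch_subqueue(1456, 24))
```

Note that only jobs of 256 or fewer cores can reach the long subqueues, because larger jobs are capped at 48 hours.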
You determine how much memory your job will be allocated through the value of your core request. The maximum number of cores a shared memory job can request is 2048. If you want to run a hybrid job using MPI and threads or a UPC job that needs more than 2048 cores send email to firstname.lastname@example.org to make special processing arrangements.
The batch queue is basically a FIFO queue. However, there are mechanisms in place to prevent a single user from dominating the batch queue and to prevent idle time on the machine. The result is some deviation from a strictly FIFO scheme.
There are suggestions below on how to improve your job turnaround. We will modify the scheduling policies on blacklight to meet user needs. If you have comments about the scheduling policies on blacklight or find that they do not meet your needs send email to email@example.com.
Sample batch jobs
To run a batch job on blacklight you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an OpenMP program is
    #!/bin/csh
    #PBS -l ncpus=16
    #ncpus must be a multiple of 16
    #PBS -l walltime=5:00
    #PBS -j oe
    #PBS -q batch
    set echo
    ja
    #move to my $SCRATCH directory
    cd $SCRATCH
    #copy executable to $SCRATCH
    cp $HOME/myopenmp .
    #run my executable
    setenv OMP_NUM_THREADS $PBS_NCPUS
    omplace -nt $OMP_NUM_THREADS ./myopenmp
    ja -chlst
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job.
The four #PBS lines are PBS directives.
- #PBS -l ncpus=16
This directive determines the number of cores allocated to your job. For performance reasons the actual allocation of resources is done by blades, with each blade containing sixteen cores. Jobs do not share blades. You must specify a value for ncpus that is a multiple of sixteen. Within your batch script the environment variable PBS_NCPUS is set to the value of ncpus.
Each blade has 128 Gbytes of physical memory. If your job exceeds the amount of physical memory available to it--a job with a ncpus value of 64 will run on 4 blades and thus have 512 Gbytes of memory available to it--it will be killed by the system with a message similar to
PBS: Job killed: cpuset memory_pressure 10562 reached/exceeded limit 1 (numa memused is 134200964 kb)
written to its stderr. A cpuset is the set of blades--cores and associated memory--assigned to your job. Memory pressure is a metric that indicates whether the processes on a blade are attempting to free up in-use memory on the blade to satisfy additional memory requests. Since this use of memory would result in significantly lower performance, a job that attempts to do this is killed by the system. For more information about cpusets and memory pressure see the man page man 4 cpuset.
If this happens to your job you should resubmit it and ask for more cores. The output from the ja command, which is discussed below, can help you determine how many blades your job needs. If asking for more cores does not resolve this issue contact firstname.lastname@example.org. When you do so include your jobids and numa memused value from the error message in your email to remarks.
Blacklight is a large-memory machine. Thus, the number of cores you select will often be determined by the amount of memory your program needs. Below is a table that gives the amount of memory available for representative numbers of requested cores.
| # Cores | Memory (Gbytes) |
|---|---|
| 16 | 128 |
| 64 | 512 |
| 256 | 2048 |
| 512 | 4096 |
| 1024 | 8192 |
| 1424 | 13952 |
These values are only examples. You can request any number of cores that is a multiple of sixteen.
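When memory drives your core request, you can work backwards from the memory your program needs to a valid ncpus value. The sketch below assumes the full 128 Gbytes of physical memory on each blade is usable by your job. In practice somewhat less may be available, so treat the result as a lower bound. The function name is hypothetical.

```python
import math

CORES_PER_BLADE = 16
GBYTES_PER_BLADE = 128


def ncpus_for_memory(gbytes_needed):
    """Smallest ncpus value (a multiple of 16) whose blades hold gbytes_needed."""
    if gbytes_needed <= 0:
        raise ValueError("gbytes_needed must be positive")
    blades = math.ceil(gbytes_needed / GBYTES_PER_BLADE)
    return blades * CORES_PER_BLADE


if __name__ == "__main__":
    for need in (100, 300, 512, 2048):
        print(f"{need} Gbytes -> ncpus={ncpus_for_memory(need)}")
```

For example, a program needing 300 Gbytes must request three blades, that is, ncpus=48.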
- #PBS -l walltime=5:00
The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
Your stdout and stderr files are each limited to 20 Mbytes if they are not redirected to a file. If your job exceeds either of these limits it will be killed by the system. If you have a program that you think will exceed either of these limits you should redirect either your stdout or stderr output or both to a $SCRATCH file.
- #PBS -q batch
The final PBS directive requests that your job be run in the batch queue.
The remaining lines in the script are comments and command lines.
- set echo
This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants use set -x instead.
- ja
The ja command turns on job accounting for your job. This allows you to obtain information on the elapsed time and memory and IO usage of your program, plus other data.
You must pair the command with another ja command at the end of your job. The option -t to this second ja command turns off job accounting and writes your accounting data to stdout. The other options to the second example ja command determine what output you will receive from ja. We recommend the -chls options because we think they will provide detailed but useful information about your job's processes. However, you can look at the man page for ja to see what reporting options you want to use.
There is no overhead to using ja. We strongly recommend that you use ja so you can understand the resource usage of your jobs, which you can use when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.
If your job terminates normally and you have included the -t option with your second ja command, your ja output is written to your job's stdout.
- Comment lines
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- setenv OMP_NUM_THREADS $PBS_NCPUS
This command sets the number of threads your OpenMP program will use. Normally, you will set the variable OMP_NUM_THREADS to $PBS_NCPUS. This will cause each of your OpenMP threads to run on its own core. However, blacklight's hardware has a feature known as hyperthreading. Each core on blacklight can run two threads. The use of hyperthreads can improve the performance of some programs, but it can also degrade the performance of other programs. On most programs it will have no effect. Thus, the use of hyperthreading is very application-dependent. To use hyperthreading you set the value of OMP_NUM_THREADS to twice the number of cores you requested with ncpus. The variable PBS_HT_NCPUS is set to this value.
You can also set the value of OMP_NUM_THREADS such that you have more than 2 threads running per core, but blacklight's hardware is designed so that the first 2 threads per core will use hyperthreading. Thus, 2 threads per core will probably provide better performance than more than 2 threads per core. However, this is also application-dependent. Therefore, you should test your application to see if you should use 1, 2 or more threads per core. The maximum number of threads per core you can currently request is 16. If you need more than 16 threads per core send email to email@example.com.
- omplace -nt $OMP_NUM_THREADS ./myopenmp
This command runs your executable. The omplace command ensures that your threads do not migrate across your cores.
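The choices described above for ncpus and OMP_NUM_THREADS can be sketched as a pair of small helpers. This is a minimal sketch for sh-type shells, not a blacklight utility: the function names ncpus_for_mem and omp_threads are hypothetical, and the arithmetic assumes 16 cores and 128 Gbytes per blade and the limit of 16 threads per core stated above. The memory actually available to your program may be somewhat less than the blade total.

```shell
# Hypothetical helpers, not blacklight utilities.

# Smallest ncpus value (a multiple of 16) whose blades hold at least
# mem_gb Gbytes, assuming 16 cores and 128 Gbytes per blade.
ncpus_for_mem() {
    mem_gb=$1
    blades=$(( (mem_gb + 127) / 128 ))
    echo $(( blades * 16 ))
}

# Thread count for ncpus cores at per_core threads per core
# (1 = one thread per core, 2 = hyperthreading; at most 16).
omp_threads() {
    ncpus=$1
    per_core=${2:-1}
    if [ "$per_core" -lt 1 ] || [ "$per_core" -gt 16 ]; then
        echo "threads per core must be between 1 and 16" >&2
        return 1
    fi
    echo $(( ncpus * per_core ))
}

NCPUS=$(ncpus_for_mem 200)                  # 200 Gbytes needs 2 blades
OMP_NUM_THREADS=$(omp_threads "$NCPUS" 2)   # hyperthreading
export OMP_NUM_THREADS
echo "ncpus=$NCPUS OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Inside a real job the core count would come from $PBS_NCPUS rather than from a calculation, and $PBS_HT_NCPUS already holds the hyperthreading value.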
A sample job to run a pthreads program is
#!/bin/csh
#PBS -l ncpus=16
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mypthread .
#run my executable
setenv OMP_NUM_THREADS $PBS_NCPUS
omplace -nt $OMP_NUM_THREADS ./mypthread
ja -chlst
This script is identical to the OpenMP script, except for the name of the executable. The information about thread counts is also the same as for OpenMP programs.
A sample job to run a Java program is
#!/bin/csh
#PBS -l ncpus=16
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
source /usr/share/modules/init/csh
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/MyJavaApp .
module load java
#run my executable
java -XX:ParallelGCThreads=16 MyJavaApp
ja -chlst
This script is similar to the OpenMP script. The source command is needed because your job is using the module command. In place of the omplace command you use the java interpreter to run your program.
The ParallelGCThreads option is used to specify the number of Java threads to generate for the purposes of garbage collection. Normally for performance reasons you will generate one thread per core, although this is application dependent. In this example you are asking for 16 threads, which will give one thread per core. If you specify two threads per core your program will use hardware hyperthreads. You can specify more than 2 threads per core up to a maximum of 16 threads per core. If you need to use more than 16 threads per core send email to firstname.lastname@example.org.
If you are using the IBM version of Java the option to request 16 threads would be
java -Xgcthreads16 MyJavaApp
The Java system call Runtime.getRuntime().availableProcessors() will always return 4096. To get the correct value for your number of cores you should instead call System.getenv("PBS_NCPUS") and multiply the returned value by two.
A sample job to run an MPI program is
#!/bin/csh
#PBS -l ncpus=16
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun -np $PBS_NCPUS ./mympi
ja -chlst
This script is identical to the OpenMP script except when you run your executable. You do not have to set the variable OMP_NUM_THREADS, but you do have to use the mpirun command to launch your executable on blacklight's compute processors. The value for the -np option is the number of your MPI tasks. You should normally set -np to $PBS_NCPUS. This will run each of your MPI tasks on its own core. If you want to use hyperthreading you should set the value of -np to $PBS_HT_NCPUS. This will run two hardware threads on a single core. Whether this will improve the performance of your MPI program is application-dependent. It is unlikely that setting the value of -np to a value greater than $PBS_HT_NCPUS will improve the performance of your MPI program. You cannot currently request more than 16 threads per core. If you need to use more than 16 threads per core send email to email@example.com.
You must use mpirun to run your MPI executable. In addition, MPI programs will only run on blacklight's compute nodes. Thus, they must be run using qsub.
A sample job to run a UPC program is
#!/bin/csh
#PBS -l ncpus=16
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
source /usr/share/modules/init/csh
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myupc .
#load the UPC module
module load sgi-upc
#run my executable
mpirun -np $PBS_NCPUS ./myupc
ja -chlst
This script is identical to the MPI script above--other than the loading of the sgi-upc module and the associated source command--because SGI UPC uses MPI in its implementation. Thus, the above discussions of the mpirun command and threads apply to UPC jobs.
Information on compiling UPC programs is available online. If you load the sgi-upc module a man page is also available.
A sample job to run a hybrid OpenMP and MPI program is
#!/bin/csh
#PBS -l ncpus=256
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myhybrid .
#run my executable
setenv OMP_NUM_THREADS 16
mpirun -np 16 omplace -nt 16 ./myhybrid
ja -chlst
This script is identical to the above two scripts except when you run your executable. You use a combination of the mpirun and omplace commands to run your hybrid program. The value of the -np option to the mpirun command is the number of your MPI tasks. The value of the -nt option to the omplace command is the number of your OpenMP threads per MPI task. The value of the -nt option and the value you set OMP_NUM_THREADS to must be the same. The product of the value of the -nt option and the -np option should be the value of your PBS ncpus specification, if you do not want to use hyperthreading. If you do want to use hyperthreading, the product of these two values should equal twice your ncpus value. To use hyperthreading you can use any values for -nt and -np as long as their product is twice your ncpus value. You can also choose values for -nt and -np such that their product is greater than twice your ncpus value. If you do this your first two threads per core will use hyperthreading. Which values you should select for -nt and -np is application dependent. You cannot currently request more than 16 threads per core. If you need to use more than 16 threads per core send email to firstname.lastname@example.org.
The omplace command ensures that your OpenMP threads do not migrate across your cores. You must use mpirun to run your hybrid executable or it will run on a frontend and degrade overall system performance.
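The relationship between -np, -nt and ncpus described above can be checked before you submit a job. The following is a minimal sketch for sh-type shells; the function name hybrid_threads_per_core is hypothetical, and treating a product that is not a whole multiple of ncpus as an error is a simplifying assumption rather than a blacklight rule.

```shell
# Hypothetical helper: report how many threads each core will run for a
# hybrid job with the given ncpus, MPI task count (-np) and OpenMP
# threads per task (-nt). 1 means one thread per core, 2 means
# hyperthreading.
hybrid_threads_per_core() {
    ncpus=$1; np=$2; nt=$3
    total=$(( np * nt ))
    # Assumption: only whole multiples of ncpus are accepted here.
    if [ $(( total % ncpus )) -ne 0 ]; then
        echo "np * nt = $total is not a multiple of ncpus = $ncpus" >&2
        return 1
    fi
    per_core=$(( total / ncpus ))
    if [ "$per_core" -gt 16 ]; then
        echo "$per_core threads per core exceeds the limit of 16" >&2
        return 1
    fi
    echo "$per_core"
}

hybrid_threads_per_core 256 16 16    # the sample script above
hybrid_threads_per_core 256 32 16    # twice ncpus, so hyperthreading
```

The first call corresponds to the sample script and reports one thread per core; the second doubles the number of MPI tasks and reports two.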
A sample job to run a serial program is
#!/bin/csh
#PBS -l ncpus=32
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myserial .
#run my executable
./myserial
ja -chlst
To run a serial program you just give the name of your program as a statement in your batch script. Since you are running a serial program hyperthreading is not an issue. You use your ncpus parameter to ask for the number of cores you need so that you have enough memory in the blades allocated to your job for your program to run. Your serial program will have access to all the memory in all the blades allocated to your job. In the above example your program will have access to 256 Gbytes of memory.
After you create your batch script you submit it to PBS with the qsub command.
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the first sample script above and submit the script with the command
qsub -l ncpus=16 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
Flexible walltime requests
Two other qsub options are available for specifying your job's walltime request.
-l walltime_min=HH:MM:SS -l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.
For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.
As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.
If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the Resource_List.walltime field of the output of the qstat -f command. The command
qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
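One way to capture the field is to filter the qstat output with awk. This is a minimal sketch for sh-type shells; the function name job_walltime is hypothetical, and the sketch assumes the field appears on a line of the form "Resource_List.walltime = HH:MM:SS". Check the output of qstat -f on blacklight before relying on it.

```shell
# Extract the Resource_List.walltime value from "qstat -f" output read
# on standard input. Assumes the field appears on a line of the form
#     Resource_List.walltime = HH:MM:SS
job_walltime() {
    awk -F' = ' '/Resource_List\.walltime/ { print $2; exit }'
}

# In a batch job you would run:  qstat -f $PBS_JOBID | job_walltime
# Here two sample lines stand in for the qstat output.
printf '    Resource_List.ncpus = 64\n    Resource_List.walltime = 03:00:00\n' | job_walltime
```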
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit so that the file writing is completed when the limit is reached. The functions MPI_Wtime and omp_get_wtime can be used to track how long your program has been running, so that it can write its checkpoint files in time to save the results of its processing.
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following
timeout --timeout=$PROGRAM_TIME -- mpirun -np 32 ./mympi
The example assumes that your script has retrieved your job's walltime, converted it to seconds--values given to timeout must be in seconds--subtracted 600 from it and assigned the value of 3000 to the variable PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
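The conversion to seconds and the subtraction can be sketched as follows for sh-type shells. The function name walltime_to_seconds and the variables WALLTIME and RESERVE are hypothetical; in a real job WALLTIME would hold the walltime your job was actually assigned.

```shell
# Convert a walltime of the form HH:MM:SS, MM:SS or SS to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{
        s = 0
        for (i = 1; i <= NF; i++) s = s * 60 + $i
        print s
    }'
}

# Illustration: a 1-hour walltime with 10 minutes held back for file transfer.
WALLTIME=1:00:00
RESERVE=600
PROGRAM_TIME=$(( $(walltime_to_seconds "$WALLTIME") - RESERVE ))
echo "$PROGRAM_TIME"
```

With these illustrative values PROGRAM_TIME is 3000, the value used in the timeout example.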
For more information on the timeout command see the timeout man page. If you want assistance on the procedures needed to capture your job's actual walltime or to determine when your job should write checkpoint files send email to email@example.com.
How to improve your turnaround
We have several suggestions for how to improve your job turnaround. First, you should try to be as accurate as possible in estimating the walltime request for your job. Asking for more time than your job will actually need will almost certainly result in poorer turnaround for your job. Thus, routinely asking for the maximum walltime allowed for a job will almost always result in poorer turnaround.
Our second recommendation is that you always use flexible walltime requests if possible. This is especially helpful if your minimum walltime in your pair of walltime values is less than 8 hours.
Finally, due to system limitations, we must limit the number of concurrent 16-core jobs on blacklight. Since the number of queued 16-core jobs usually is above this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job. How to pack jobs is discussed in the section below on "Packing jobs."
A form of interactive access is available on blacklight by using the -I option to qsub. For example, the command
qsub -I -l ncpus=16 -l walltime=5:00
requests interactive access to 16 cores for 5 minutes. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
X-11 connections in interactive use
In order to use any X-11 tool, you must also include -X on the qsub command line:
qsub -X -I -l ncpus=16 -l walltime=5:00
This assumes that the DISPLAY variable is set. Two ways in which DISPLAY is automatically set for you are:
- Connecting to blacklight with ssh -X blacklight.psc.teragrid.org
- Enabling X-11 tunneling in your Windows ssh tool
Fluent and TAU are among the packages which require X-11 connections.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job. You should specify your full Internet email address when using the -M option.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on blacklight.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge_id you want a job to be charged. If you only have one grant on blacklight you do not need to use this option; otherwise, you should charge each job to the appropriate grant.
You can see your valid charge_ids by typing
groups
at the blacklight prompt. Typical output will look like
sy2be6n ec3l53p eb3267p jb3l60q
Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify -W group_list for your job, this is the grant that will be charged.
If you want to switch your default charge_id, send email to firstname.lastname@example.org.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after: this job can be scheduled after job jobid begins execution.
afterok: this job can be scheduled after job jobid finishes successfully.
afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
afterany: this job can be scheduled after job jobid finishes in any state.
before: this job must begin execution before job jobid can be scheduled.
beforeok: this job must finish successfully before job jobid begins.
beforenotok: this job must finish unsuccessfully before job jobid begins.
beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
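The afterok dependency can be used to run a series of jobs in order. The following is a minimal sketch for sh-type shells. The function name submit_chain, the QSUB variable and the script names are hypothetical, and the sketch assumes that qsub writes the jobid of the submitted job to its standard output.

```shell
# Submit job scripts so that each one is scheduled only after the
# previous one finishes successfully. QSUB can be overridden for a dry run.
QSUB=${QSUB:-qsub}

submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no dependency.
            prev=$($QSUB "$script") || return 1
        else
            prev=$($QSUB -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$script submitted as $prev"
    done
}

# Usage (script names are placeholders):
#   submit_chain step1.job step2.job step3.job
```

If an earlier job fails, the jobs submitted with afterok on it will not be started, so you should check the queue with qstat after a failure.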
Running many small jobs places a great burden on the scheduler and is probably inconvenient for you. An alternative is to pack many executions into a single job, which you then submit to PBS with a single qsub command. The basic method to use to pack jobs is to run each program execution in the background and place a wait command after all your executions. A sample job to pack serial executions is
#!/bin/csh
#PBS -l ncpus=128
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executables and input files to $SCRATCH
cp $HOME/myserial* .
cp $HOME/serial* .
#run my executables
dplace -c 0 ./myserial1 < serial1.dat &
dplace -c 32 ./myserial2 < serial2.dat &
dplace -c 64 ./myserial3 < serial3.dat &
dplace -c 96 ./myserial4 < serial4.dat &
wait
ja -chlst
Each serial execution will run on 2 blades. The dplace command ensures that each execution will run on its own set of 2 blades. The executions will run concurrently. This same approach using the dplace command can be used to pack jobs with MPI executables.
To pack a job with executables that use threads such as OpenMP executables you should replace the dplace command with the omplace command. A sample job to pack OpenMP executables is
#!/bin/csh
#PBS -l ncpus=128
#ncpus must be a multiple of 16
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executables and input files to $SCRATCH
cp $HOME/myopen* .
#run my executables
omplace -nt 32 -c 0 ./myopenmp1 < myopenmp1.dat &
omplace -nt 32 -c 32 ./myopenmp2 < myopenmp2.dat &
omplace -nt 32 -c 64 ./myopenmp3 < myopenmp3.dat &
omplace -nt 32 -c 96 ./myopenmp4 < myopenmp4.dat &
wait
ja -chlst
Packing jobs is especially useful to do if you are running 16-core jobs. Due to system limitations, we must limit the number of concurrent 16-core jobs on blacklight. Since the number of queued 16-core jobs usually exceeds this limit, if you are running 16-core jobs, it is to your advantage to pack multiple 16-core executions into a single job.
If you have questions about packing jobs send email to email@example.com. For more information about dplace and omplace see the man pages for dplace and omplace.
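When you pack many executions, the starting core for each one follows a regular pattern that can be generated rather than typed. The following is a minimal sketch for sh-type shells. The function name core_offsets is hypothetical, and the loop only prints the command lines it would run, using the file names from the serial example above.

```shell
# Hypothetical helper: print the starting core for each packed execution,
# given the job's ncpus and the number of cores reserved per execution.
core_offsets() {
    ncpus=$1; per_exec=$2
    offset=0; list=""
    while [ $(( offset + per_exec )) -le "$ncpus" ]; do
        list="$list $offset"
        offset=$(( offset + per_exec ))
    done
    # Unquoted on purpose so the leading space is dropped.
    echo $list
}

# Print the commands a 128-core job would run, one execution per 2 blades.
i=1
for c in $(core_offsets 128 32); do
    echo "dplace -c $c ./myserial$i < serial$i.dat &"
    i=$(( i + 1 ))
done
echo "wait"
```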
Monitoring and Killing Jobs
The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. To get the actual number of cores a job is using you must divide the displayed value by two. For running jobs it shows the amount of walltime the job has already used.
The commands qstat -s, qstat -f and pbsnodes -a can be used to give status information about the system and your jobs. The comments that these commands provide can be used to determine why your jobs have not started. The qstat -f command takes a jobid as an argument.
The qdel command is used to kill queued and running jobs. Its general form is
qdel jobid
The argument to qdel is the jobid of the job you want to kill. You are shown the jobid when you submit your job, or you can get it with the qstat command. If you cannot kill a job you want to kill send email to firstname.lastname@example.org.
Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to solve code problems if you are using fewer processors. This strategy should be followed even if you are porting a working code from another system.
Do not run a debugging run on any of blacklight's front ends. You should always run a blacklight program with qsub.
The idb and gdb debuggers are available on blacklight. The gdb debugger has a man page. Information for idb is available online. This online documentation has links to more idb reference material. Send email to email@example.com if you want another debugger to be installed.
Several compiler options can be useful to you when you are debugging your program. If you use the -g option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the -g option or your program will run slower.
The -check bounds option to the ifort compiler will cause your program to tell you if it exceeds an array bounds while running.
Variables on blacklight are not automatically initialized. This can cause your program to fail if it relies on variables being initialized. The -check uninit and -ftrapuv options to the ifort compiler will catch certain cases of uninitialized variables, as will the -Wall and -O options to the GNU compilers.
There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.
The key to making core files on blacklight is to allow them to be written by increasing the maximum file size allowable for core files. The default size is 0 bytes. If you are using sh-type shells you do this by issuing the command
ulimit -c unlimited
For csh-type shells you issue the command
limit coredumpsize unlimited
Core files are created in directory ~/tmp. For more information about core files issue the command
man 5 core
Little endian versus big endian
The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Blacklight is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command
echo 'puts $tcl_platform(byteOrder)' | tclsh
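If Tcl is not installed, the same check can be made with the standard od utility. This is a minimal sketch for sh-type shells; the function name byte_order is hypothetical, and it prints the same two words that Tcl uses.

```shell
# Report the byte order of the current machine without Tcl.
# The two bytes 01 00 read as one 16-bit unsigned integer give 1 on a
# little endian machine and 256 on a big endian machine.
byte_order() {
    value=$(printf '\001\000' | od -An -tu2 | tr -d '[:space:]')
    if [ "$value" = "1" ]; then
        echo littleEndian
    else
        echo bigEndian
    fi
}

byte_order
```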
You can read a big endian file on blacklight if you are using the Intel ifort compiler. Before you run your program issue the command
setenv FORT_CONVERTn big_endian
for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.
You can calculate your code's Mflops rate using the TAU utility. The TAU examples show how to determine timing data and floating point operation counts for your program, from which you can calculate your Mflops rate.
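The calculation itself is a single division: the Mflops rate is the floating point operation count divided by the elapsed time in seconds and by one million. A minimal sketch for sh-type shells follows. The function name mflops is hypothetical and the numbers are illustrative only, not measurements from blacklight.

```shell
# Mflops = floating point operations / (elapsed seconds * 10^6)
mflops() {
    awk -v ops="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", ops / (secs * 1e6) }'
}

# Illustrative numbers only: 4.5e9 operations in 2.5 seconds.
mflops 4500000000 2.5
```

For comparison, the 9 Gflops that a single blacklight core can perform corresponds to 9000 Mflops.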
Cache performance can have a significant impact on the performance of your program. Each blacklight core has three levels of cache. The primary data and instruction caches are 32 Kbytes each. The L2 cache is 256 Kbytes. The L3 cache, which is shared by the 8 cores on a processor, is 24 Mbytes. When hyperthreading is enabled the two threads on a core share the L1 and L2 caches.
You can measure your program's cache miss rate for each of the available caches by setting the appropriate counters when using the TAU utility. If you need assistance in measuring or improving your cache performance send email to firstname.lastname@example.org.
Blacklight's Nehalem processors have a feature referred to as Turbo Boost. Under certain workload conditions its processor cores can automatically and dynamically run faster than their base clock rate of 2.27 GHz. Although the activation of the Turbo Boost feature is application dependent, we have found that it is most often activated when only a few cores per processor are being used, because its activation depends on the processor's power consumption and temperature.
Collecting timing data
Collecting timing data is essential for measuring and improving program performance. We recommend five approaches for collecting timing data. The ja and /usr/bin/time utilities can be used to collect data at the program level. They report results to the hundredths of seconds. The TAU utility and the omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds, while the precision for MPI_Wtime is nanoseconds.
Blacklight's operating system creates a file system out of its blade memory. Thus, your program can perform IO to blade memory rather than to disk. Memory IO is several orders of magnitude faster than disk IO. However, each blacklight job can only perform memory IO to the blades associated with that job. A job cannot write to the memory of blades assigned to other jobs.
The environment variable $SCRATCH_RAMDISK is set to point to the memory associated with each job. Unlike $SCRATCH, this variable is given a new value for each job. Otherwise, this variable can be treated like $SCRATCH. From within your job, you can cd to it, you can copy files to and from it, and you can use it to open files.
Memory IO is faster than disk IO, but it does have disadvantages. Each job's memory filespace is cleared whenever the job terminates, whether normally or abnormally. Thus, if you are using memory IO you must copy your memory files back from $SCRATCH_RAMDISK before your job ends or the files are lost. If your job terminates abnormally your files will be lost. Moreover, memory IO is limited in size relative to disk IO. Each job can only use the memory associated with that job. Furthermore, memory IO is limited to the memory available after memory is allocated for your program. Therefore, the use of memory files is best suited to IO-intensive jobs that perform IO to lots of small files.
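The stage-in, run and copy-back pattern described above can be sketched as follows for sh-type shells. The function name run_in_ramdisk and its arguments are hypothetical. The sketch copies the result file back even when the program exits with an error, but it cannot help if the job itself is killed before the copy runs.

```shell
# Stage a file into the job's memory file system, run a command there,
# and copy a result file back to $SCRATCH before the job ends.
# Usage: run_in_ramdisk input_file output_name command [args...]
run_in_ramdisk() {
    input=$1
    output=$2
    shift 2
    cp "$input" "$SCRATCH_RAMDISK"/ || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$SCRATCH_RAMDISK" && "$@" )
    status=$?
    # Copy the result back to disk if the program produced one.
    if [ -f "$SCRATCH_RAMDISK/$output" ]; then
        cp "$SCRATCH_RAMDISK/$output" "$SCRATCH"/
    fi
    return $status
}

# Usage inside a batch job (file and program names are placeholders):
#   run_in_ramdisk $HOME/input.dat results.out $HOME/myserial
```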
If your program reads or writes large files you should use $SCRATCH. Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and thus can significantly improve its performance. File striping can be used to tune your parallel IO performance and is particularly effective for files that are 1 Gbyte or larger.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. This is how you can use Lustre as a parallel file system.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, striping has other disadvantages besides possible IO contention, and these must be considered when making your striping decisions. Many interactive file commands such as ls -l or unlink will take longer for striped files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is
lfs setstripe filename -c stripe-count
A value of -1 for the stripe count means the file should be spread across all the available OSTs.
For example, the command
lfs setstripe bigfile.out -c -1
sets the stripe count for bigfile.out to be all available OSTs. The command
lfs setstripe manyfiles.out -c 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application. A value of -1 for stripe count will probably give you the best performance if you are going to use file striping, but you should try several values. The maximum value you can give for stripe count on blacklight is currently 8.
There is a man page for lfs on blacklight. Online documentation for Lustre is also available. If you want assistance in choosing a striping strategy, send email to email@example.com.
Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.
Performance monitoring tools
We have installed several performance monitoring tools on blacklight. The TAU utility is a performance profiling and tracing tool. The PAPI utility can be used to access the hardware performance counters on blacklight. We intend to install more performance tools on blacklight. If you want assistance in using any of these tools, or have a utility you would like us to install, send email to firstname.lastname@example.org.
Performance improvement assistance: the Memory Advantage Program
Blacklight is a very large hardware-coherent shared memory machine. Blacklight is thus suitable for a range of memory-intensive computations that cannot readily be deployed on a distributed-memory machine.
PSC has established the Memory Advantage Program (MAP) to enable users to take advantage of blacklight's unique capabilities. MAP includes consulting assistance from PSC, special queue handling if necessary and service unit discounts.
To participate in MAP you should send an email to email@example.com with a description of your scientific problem and any information you have on how effectively your program is currently using blacklight's shared memory. A PSC Scientific Specialist will then contact you to troubleshoot problems, provide advice on the use of debugging and performance analysis tools and procedures, and offer suggestions on fixes and optimizations. During this consultation process you will be able to make benchmarking, debugging and test runs at a 50% discount for a period of up to 4 weeks.
You can also send email to firstname.lastname@example.org if you want optimization assistance in areas other than memory usage.
Software Packages
A list of software packages installed on blacklight is available. If you would like us to install a package that is not in this list, send email to email@example.com.
The Module Command
Before you can run many software packages, you must define paths and other environment variables. To use a different version of a package, these definitions often have to be modified. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module documentation.
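A typical session with the module command might look like the sketch below. The package name examplepkg is a placeholder, not a package known to be installed on blacklight; use module avail to see the real names. The guard around the commands is only there so the sketch does nothing harmful in a shell where module is not defined.

```shell
# Sketch of a typical module session; "examplepkg" is a placeholder name.
pkg=examplepkg

if type module >/dev/null 2>&1; then
    module avail              # list the packages that have module files
    module load "$pkg"        # define the paths and variables for the package
    module list               # show the modules currently loaded
    module unload "$pkg"      # remove those definitions again
else
    echo "module command not found in this shell"
fi
```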
Blacklight and XSEDE
Blacklight is an XSEDE resource. You therefore have access to the XSEDE methods of connecting to blacklight and to the XSEDE methods of transferring files between your local machine and blacklight or arc. For information on using XSEDE, see the general online documentation for XSEDE.
Stay Informed
As a user of blacklight, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently.
You will also periodically receive email from PSC with information about blacklight. To ensure that you receive this email, make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on blacklight be sent to the PSC Allocations Coordinator. In addition, if your research was funded by the NSF you should log your publications at the XSEDE Portal. We also request that you include an acknowledgement of PSC in your publication.
Reporting a Problem
You have several options for reporting problems on blacklight.
- If you are an XSEDE user you can send email to firstname.lastname@example.org, mentioning PSC in the subject line. You will get an acknowledgement from the XSEDE Operations Center, and then you will be contacted by PSC staff.
- If you are a non-XSEDE user you can send email to email@example.com.
- You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.