Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

Salk

Salk is an SGI Altix SMP machine with 144 cores and 288 Gbytes of shared memory, dedicated to biomedical research.  Salk is named for Dr. Jonas Salk, developer of the first inactivated polio vaccine.
 

System Configuration

Hardware

Salk is an SGI Altix 4700 shared-memory NUMA system comprising 36 blades. Each blade holds 2 Itanium2 Montvale 9130M dual-core processors, and the four cores on a blade share 8 Gbytes of local memory. The processors are connected by a NUMAlink interconnect, through which the local memory on each blade is accessible to all the other processors. Each processor runs an enhanced version of the SuSE Linux operating system.

There are multiple frontend processors, which are also Itanium2 processors and which run the same version of SuSE Linux as the compute processors. You log in to one of these frontend processors, not to the compute processors.

Software

The Intel C, C++ and Fortran compilers and the Gnu C and C++ compilers are installed on salk, as are the facilities to enable you to run OpenMP, MPI and hybrid OpenMP and MPI programs.

Access

Getting an account on salk

To get an account on salk you must fill out the appropriate online form linked from the NRBSC resources page.

Connecting to salk

To connect to salk you must ssh to salk.psc.edu. When you are prompted for a password enter your PSC Kerberos password.

Changing your password

There are two ways to change or reset your PSC Kerberos password: the online password change utility and the kpasswd command on a PSC system.

You have the same password on all PSC production platforms. When you change your password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.

PSC Kerberos passwords must be at least 8 characters in length. They must also contain characters from at least three of the following character classes:

  1. lower-case letters
  2. upper-case letters
  3. digits
  4. special characters, excluding ' and "

Finally, they must not be the same as any of your previous passwords.
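The length and character-class rules above can be sketched as a small shell check. This is illustrative only; the authoritative validation is done by PSC's Kerberos servers, and the sample password is made up:

```shell
# Check a candidate password: at least 8 characters and characters
# from at least three of: lower-case letters, upper-case letters,
# digits, special characters
check_password() {
  pw=$1
  classes=0
  case $pw in *[a-z]*) classes=$((classes+1));; esac
  case $pw in *[A-Z]*) classes=$((classes+1));; esac
  case $pw in *[0-9]*) classes=$((classes+1));; esac
  case $pw in *[!a-zA-Z0-9]*) classes=$((classes+1));; esac
  [ ${#pw} -ge 8 ] && [ $classes -ge 3 ]
}
check_password 'Salk#2024' && echo ok   # ok: 9 chars, 4 classes
```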

You must change your salk password within 30 days of the date on your initial password form or your password will be disabled. We will also disable your password if you do not change it at least once a year; in that case we will send you an email warning that your password is about to be disabled. See the PSC password policies for more information. If your password is disabled, send email to PSC user services to have it reset.

Changing your login shell

You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.

Accounting on salk

One core-hour on salk is one SU.
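Since one core-hour is one SU, the charge for a job is simply cores times hours. A minimal sketch with illustrative numbers:

```shell
# One SU per core-hour: a 16-core job that runs for 3 hours
# is charged 16 x 3 = 48 SUs
cores=16
hours=3
sus=$(( cores * hours ))
echo "$sus SUs"   # prints 48 SUs
```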

If you have more than one account, use the qsub option -W group_list to indicate to which account you want a job to be charged. The use of this option is discussed in the "Other qsub options" subsection of this document. To change your default account you must send email to PSC user services with this request.

User accounting data is available with the xbanner command. It displays account information including the initial SU allocation for a grant, the number of unused SUs remaining for the grant and the date of the last job charged to the grant.

Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. It provides more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.

Storing Files

File Systems

File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on salk.

$HOME

This is your home directory. Your $HOME directory has a 5-Gbyte quota. $HOME is visible to all of salk's compute and frontend processors. $HOME is backed up daily, although it is still a good idea to store copies of your important $HOME files in the Data Supercell. The Data Supercell, PSC's file archival system, is discussed below.

$SCRATCH

This is salk's scratch area to be used as a working space for your running jobs. $SCRATCH is visible to all of salk's compute and frontend processors. You should use the name $SCRATCH to refer to your scratch area since we may change its implementation.

$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 7 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to the Data Supercell as soon as you can after you create them. The Data Supercell (patent pending), PSC's file archival system, is discussed below.

File Repositories

File repositories are discrete file storage spaces. You cannot, for example, open a file that resides in a file repository, run a program on one, or log in to one. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on salk: the Data Supercell, PSC's file archival system.

The Data Supercell (patent pending)
The Data Supercell is a disk-based archival system.

Transferring Files

You can use either the scp or the kftp program to transfer files between your remote machine and salk. Which method performs better varies by location, so you should try both approaches and see which performs better for you. If you want assistance in improving the performance of your file transfers, send email to PSC user services.

Creating Programs

The Intel C, C++ and Fortran compilers and the Gnu C and C++ compilers are installed on salk and they can be used to create OpenMP, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.

                OpenMP                      MPI                           Hybrid                                    Serial
Intel Fortran   ifort -openmp myopenmp.f    ifort mympi.f -lmpi           ifort -openmp myhybrid.f -lmpi            ifort myserial.f
Intel C         icc -openmp myopenmp.c      icc mympi.c -lmpi             icc -openmp myhybrid.c -lmpi              icc myserial.c
Intel C++       icpc -openmp myopenmp.cc    icpc mympi.cc -lmpi -lmpi++   icpc -openmp myhybrid.cc -lmpi -lmpi++    icpc myserial.cc
Gnu C           gcc -fopenmp myopenmp.c     gcc mympi.c -lmpi             gcc -fopenmp myhybrid.c -lmpi             gcc myserial.c
Gnu C++         g++ -fopenmp myopenmp.cc    g++ mympi.cc -lmpi -lmpi++    g++ -fopenmp myhybrid.cc -lmpi -lmpi++    g++ myserial.cc

Man pages are available for ifort, icc and icpc and for gcc and g++.

The UPC compiler is also installed on salk. Online instructions for its use are available.

A native Java compiler and interpreter are available on salk. Issue the command

 

    module load jrockit/5.0

to get access to the javac and java commands.

Running Jobs

Queue structure

Torque, an open source version of the Portable Batch Scheduler (PBS), controls all access to salk's compute processors, for both batch and interactive jobs. Currently salk has two queues: the batch queue and the debug queue. Interactive jobs can run in the batch queue and the debug queue and the method for doing so is discussed below.

The maximum walltime for the batch queue is 48 hours and the maximum number of cores you can request is 132. The maximum walltime for the debug queue is 30 minutes and the maximum number of cores you can request is 8.

We plan to create several other queues to meet user needs. If you would like to make a suggestion about salk's queue structure, send email to PSC user services.

Scheduling policies

The batch and debug queues are currently FIFO queues with mechanisms in place to prevent a single user from dominating either queue and to prevent idle time on the machine. The result is some deviation from a strictly FIFO scheme. We will modify the scheduling policies on salk to meet user needs. If you have suggestions or comments about the scheduling policies on salk, or find that they do not meet your needs, send email to PSC user services.

Sample batch jobs

To run a batch job on salk you submit a batch script to the PBS system. A PBS job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.

A sample job script to run an OpenMP program is

 

#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myopenmp .
#run my executable
setenv OMP_NUM_THREADS 4
./myopenmp
ja -chlst

The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command, then you must use the -l option to your shell command.

The four #PBS lines are PBS directives.

#PBS -l ncpus=4

This directive specifies the number of cores to allocate for the job. For performance reasons cores are allocated by blade, with each blade containing four cores, so the value of ncpus must be a multiple of four or the job submission will fail. Jobs do not share blades. Within your batch script the environment variable PBS_NCPUS is set to the number of cores you requested.

Each blade has 8 Gbytes of physical memory. If your job exceeds the amount of physical memory available to it--a job requesting 16 cores will run on 4 blades and thus have 32 Gbytes of memory available to it--it will be killed by the system with a message similar to

PBS: Job killed: cpuset memory_pressure X reached/exceeded limit 1

written to its stderr. A cpuset is the set of blades--cores and associated memory--assigned to your job. Memory pressure is a metric that indicates whether the processes on a blade are attempting to reclaim memory that is already in use in order to satisfy additional memory requests. Since this would result in significantly lower performance, a job that attempts it is killed by the system. For more information about cpusets and memory pressure see the cpuset man page: man 4 cpuset.

If this happens to your job you should resubmit it and ask for more cores. The output from the ja command, which is discussed below, can help you determine how many blades your job needs.
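The blade arithmetic above can be sketched as follows; the 16-core request is illustrative:

```shell
# Cores are allocated in blades of 4, and each blade carries
# 8 Gbytes of physical memory, so a 16-core request gets
# 4 blades and 32 Gbytes
ncpus=16
blades=$(( ncpus / 4 ))
mem_gbytes=$(( blades * 8 ))
echo "$ncpus cores -> $blades blades -> $mem_gbytes Gbytes"
```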

#PBS -l walltime=5:00

The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.

#PBS -j oe

The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.

Your stdout and stderr files are each limited to 20 Mbytes. If your job exceeds either of these limits it will be killed by the system. If you have a program that you think will exceed either of these limits you should redirect your stdout or stderr output to a $SCRATCH file. Another option is to run your job from $SCRATCH.

#PBS -q batch

The final PBS directive requests that your job be run in the batch queue.

The remaining lines in the script are comments and command lines.

set echo

This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants use

set -x

instead.

ja

The ja command turns on job accounting for your job. This allows you to obtain information on the elapsed time and memory and IO usage of your program, plus other data.

You must pair this command with another ja command at the end of your job. The -t option to this second ja command, included in the -chlst options shown in the sample script, turns off job accounting and writes your accounting data to stdout. The other options determine what output you will receive from ja. We recommend these options because they provide detailed but useful information about your job's processes; see the ja man page for the full set of reporting options.

There is no overhead to using ja. We strongly recommend that you use ja so you can understand the resource usage of your jobs, which you can use when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.

Comment lines

The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.

setenv OMP_NUM_THREADS 4
This command sets the number of threads you want your OpenMP program to use. You should set this value to the number of cores you requested with your PBS ncpus directive so each of your threads will run on its own core.
./myopenmp
This command runs your executable.

A sample job to run an MPI program is

#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun -np 4 ./mympi
ja -chlst

This script is identical to the OpenMP script except when you run your executable. You do not have to set the variable OMP_NUM_THREADS, but you have to use the mpirun command to launch your executable on salk's compute processors. The value for the -np option is the number of cores you want your program to run on. You should set -np to the number of cores you requested with your PBS ncpus directive. You must use mpirun to run your MPI executable or it will run on a frontend and degrade overall system performance.

A sample job to run a hybrid OpenMP and MPI program is

#!/bin/csh
#PBS -l ncpus=64
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myhybrid .
#run my executable
mpirun -np 16 omplace -nt 4 ./myhybrid
ja -chlst

This script is identical to the above two scripts except when you run your executable. You use a combination of the mpirun and omplace commands to run your hybrid program. The value of the -np option to the mpirun command is the number of your MPI tasks. The value of the -nt option to the omplace command is the number of your OpenMP threads per MPI task. The product of these two values should be the total number of cores you requested with your PBS ncpus specification.
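The required relationship between -np, -nt and ncpus can be checked with a little arithmetic; the values match the sample script above:

```shell
# 16 MPI tasks x 4 OpenMP threads per task = 64 cores,
# matching the ncpus=64 request in the sample hybrid script
np=16
nt=4
total_cores=$(( np * nt ))
echo "$total_cores"   # prints 64
```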

The omplace command ensures that each of your OpenMP threads runs on its own core. You must use mpirun to run your hybrid executable or it will run on a frontend and degrade overall system performance.

Qsub command

After you create your batch script you submit it to PBS with the qsub command.

    qsub myscript.job

Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.

You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample scripts and submit the scripts with the command

    qsub -l ncpus=4 -l walltime=5:00 -j oe -q batch myscript.job

Command-line directives override directives in your scripts.

Flexible walltime requests

Two other qsub options are available for specifying your job's walltime request.

-l walltime_min=HH:MM:SS
-l walltime_max=HH:MM:SS

You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.

For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.

As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.

Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.

If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the Resource_List.walltime field of the output of the qstat -f command. The command

 qstat -f $PBS_JOBID

will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
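One way to extract the field is with awk. This sketch runs on a saved sample line; in a real job script you would pipe the output of qstat -f $PBS_JOBID instead:

```shell
# Extract the assigned walltime from a line of qstat -f output.
# The sample line stands in for the real qstat output.
qstat_line="    Resource_List.walltime = 03:00:00"
walltime=$(echo "$qstat_line" | awk -F' = ' '/Resource_List.walltime/ {print $2}')
echo "$walltime"   # prints 03:00:00
```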

You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit so that the file writing is completed when the limit is reached. The functions mpi_wtime and omp_get_wtime can be used to track how long your program has been running so that it writes checkpoint files to make sure you save results from your program's processing.

You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following

timeout --timeout=$PROGRAM_TIME -- mpirun -np 4 ./mympi

The example assumes that your script has retrieved your job's walltime, subtracted 10 minutes from it and assigned the value of 3000 to the variable PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
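The arithmetic from the example can be sketched as follows; the 1-hour walltime and 10-minute file-transfer reserve are the values assumed above:

```shell
# Convert a 1:00:00 walltime to seconds, then subtract a
# 10-minute (600-second) reserve for end-of-job file transfer
h=1; m=0; s=0
walltime_seconds=$(( h * 3600 + m * 60 + s ))
PROGRAM_TIME=$(( walltime_seconds - 600 ))
echo "$PROGRAM_TIME"   # prints 3000
```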

For more information on the timeout command see the timeout man page. If you want assistance with the procedures needed to capture your job's actual walltime, or with determining when your job should write checkpoint files, send email to PSC user services.

Interactive access

A form of interactive access is available on salk by using the -I option to qsub. For example, the command

    qsub -I -l ncpus=4 -l walltime=5:00 -q debug

requests interactive access to 4 cores for 5 minutes in the debug queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.

When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in batch script.

When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.

X-11 connections in interactive use

In order to use any X-11 tool, you must also include -X on the qsub command line:

qsub -X -I -l ncpus=4 -l walltime=5:00 -q debug

This assumes that the DISPLAY variable is set. Two ways in which DISPLAY is automatically set for you are:

  1. Connecting to salk with ssh -X salk.psc.edu
  2. Enabling X-11 tunneling in your Windows ssh tool

Totalview, Fluent and TAU are among the packages which require X-11 connections.

Other qsub options

Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.

-m a|b|e|n
Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent; this is the default.
-M userlist
Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
-v variable_list
This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on salk.
-r y|n
Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
-W group_list=charge_id
Indicates to which charge_id you want a job to be charged. If you only have one grant on salk you do not need to use this option; otherwise, you should charge each job to the appropriate grant.

You can see your valid charge_ids by typing groups at the salk prompt. Typical output will look like

sy2be6n ec3l53p eb3267p jb3l60q

Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify -W group_list for your job, this is the grant that will be charged.
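Picking out the default charge_id can be sketched with awk; the groups output shown is the sample from above:

```shell
# The first group reported by the groups command is the default
# charge_id; a saved copy of typical output is used here.
groups_output="sy2be6n ec3l53p eb3267p jb3l60q"
default_charge_id=$(echo "$groups_output" | awk '{print $1}')
echo "$default_charge_id"   # prints sy2be6n
```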

If you want to switch your default charge_id, send email to PSC user services.

-W depend=dependency:jobid
Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:

  after        this job can be scheduled after job jobid begins execution.
  afterok      this job can be scheduled after job jobid finishes successfully.
  afternotok   this job can be scheduled after job jobid finishes unsuccessfully.
  afterany     this job can be scheduled after job jobid finishes in any state.
  before       this job must begin execution before job jobid can be scheduled.
  beforeok     this job must finish successfully before job jobid begins.
  beforenotok  this job must finish unsuccessfully before job jobid begins.
  beforeany    this job must finish in any state before job jobid begins.

Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
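A sketch of chaining two jobs with an afterok dependency; the jobid 54.salk is a placeholder, since on salk it would come from the first qsub:

```shell
# Build the dependent qsub command from a captured jobid.
# On salk the real id would be captured with: jobid=$(qsub first.job)
jobid=54.salk
cmd="qsub -W depend=afterok:$jobid second.job"
echo "$cmd"   # prints qsub -W depend=afterok:54.salk second.job
```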

Monitoring and Killing Jobs

The qstat -a command displays the status of the PBS queues, showing both running and queued jobs. For each job it shows the requested walltime and number of cores; for running jobs it also shows the walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.

The qdel command is used to kill queued and running jobs. An example is the command

    qdel 54

The argument to qdel is the jobid of the job you want to kill, which is shown when you submit your job; you can also get it with the qstat command. If you cannot kill a job, send email to PSC user services.

Debugging

Debugging strategy

Your first few runs should be on a small version of your problem, not your largest problem size. It is easier to solve code problems when you are using fewer processors. Follow this strategy even if you are porting a working code from another system.

You should use the debug queue for your debugging runs. Do not run a debugging run on any of salk's front ends. You should always run a salk program with qsub.

The debug queue is intended to be used in the classic debugging cycle in which you run a debugging job, check its output and then submit another debugging job. You should not flood the debug queue with jobs nor should you chain your jobs through the debug queue by having a debug job submit its successor.

The debug queue should not be used for production runs that use only a few processors.

Compiler options

Several compiler options can be useful to you when you are debugging your program. If you use the -g option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the -g option or your program will run slower.

The -check bounds option to the ifort compiler will cause your program to tell you if it exceeds array bounds while running.

Variables on salk are not automatically initialized, which can cause your program to fail if it relies on variables being initialized. The -check uninit and -ftrapuv options to the ifort compiler will catch certain cases of uninitialized variables, as will the -Wall and -O options to the GNU compilers.

There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.

Core files

The key to making core files on salk is to allow them to be written by increasing the maximum file size allowable for core files. The default size is 0 bytes. If you are using sh-type shells you do this by issuing the command

    ulimit -c unlimited

For csh-type shells you issue the command

    limit coredumpsize unlimited

Core files are created in directory ~/tmp. For more information about core files issue the command

    man 5 core

Little endian versus big endian

The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Salk is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.

If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command

echo 'puts $tcl_platform(byteOrder)' | tclsh

You can read a big endian file on salk if you are using the Intel ifort compiler. Before you run your program issue the command

setenv FORT_CONVERTn big-endian

for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.

Improving Performance

Calculating Mflops

You can calculate your code's Mflops rate using the TAU utility. The TAU examples show how to determine timing data and floating point operation counts for your program, from which you can calculate your Mflops rate.

Cache performance

Cache performance can have a significant impact on the performance of your program. Each salk core has three levels of cache. The primary data and instruction caches are 4-way set associative and 16 Kbytes each. The L2 data cache is 256 Kbytes, while the L2 instruction cache is 1 Mbyte. Both L2 caches are 8-way set associative. The L3 cache is 8 Mbytes.

You can measure your program's cache miss rate for each of the available caches by setting the appropriate counters when using the TAU utility. If you need assistance in measuring or improving your cache performance, send email to PSC user services.

Collecting timing data

Collecting timing data is essential for measuring and improving program performance. We recommend five approaches for collecting timing data. The ja and /usr/bin/time utilities can be used to collect data at the program level. They report results to the hundredths of seconds. The TAU utility and the omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds, while the precision for MPI_Wtime is nanoseconds.

IO optimization

There are several steps you can take to improve your application IO performance on salk. If your program reads or writes data in small chunks you can use the Flexible File IO subsystem. If your program reads or writes files that are 1 Gbyte or larger, you can use file striping so that your file IO will be done in parallel.

Flexible File IO

The Flexible File IO subsystem adds a layer of buffering to your program. As a result your program performs fewer IO operations, each reading or writing a larger volume of data, which should speed up your program.

You do not need to change your source code to use Flexible File IO nor do you need to recompile or relink your program. Instead, there are environment variables that you need to set before you run your executable. The use of these environment variables and other information about Flexible File IO is contained in the "Linux Application Tuning Guide" manual available at

  http://techpubs.sgi.com/

The use of Flexible File IO is heavily dependent on the characteristics of your application so we do not have any general advice for its use. If you want assistance using Flexible File IO with your program, send email to PSC user services.

File striping

If your program reads or writes large files you should use $SCRATCH: your $HOME space is limited, and the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO, which can significantly improve its performance.

A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. Your goal should be to use as many OSTs as possible concurrently. This is how you can use Lustre as a parallel file system. Salk currently has 32 OSTs.

A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.

For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.

An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.

However, striping has other disadvantages besides possible IO contention, and these must be considered when making your striping decisions. Many interactive file commands, such as ls -l or unlink, take longer for striped files. Striped files are also more at risk of data loss due to hardware failure: if a file is spread across several OSTs, a hardware failure of any one of them will result in the loss of part of the data in that file. You may prefer to risk losing all of a few unstriped files rather than parts of many striped files.

You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.

The format of the lfs setstripe command is

lfs setstripe filename stripe-size OST-start stripe-count

We recommend that you always set the stripe size parameter to 0 and the starting OST parameter to -1. This results in the default stripe size of 1 Mbyte and assigns your starting OST in a round-robin fashion. A value of -1 for the stripe count means the file will be spread across all the available OSTs. Since the Lustre file system on salk currently has 32 OSTs, you could also specify the stripe count parameter as 32 to have the file spread across all available OSTs.

For example, the command

lfs setstripe bigfile.out 0 -1 -1

sets the stripe count for bigfile.out to be all available OSTs. This command would be suitable for the situation where you have one core which is writing all your data.
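A short session illustrating this single-writer case, using lfs getstripe to confirm the parameters that were applied (a sketch; the getstripe output format varies with the Lustre version):

```shell
# Set the striping parameters before the file is created:
# stripe size 0 (use the 1-Mbyte default), starting OST -1
# (round-robin), stripe count -1 (spread across all OSTs).
lfs setstripe bigfile.out 0 -1 -1

# Verify the striping that was applied to the file.
lfs getstripe bigfile.out
```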

The command

lfs setstripe manyfiles.out 0 -1 1

has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.

You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
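For example, you could give a run directory a default stripe count of 1 for per-core output files, and then override that default for one large single-writer file. The directory and file names here are hypothetical:

```shell
# Give every new file in this directory a stripe count of 1,
# so each per-core output file lands on a single OST.
mkdir -p $SCRATCH/run01
lfs setstripe $SCRATCH/run01 0 -1 1

# Override the directory default for one large, single-writer file
# by setting its striping parameters before it is created.
lfs setstripe $SCRATCH/run01/checkpoint.out 0 -1 -1
```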

The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application.

There is a man page for lfs on salk. Online documentation for Lustre is also available. If you want assistance in choosing a striping strategy, contact PSC User Services.

Third-party software

Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.

For example, we recommend the FFTW library for FFTs and the MKL library for linear algebra routines.

Performance monitoring tools

We have installed several performance monitoring tools on salk. The TAU utility is a performance profiling and tracing tool. The PAPI utility can be used to access the hardware performance counters on salk. We intend to install more performance tools on salk. If you want assistance in using any of these tools, or have a utility you would like us to install, contact PSC User Services.

Assistance with improving performance

If you would like to improve the performance of your code, you can get optimization assistance from PSC. This assistance includes consulting assistance from PSC, special queue handling if necessary, and service unit discounts, all of which are designed to enable you to scale up your code as quickly as possible. Contact PSC User Services if you would like performance improvement assistance with your program.

Software Packages

A list of software packages installed on salk is available. If you would like us to install a package that is not in this list, contact PSC User Services.

The Module Command

Running many software packages requires setting paths and other environment variables, and changing versions of a package often means modifying those definitions. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module documentation.
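A typical module session looks like the following. The package names and version numbers shown are hypothetical; use module avail to see what is actually installed on salk:

```shell
# See which packages and versions are available.
module avail

# Load a package; the module sets PATH and related variables for you.
module load fftw/3.2

# List the modules currently loaded.
module list

# Switch to a different version of a loaded package.
module swap fftw/3.2 fftw/3.1
```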

Stay Informed

As a user of salk, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently.

You will also periodically receive email from PSC with information about salk. To ensure that you receive this email, make sure that your email forwarding is set properly by following the instructions for setting your email forwarding.

Acknowledgement in Publications

PSC requests that a copy of any publication (preprint or reprint) resulting from research done on salk be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.

Reporting a Problem

You have two options for reporting problems on salk.

You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.