Salk was installed in 2008 and decommissioned in January 2015.
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on salk
One core-hour on salk is one SU.
User accounting data is available with the xbanner command. It displays account information including the initial SU allocation for a grant, the number of unused SUs remaining for a grant and the date of the last job that charged to a grant.
Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.
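As a worked example of the core-hour rule above, the following sketch estimates the charge for a job. It assumes the charge is simply the number of cores multiplied by the elapsed time in hours, with no rounding or minimum charge; the function name su_charge is made up for this illustration and is not a salk command.

```shell
# Estimate the SU charge for a job, assuming 1 SU = 1 core-hour
# and no rounding (an assumption, not a documented billing rule).
# Usage: su_charge CORES ELAPSED_SECONDS
su_charge() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.2f\n", c * s / 3600 }'
}

su_charge 16 7200   # 16 cores held for 2 hours
su_charge 4 300     # 4 cores held for 5 minutes
```

Under these assumptions the first call reports 32 SUs. Compare such estimates with the figures that xbanner reports for your grant.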
File systems are file storage spaces directly connected to a system. There are currently two such areas available to you on salk.
$HOME is your home directory. It has a 5-Gbyte quota and is visible to all of salk's compute and frontend processors. $HOME is backed up daily, although it is still a good idea to store your important $HOME files to the Data Supercell. The Data Supercell, PSC's file archival system, is discussed below.
$SCRATCH is salk's scratch area, to be used as a working space for your running jobs. $SCRATCH is visible to all of salk's compute and frontend processors. You should use the name $SCRATCH to refer to your scratch area since we may change its implementation.
$SCRATCH is not a permanent storage space. Files can only remain on $SCRATCH for up to 7 days and then we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should store copies of your $SCRATCH files to your local site or to the Data Supercell as soon as you can after you create them. The Data Supercell (patent pending), PSC's file archival system, is discussed below.
File repositories are discrete file storage spaces. You cannot, for example, open a file that resides in a file repository, nor can you run a program on a file repository. You do not log in to a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on salk: the Data Supercell, PSC's file archival system.
- The Data Supercell (patent pending)
- The Data Supercell is a disk-based archival system.
The Intel C, C++ and Fortran compilers and the GNU C and C++ compilers are installed on salk and they can be used to create OpenMP, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.
|Compiler||OpenMP||MPI||Hybrid||Serial|
|Intel Fortran||ifort -openmp myopenmp.f||ifort mympi.f -lmpi||ifort -openmp myhybrid.f -lmpi||ifort myserial.f|
|Intel C||icc -openmp myopenmp.c||icc mympi.c -lmpi||icc -openmp myhybrid.c -lmpi||icc myserial.c|
|Intel C++||icpc -openmp myopenmp.cc||icpc mympi.cc -lmpi -lmpi++||icpc -openmp myhybrid.cc -lmpi -lmpi++||icpc myserial.cc|
|GNU C||gcc -fopenmp myopenmp.c||gcc mympi.c -lmpi||gcc -fopenmp myhybrid.c -lmpi||gcc myserial.c|
|GNU C++||g++ -fopenmp myopenmp.cc||g++ mympi.cc -lmpi -lmpi++||g++ -fopenmp myhybrid.cc -lmpi -lmpi++||g++ myserial.cc|
Man pages are available for ifort, icc and icpc and for gcc and g++.
The UPC compiler is also installed on salk. Online instructions for its use are available.
A native Java compiler and interpreter are available on salk. Issue the command
module load jrockit/5.0
to get access to the javac and java commands.
Torque, an open source version of the Portable Batch System (PBS), controls all access to salk's compute processors, for both batch and interactive jobs. Currently salk has two queues: the batch queue and the debug queue. Interactive jobs can run in either queue, and the method for doing so is discussed below.
The maximum walltime for the batch queue is 48 hours and the maximum number of cores you can request is 132. The maximum walltime for the debug queue is 30 minutes and the maximum number of cores you can request is 8.
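The queue limits above can be checked before you submit a job. The following is a minimal sketch, not a PSC tool: the function name check_request is hypothetical, the walltime is passed in seconds for simplicity, and the multiple-of-four rule for ncpus described under the ncpus directive is included as well.

```shell
# Check a resource request against the queue limits quoted above.
# Usage: check_request QUEUE NCPUS WALLTIME_SECONDS
# Prints "ok" on success; prints a reason on stderr and returns 1 otherwise.
check_request() {
    queue=$1; ncpus=$2; secs=$3
    case $queue in
        batch) max_cores=132; max_secs=$((48 * 3600)) ;;
        debug) max_cores=8;   max_secs=$((30 * 60)) ;;
        *) echo "unknown queue: $queue" >&2; return 1 ;;
    esac
    if [ "$ncpus" -lt 4 ] || [ $((ncpus % 4)) -ne 0 ]; then
        echo "ncpus must be a positive multiple of 4" >&2; return 1
    fi
    if [ "$ncpus" -gt "$max_cores" ]; then
        echo "too many cores for $queue (max $max_cores)" >&2; return 1
    fi
    if [ "$secs" -gt "$max_secs" ]; then
        echo "walltime too long for $queue (max $max_secs s)" >&2; return 1
    fi
    echo ok
}

check_request debug 8 300
```

A request such as check_request debug 12 60 is rejected because the debug queue allows at most 8 cores.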
Sample batch jobs
To run a batch job on salk you submit a batch script to the PBS system. A PBS job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an OpenMP program is
#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myopenmp .
#run my executable
setenv OMP_NUM_THREADS 4
./myopenmp
ja -chlst
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command, then you must use the -l option to your shell command.
The four #PBS lines are PBS directives.
- #PBS -l ncpus=4
This directive specifies the number of cores to allocate for the job. For performance reasons the actual allocation of cores is done by blades, with each blade containing four cores, so the number of cores you request must be a multiple of four or the job submission will fail. Jobs do not share blades. Within your batch script the environment variable PBS_NCPUS is set to the number of cores you requested.
Each blade has 8 Gbytes of physical memory. If your job exceeds the amount of physical memory available to it--a job requesting 16 cores will run on 4 blades and thus have 32 Gbytes of memory available to it--it will be killed by the system with a message similar to
PBS: Job killed: cpuset memory_pressure X reached/exceeded limit 1
written to its stderr. A cpuset is the set of blades--cores and associated memory--assigned to your job. Memory pressure is a metric that indicates whether the processes on a blade are attempting to free up in-use memory on the blade to satisfy additional memory requests. Since this use of memory would result in significantly lower performance, a job that attempts to do this is killed by the system. For more information about cpusets and memory pressure see the cpuset man page (man 4 cpuset).
If this happens to your job you should resubmit it and ask for more cores. The output from the ja command, which is discussed below, can help you determine how many blades your job needs.
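The blade arithmetic above (four cores and 8 Gbytes per blade) can be turned around to find the smallest ncpus value for a given memory requirement. The sketch below assumes the full 8 Gbytes of each blade is usable by your job, which may be optimistic; ncpus_for_memory is a hypothetical helper name and takes a whole number of Gbytes.

```shell
# Smallest ncpus request whose blades provide at least GBYTES of memory,
# assuming 4 cores and 8 usable Gbytes per blade.
# Usage: ncpus_for_memory GBYTES
ncpus_for_memory() {
    gb=$1
    blades=$(( (gb + 7) / 8 ))        # round up to whole blades
    if [ "$blades" -lt 1 ]; then
        blades=1                      # a job always gets at least one blade
    fi
    echo $(( blades * 4 ))
}

ncpus_for_memory 32   # 4 blades
ncpus_for_memory 33   # one more blade is needed
```

With these assumptions a 32-Gbyte job needs ncpus=16, matching the example above, while a 33-Gbyte job needs ncpus=20. The memory figures reported by ja are a better guide than an estimate made in advance.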
- #PBS -l walltime=5:00
The second directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
The next directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
Your stdout and stderr files are each limited to 20 Mbytes. If your job exceeds either of these limits it will be killed by the system. If you have a program that you think will exceed either of these limits you should redirect your stdout or stderr output to a $SCRATCH file. Another option is to run your job from $SCRATCH.
- #PBS -q batch
The final PBS directive requests that your job be run in the batch queue.
The remaining lines in the script are comments and command lines.
- set echo
This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants, use that shell's equivalent tracing command instead (for example, set -x in sh and bash).
- ja
The ja command turns on job accounting for your job. This allows you to obtain information on the elapsed time and the memory and IO usage of your program, plus other data.
You must pair the command with another ja command at the end of your job. The option -t to this second ja command turns off job accounting and writes your accounting data to stdout. The other options to the second example ja command determine what output you will receive from ja. We recommend these options because we think they will provide detailed but useful information about your job's processes. However, you can look at the man page for ja to see what reporting options you want to use.
There is no overhead to using ja. We strongly recommend that you use ja so you can understand the resource usage of your jobs, which you can use when you submit future jobs. The output from ja can also be used for debugging and performance improvement purposes.
- Comment lines
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- setenv OMP_NUM_THREADS 4
This command sets the number of threads you want your OpenMP program to use. You should set this value to the number of cores you requested with your PBS ncpus directive so each of your threads will run on its own core.
- ./myopenmp
This command runs your executable.
A sample job to run an MPI program is
#!/bin/csh
#PBS -l ncpus=4
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun -np 4 ./mympi
ja -chlst
This script is identical to the OpenMP script except when you run your executable. You do not have to set the variable OMP_NUM_THREADS, but you have to use the mpirun command to launch your executable on salk's compute processors. The value for the -np option is the number of cores you want your program to run on. You should set -np to the number of cores you requested with your PBS ncpus directive. You must use mpirun to run your MPI executable or it will run on a frontend and degrade overall system performance.
A sample job to run a hybrid OpenMP and MPI program is
#!/bin/csh
#PBS -l ncpus=64
#ncpus must be a multiple of 4
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q batch
set echo
ja
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myhybrid .
#run my executable
mpirun -np 16 omplace -nt 4 ./myhybrid
ja -chlst
This script is identical to the above two scripts except when you run your executable. You use a combination of the mpirun and omplace commands to run your hybrid program. The value of the -np option to the mpirun command is the number of your MPI tasks. The value of the -nt option to the omplace command is the number of your OpenMP threads per MPI task. The product of these two values should be the total number of cores you requested with your PBS ncpus specification.
The omplace command ensures that each of your OpenMP threads runs on its own core. You must use mpirun to run your hybrid executable or it will run on a frontend and degrade overall system performance.
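The relationship between ncpus, the number of MPI tasks and the number of OpenMP threads per task can be checked with a few lines of shell. This is a sketch only: hybrid_cmd is a hypothetical helper that prints the launch line for you to copy into your batch script; it does not run anything itself.

```shell
# Derive the number of MPI tasks from the cores requested and the
# threads per task, and print the matching launch command.
# Usage: hybrid_cmd NCPUS THREADS_PER_TASK EXECUTABLE
hybrid_cmd() {
    ncpus=$1; nt=$2; exe=$3
    if [ $((ncpus % nt)) -ne 0 ]; then
        echo "ncpus ($ncpus) is not a multiple of threads per task ($nt)" >&2
        return 1
    fi
    np=$((ncpus / nt))                # tasks x threads must equal ncpus
    echo "mpirun -np $np omplace -nt $nt $exe"
}

hybrid_cmd 64 4 ./myhybrid
```

For 64 cores and 4 threads per task this reproduces the launch line used in the sample hybrid script.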
After you create your batch script you submit it to PBS with the qsub command.
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample scripts and submit the scripts with the command
qsub -l ncpus=4 -l walltime=5:00 -j oe -q batch myscript.job
Command-line directives override directives in your scripts.
Flexible walltime requests
Two other qsub options are available for specifying your job's walltime request.
-l walltime_min=HH:MM:SS -l walltime_max=HH:MM:SS
You can use these two options instead of "-l walltime" to make your walltime request flexible or malleable. A flexible walltime request can improve your job's turnaround in several circumstances.
For example, to accommodate large jobs, the system actively drains blades to create dynamic reservations. The blades being drained for these reservations create backfill up to the reservation start time that may be used by other jobs. Using flexible walltime limits increases the opportunity for your job to run on backfill blades.
As an example, if your job requests 64 cores and a range of walltime between 2 and 4 hours and a 64-core slot is available for 3 hours, your job could run in this slot with a walltime request of 3 hours. If your job had asked for a fixed walltime request of 4 hours it would not be started.
Another situation in which specifying a flexible walltime could improve your turnaround is the period leading up to a full drain for system maintenance. The system will not start a job that will not finish before the system maintenance time begins. A job with a flexible walltime could start if the flexible walltime range overlaps the period when the maintenance time starts. A job with a fixed walltime that would not finish until after the maintenance period begins would not be started.
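The effect of a flexible request can be illustrated with a small model. The sketch below assumes the scheduler gives a job the longest walltime that fits both the requested range and the available slot; the actual selection policy is not described in this document, so treat assign_walltime as a hypothetical illustration rather than the scheduler's algorithm. All times are in seconds.

```shell
# Model of fitting a walltime range into a backfill slot.
# Usage: assign_walltime MIN_SECONDS MAX_SECONDS SLOT_SECONDS
# Prints the walltime the job could start with, or returns 1 if it cannot start.
assign_walltime() {
    min=$1; max=$2; slot=$3
    if [ "$slot" -lt "$min" ]; then
        return 1                      # slot is shorter than the minimum request
    fi
    if [ "$slot" -lt "$max" ]; then
        echo "$slot"                  # take the whole slot
    else
        echo "$max"                   # never exceed the maximum request
    fi
}

# Range of 2 to 4 hours, 3-hour slot: the job can start with 3 hours.
assign_walltime 7200 14400 10800
# Fixed 4-hour request (min = max), 3-hour slot: the job cannot start.
assign_walltime 14400 14400 10800 || echo "does not fit"
```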
If the system starts one of your jobs with a flexible walltime request, the system selects a walltime within the two specified limits. This walltime will not change during your job's execution. You can determine the actual walltime your job was assigned by examining the Resource_List.walltime field of the output of the qstat -f command. The command
qstat -f $PBS_JOBID
will give this output for the current job. You can capture this output to find the value of the Resource_List.walltime field.
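One way to capture the field is to filter the qstat -f output with awk. The sketch below assumes the field appears on a line of the form "Resource_List.walltime = HH:MM:SS", which is the usual layout of qstat -f output; assigned_walltime is a hypothetical helper name, and the code is written for sh-type shells rather than csh.

```shell
# Read `qstat -f` output on standard input and print the value of
# the Resource_List.walltime field (assumed format: "name = value").
assigned_walltime() {
    awk -F' = ' '/Resource_List\.walltime/ { print $2; exit }'
}

# Inside a running job you would use:
#   WALLTIME=`qstat -f $PBS_JOBID | assigned_walltime`
# Here the function is shown on a made-up fragment of output.
printf '    Resource_List.ncpus = 64\n    Resource_List.walltime = 03:00:00\n' | assigned_walltime
```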
You may need to provide this value to your program so that your program can make appropriate decisions about writing checkpoint files. In the above example, you would tell your program that it is running for 3 hours and thus should begin writing checkpoint files sufficiently in advance of the 3-hour limit so that the file writing is completed when the limit is reached. The functions mpi_wtime and omp_get_wtime can be used to track how long your program has been running so that it writes checkpoint files to make sure you save results from your program's processing.
You may also want to save time at the end of your job to allow your job to transfer files after your program ends but before your job ends. You can use the timeout command to specify in seconds how long you want your program to run. Once your job determines what its actual walltime is you can, after subtracting the amount of time you want for file transfer at the end of your job, use this value in a timeout command. For example, assume your job is assigned a walltime of 1 hour and you want your program to stop 10 minutes before your job ends to allow your job to have adequate time for file transfer. To accomplish this you could use a command like the following
timeout --timeout=$PROGRAM_TIME -- mpirun -np 4 ./mympi
The example assumes that your script has retrieved your job's walltime, subtracted 10 minutes from it and assigned the value of 3000 to the variable PROGRAM_TIME. You will probably also want to provide this value to your program. Your program can then use this value to appropriately write out checkpoint files. When your program ends your job will have time to perform necessary file transfers before your job ends.
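The subtraction described above can be sketched as follows for sh-type shells. The helper names hms_to_seconds and program_time are made up for this illustration, and the walltime string is assumed to have already been retrieved from qstat -f.

```shell
# Convert a walltime of the form SS, MM:SS or HH:MM:SS to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# Seconds available to the program after reserving time for file transfer.
# Usage: program_time WALLTIME RESERVE_SECONDS
program_time() {
    total=$(hms_to_seconds "$1")
    echo $(( total - $2 ))
}

# A 1-hour walltime with 10 minutes reserved for file transfer.
PROGRAM_TIME=$(program_time 1:00:00 600)
echo "$PROGRAM_TIME"
```

This yields the value of 3000 seconds used in the timeout example. A production script would also want to check that the result is positive before starting the program.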
A form of interactive access is available on salk by using the -I option to qsub. For example, the command
qsub -I -l ncpus=4 -l walltime=5:00 -q debug
requests interactive access to 4 cores for 5 minutes in the debug queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI or hybrid program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
X-11 connections in interactive use
In order to use any X-11 tool, you must also include -X on the qsub command line:
qsub -X -I -l ncpus=4 -l walltime=5:00 -q debug
This assumes that the DISPLAY variable is set. Two ways in which DISPLAY is automatically set for you are:
- Connecting to pople with ssh -X pople.psc.edu
- Enabling X-11 tunneling in your Windows ssh tool
Totalview, Fluent and TAU are among the packages which require X-11 connections.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on salk.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge_id you want a job to be charged. If you only have one grant on salk you do not need to use this option; otherwise, you should charge each job to the appropriate grant.
You can see your valid charge_ids by typing
groups
at the salk prompt. Typical output will look like
sy2be6n ec3l53p eb3267p jb3l60q
Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify -W group_list for your job, this is the grant that will be charged.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
after        this job can be scheduled after job jobid begins execution.
afterok      this job can be scheduled after job jobid finishes successfully.
afternotok   this job can be scheduled after job jobid finishes unsuccessfully.
afterany     this job can be scheduled after job jobid finishes in any state.
before       this job must begin execution before job jobid can be scheduled.
beforeok     this job must finish successfully before job jobid begins.
beforenotok  this job must finish unsuccessfully before job jobid begins.
beforeany    this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
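As an illustration of the dependency option, the sketch below builds the -W argument for a given dependency type and job id and rejects types not listed above. The helper name depend_option, the script names step1.job and step2.job and the job id are all hypothetical.

```shell
# Print the qsub option for one dependency on one job.
# Usage: depend_option TYPE JOBID
depend_option() {
    case $1 in
        after|afterok|afternotok|afterany|before|beforeok|beforenotok|beforeany)
            echo "-W depend=$1:$2" ;;
        *)
            echo "unknown dependency type: $1" >&2
            return 1 ;;
    esac
}

# Typical use from a login session (qsub prints the id of the job it queued):
#   first=`qsub step1.job`
#   qsub `depend_option afterok $first` step2.job
depend_option afterok 1234.salk
```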
Monitoring and Killing Jobs
The qstat -a command displays the status of the PBS queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.
The qdel command, which takes a jobid as an argument, is used to kill queued and running jobs.
Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to solve code problems if you are using fewer processors. This strategy should be followed even if you are porting a working code from another system.
You should use the debug queue for your debugging runs. Do not run a debugging run on any of salk's front ends. You should always run a salk program with qsub.
The debug queue is intended to be used in the classic debugging cycle in which you run a debugging job, check its output and then submit another debugging job. You should not flood the debug queue with jobs nor should you chain your jobs through the debug queue by having a debug job submit its successor.
The debug queue should not be used for production runs that use only a few processors.
Several compiler options can be useful to you when you are debugging your program. If you use the -g option to the Intel or GNU compilers, the error messages you receive when your program fails will probably be more informative. For example, you will probably be given the line number of the source code statement that caused the failure. Once you have a production version of your code you should not use the -g option or your program will run slower.
The -check bounds option to the ifort compiler will cause your program to tell you if it exceeds an array bound while running.
Variables on salk are not automatically initialized. This can cause your program to fail if it relies on variables being initialized. The -check uninit and -ftrapuv options to the ifort compiler will catch certain cases of uninitialized variables, as will the -Wall and -O options to the GNU compilers.
There are more options to the Intel and GNU compilers that may assist you in your debugging. For more information see the appropriate man pages.
The key to making core files on salk is to allow them to be written by increasing the maximum file size allowable for core files. The default size is 0 bytes. If you are using sh-type shells you do this by issuing the command
ulimit -c unlimited
For csh-type shells you issue the command
limit coredumpsize unlimited
Core files are created in directory ~/tmp. For more information about core files issue the command
man 5 core
Little endian versus big endian
The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Salk is a little endian machine, which means that the low-order byte of a number is stored in the memory location with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a big endian machine.
If your machine has Tcl installed you can tell whether the machine is little endian or big endian by issuing the command
echo 'puts $tcl_platform(byteOrder)' | tclsh
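If Tcl is not installed, the same check can be sketched with the standard printf and od utilities. The function name byte_order is made up for this example; it prints the same two strings that the Tcl command reports.

```shell
# Report the byte order of the machine this runs on.
byte_order() {
    # Write the bytes 0x01 0x00 and read them back as one 16-bit unsigned
    # integer in host byte order: 1 on little endian, 256 on big endian.
    value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
    case $value in
        1)   echo littleEndian ;;
        256) echo bigEndian ;;
        *)   echo unknown; return 1 ;;
    esac
}

byte_order
```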
You can read a big endian file on salk if you are using the Intel ifort compiler. Before you run your program issue the command
setenv FORT_CONVERTn big-endian
for each Fortran unit number from which you are reading a big endian file. For 'n' substitute the appropriate unit number.
You can calculate your code's Mflops rate using the TAU utility. The TAU examples show how to determine timing data and floating point operation counts for your program, from which you can calculate your Mflops rate.
Cache performance can have a significant impact on the performance of your program. Each salk core has three levels of cache. The primary data and instruction caches are 4-way set associative and 16 Kbytes each. The L2 data cache is 256 Kbytes, while the L2 instruction cache is 1 Mbyte. Both L2 caches are 8-way set associative. The L3 cache is 8 Mbytes.
Collecting timing data
Collecting timing data is essential for measuring and improving program performance. We recommend five approaches for collecting timing data. The ja and /usr/bin/time utilities can be used to collect data at the program level. They report results to the hundredths of seconds. The TAU utility and the omp_get_wtime and MPI_Wtime functions can be used to collect timing data at a finer grain. The default precision for TAU is microseconds, but the -linuxtimers or -papi option can be used to obtain nanosecond precision. The precision for omp_get_wtime is microseconds, while the precision for MPI_Wtime is nanoseconds.
There are several steps you can take to improve your application IO performance on salk. If your program reads or writes data in small chunks you can use the Flexible File IO subsystem. If your program reads or writes files that are 1 Gbyte or larger, you can use file striping so that your file IO will be done in parallel.
Flexible File IO
The Flexible File IO Subsystem adds a layer of buffering to your program. The result will be that your program will have fewer IO operations and the IO operations will read and write larger volumes of data. This should speed up your program.
You do not need to change your source code to use Flexible File IO nor do you need to recompile or relink your program. Instead, there are environment variables that you need to set before you run your executable. The use of these environment variables and other information about Flexible File IO is contained in the "Linux Application Tuning Guide" manual.
If your program reads or writes large files you should use $SCRATCH. Your $HOME space is limited. In addition, the $SCRATCH file space is implemented using the Lustre parallel file system. A program that uses $SCRATCH can perform parallel IO and this can significantly improve its performance.
A Lustre file system is created from an underlying set of file systems called Object Storage Targets (OSTs). Your program can read from and write to multiple OSTs concurrently. Your goal should be to use as many OSTs as possible concurrently. This is how you can use Lustre as a parallel file system. Salk currently has 32 OSTs.
A striped file is one that is spread across multiple OSTs. Thus, striping a file is one way for you to be able to use multiple OSTs concurrently. However, striping is not suitable for all files. Whether it is appropriate for a file depends on the IO structure of your program.
For example, if each of your cores writes to its own file you should not stripe these files. If each file is placed on its own OST then as each core writes to its own file you will achieve a concurrent use of the OSTs because of the IO structure of your program. File striping in this case could actually lead to an IO performance degradation because of the contention between the cores as they perform IO to the pieces of their files spread across the OSTs.
An application ideally suited to file striping would be one in which there is a large volume of IO but a single core performs all the IO. In this situation you will need to use striping to be able to use multiple OSTs concurrently.
However, there are other disadvantages besides possible IO contention to striping and these must be considered when making your striping decisions. Many interactive file commands such as ls -l or unlink will take longer for striped files. Also, striped files are more at risk for data loss due to hardware failure. If a file is spread across several OSTs a hardware failure of any of them will result in the loss of part of the data in that file. You may choose to lose all of a small number of files rather than parts of all of a large number of your files.
You use the lfs setstripe command to set the striping parameters for a file. You have to set the striping parameters for a file before you create it.
The format of the lfs setstripe command is
lfs setstripe filename stripe-size OST-start stripe-count
We recommend that you always set the stripe size parameter to 0 and the starting OST parameter to -1. This will result in the default stripe size of 1 Mbyte and assign your starting OST in a round-robin fashion. A value of -1 for the stripe count means the file should be spread across all the available OSTs. Since the Lustre file system on salk currently has 32 OSTs you could also specify the stripe count parameter as 32 to have the file spread across all available OSTs.
For example, the command
lfs setstripe bigfile.out 0 -1 -1
sets the stripe count for bigfile.out to be all available OSTs. This command would be suitable for the situation where you have one core which is writing all your data. In contrast, the command
lfs setstripe manyfiles.out 0 -1 1
has a stripe count of 1. Each file will be placed on its own OST. This is suitable for the situation where each core writes its own file and you do not want to stripe these files.
You can also specify a directory instead of a filename in the lfs setstripe command. The result will be that each file created in that directory will have the indicated striping. You can override this striping by issuing an lfs setstripe command for individual files within that directory.
The kind of striping that is best for your files is very application dependent. Your application will probably fall between the two extreme cases discussed above. You will therefore need to experiment with several approaches to see which is best for your application.
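When experimenting with different stripe counts it can help to generate the lfs setstripe command lines consistently. The sketch below prints (but does not run) a command in the positional format shown above, using the recommended stripe size of 0 and starting OST of -1. The helper name stripe_command is hypothetical, the limit of 32 reflects the number of OSTs quoted for salk, and Lustre installations elsewhere may expect a different argument syntax.

```shell
# Print an lfs setstripe command for a file or directory.
# Usage: stripe_command PATH STRIPE_COUNT   (-1 means all available OSTs)
stripe_command() {
    path=$1; count=$2
    if [ "$count" -ne -1 ] && { [ "$count" -lt 1 ] || [ "$count" -gt 32 ]; }; then
        echo "stripe count must be -1 or between 1 and 32" >&2
        return 1
    fi
    # stripe size 0 = default 1 Mbyte; starting OST -1 = round-robin
    echo "lfs setstripe $path 0 -1 $count"
}

stripe_command bigfile.out -1   # one core writes everything: use all OSTs
stripe_command bigfile.out 8    # an intermediate setting to try
stripe_command manyfiles.out 1  # one file per core: no striping
```

Timing the same run with a few different stripe counts is one way to carry out the experimentation suggested above.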
Third-party routines can often perform better than routines you code yourself. You should investigate whether there is a third-party routine available to replace any of the routines you have written yourself.
Performance monitoring tools
Assistance with improving performance
The Module Command
To run many software packages, paths and other environment variables must often be set. To change versions of a package, these definitions often have to be modified. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module documentation.
As a user of salk, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently.
You will also periodically receive email from PSC with information about salk. In order to ensure that you receive this email, you should make sure that your email forwarding is set properly by following the instructions for setting your email forwarding.
Acknowledgement in Publications
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on salk be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.
Reporting a Problem
You have two options for reporting problems on salk.
You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.