- System Configuration
- Access to BioU
- Storing Files
- Transferring files
- Creating Programs
- Running Jobs
- Monitoring and Killing Jobs
- Software Packages
- The Module Command
- Stay Informed
- Acknowledgement in Publications
- Applicability of NIH's Public Access Policy
- Reporting a Problem
BioU is a bioinformatics educational resource funded by the NIH. It provides a stable environment in which classroom and individualized research training can occur. Small research projects, such as individualized class projects, graduate student projects, and many typical academic bioinformatics projects can be hosted on BioU. However, projects requiring significant computational resources should be carried out on other, larger, PSC/NRBSC computing platforms.
System Configuration
BioU is a 3-node computational cluster. Each node has 4 quad-core AMD Opteron processors (2.4 GHz), for a total of 16 cores per node. Each node has 128 Gbytes of memory. The nodes are interconnected by a QDR InfiniBand communications link.
BioU runs the Redhat Linux operating system.
GNU and Intel C, C++ and Fortran compilers are installed on BioU, as are the facilities to enable you to run MPI programs. Python and the BioPython libraries are also installed.
Access to BioU
Getting an account on BioU
The primary purpose of BioU is to provide a stable environment for classroom and individualized bioinformatics research training. Small research projects such as individualized class projects, graduate student projects, and many typical academic bioinformatics projects can be hosted on BioU.
To apply for a grant, visit http://www.nrbsc.org/resources/.
Connecting to BioU
For ease of access, BioU runs a web server. Most users do not log in through a terminal, but rather access BioU through one of three web interfaces:
- Moodle: A learning management system at http://biou.psc.edu
- Galaxy: An interface to bioinformatics programs at http://biou.psc.edu/galaxy
- Harvest-seq: An interface to bioinformatics programs at http://biou.psc.edu/harvest
Use your PSC Kerberos username and password to authenticate to all of these web interfaces when prompted.
To log in through a terminal instead, use ssh to connect to biou.psc.edu. When prompted, enter your PSC Kerberos username and password.
Changing your password
Your password is the same on all PSC production platforms. If you change your password on one PSC system, it is changed on all PSC systems.
Please read the PSC password policies.
Changing your password on the web
Visit https://apr.psc.edu to change your password online. The site contains detailed instructions.
Changing your password while logged in to BioU
Use the kpasswd command to change your PSC Kerberos password. Do not use the passwd command.
Changing your login shell
You can use the chsh command to change your login shell. When doing so, specify a shell from the /usr/psc/shells directory.
Accounting on BioU
One core-hour on BioU is one service unit (SU).
The xbanner command displays user accounting data, including the initial SU allocation for a grant, the number of unused SUs remaining for the grant, and the date of the last job charged to the grant.
Accounting information for grants is also available at the Web-based PSC Grant Management System. You will need your PSC Kerberos password to access this system. This system provides more detailed information than xbanner, although some of the information is only available to grant PIs. The system has extensive internal documentation.
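The charging rule above (one core-hour is one SU) can be turned into a quick estimate before you submit a job. The following is a minimal sketch, not PSC's accounting code: the function name `estimate_su` is hypothetical, and it assumes the charge is simply the number of cores held multiplied by the hours they are held. Use xbanner for the authoritative figures.

```python
def estimate_su(cores, walltime_hours):
    """Estimate the charge in SUs, assuming one SU per core-hour."""
    if cores < 1 or walltime_hours < 0:
        raise ValueError("cores must be >= 1 and walltime_hours must be >= 0")
    return cores * walltime_hours


if __name__ == "__main__":
    # A 16-core job that holds its cores for 2 hours.
    print(estimate_su(16, 2))
    # A 1-core session held for half an hour.
    print(estimate_su(1, 0.5))
```

Under this assumption, a 16-core job that runs for 2 hours would use 32 SUs.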
Storing Files
File systems are file storage spaces directly connected to a system. These are the areas available to you on BioU:
$HOME is your home directory. It has a 5-Gbyte quota and is visible to all of BioU's compute nodes. $HOME is backed up daily, although it is still a good idea to copy your important $HOME files to the Data Supercell (patent pending). The Data Supercell, PSC's file archival system, is discussed below.
$SCRATCH is BioU's scratch area, to be used as working space for your running jobs. The $SCRATCH for each node is a distinct file space, and each separate space has 7.3 Tbytes of available storage. Always use the name $SCRATCH to refer to your scratch area, since we may change its implementation.
$SCRATCH is not a permanent storage space. Files can remain on $SCRATCH for at most 7 days, after which we will delete them. In addition, we will delete $SCRATCH files if we need to free up space to keep jobs running. Finally, $SCRATCH is not backed up. For these three reasons, you should copy your $SCRATCH files to your local site or to the Data Supercell as soon as you can after you create them. The Data Supercell, PSC's file archival system, is discussed below.
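Because files in $SCRATCH are deleted after 7 days, it can help to list the files that are getting old so you can copy them elsewhere in time. The sketch below is one possible way to do that in Python. It assumes file age is measured by modification time, which this guide does not specify, and the 5-day threshold and the function name `files_older_than` are illustrative choices only.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def files_older_than(root, min_age_days, now=None):
    """Return sorted paths of files under root last modified at least min_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * SECONDS_PER_DAY
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                modified = os.path.getmtime(path)
            except OSError:
                # Skip files that vanish or cannot be read (e.g. broken links).
                continue
            if modified <= cutoff:
                old.append(path)
    return sorted(old)


if __name__ == "__main__":
    scratch = os.environ.get("SCRATCH")
    if scratch:
        # Assumption: warn two days before the 7-day limit.
        for path in files_older_than(scratch, 5):
            print(path)
```

The script only prints candidate files; copying them to your local site or to the Data Supercell is left to you.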
- Galaxy History Files
- The Galaxy interface stores files that are included in a Galaxy history in a unique internal structure. Galaxy histories should be considered as temporary storage locations, not a permanent storage repository.
Guidelines for using Galaxy History files:
- You should always download (save) the datasets that you wish to keep that are contained within a Galaxy history. Datasets may be stored in the PSC's Data Supercell (patent pending), the file archival system.
- As a general guideline, datasets contained in Galaxy Histories that have not been accessed within the last six months may be purged from the system whenever additional storage space is needed.
- Extremely large dataset files can cause system issues and may be purged at any time. An attempt will be made to contact the user before deletion and/or archiving of the large dataset. BioU is designed to be a teaching resource. Extremely large datasets should be analysed on other, more powerful, PSC computing resources.
- Moodle Course Files
- The Moodle learning management system provides a unique internal structure in which course files are uploaded and managed by the Moodle system.
We recommend that instructors create personal backup copies of their Moodle courses periodically. We also recommend that the instructor create a final backup of the course and download the backup for safekeeping at the conclusion of each course. Backups for a course can only be performed by individuals assigned the "Teacher" role. Course backups may be stored in the PSC's file archival system. To back up a Moodle course:
- Log into the Moodle system and go to the course that you want to back up or archive.
- Under the "Administration" box, select "Backup".
- Confirm that all items that you want to back up are selected.
- Click on the "Continue" button at the bottom of the screen.
- Accept the default backup filename, or change it if necessary.
- Click on the "Continue" button at the bottom of the screen. Your backup should now be complete.
- Click on the "Continue" button at the bottom of the screen.
- Click on the backup file to download it to a local machine.
Courses in the Moodle system that have not been accessed within the last year may be removed from the system.
File repositories are file storage spaces which are not directly connected to a frontend or compute processor. You cannot, for example, open a file that resides in a file repository. You must use explicit file copy commands to move files to and from a repository. You currently have one file repository available to you on BioU: the Data Supercell, PSC's file archival system.
- The Data Supercell (patent pending)
The Data Supercell is a complex disk-based system.
Creating Programs
GNU and Intel C, C++ and Fortran compilers are installed on BioU and they can be used to create serial and MPI programs. OpenMPI is the variant of MPI available on BioU. The commands you should use to create your programs are shown in the table below.
|Compiler|MPI program|Serial program|
|---|---|---|
|GNU Fortran|mpif90 mympi.f90|gfortran myserial.f90|
|GNU C|mpicc mympi.c|gcc myserial.c|
|GNU C++|mpic++ mympi.C|g++ myserial.C|
|Intel Fortran|ifort mympi.f -lmpi|ifort myserial.f|
|Intel C|icc mympi.c -lmpi|icc myserial.c|
|Intel C++|icpc mympi.cc -lmpi -lmpi++|icpc myserial.cc|
Man pages for the GNU and Intel compilers are available.
The Python interpreter and the BioPython libraries are also installed. To use Python and BioPython, add the following as the first line in your Python script file:
Running Jobs
The Portable Batch System (PBS) controls all access to BioU's compute nodes, for both parallel and serial jobs in both batch and interactive mode. Currently BioU has two batch queues: the parallel queue and the serial queue. Interactive jobs can run in either queue. The method for doing so is discussed below.
- Twenty-eight cores are used by the serial queue. There are no memory limits in the serial queue, other than the physical memory capacity of each machine (<128 Gbytes).
- Sixteen cores are used by the parallel queue. There are no memory limits in the parallel queue, other than the physical memory capacity of each machine (<128 Gbytes).
Both queues are FIFO queues. Both queues allow multiple jobs to run on the machine at one time.
Galaxy PBS jobs
Many analysis jobs run through the Galaxy interface are submitted to the PBS queues on BioU. Most Galaxy PBS jobs are submitted to the serial queue with a time limit of 48 hours. Other Galaxy PBS jobs are submitted to the parallel queue as 16-core jobs with a time limit of 48 hours. These limits are not adjustable by Galaxy users through the Galaxy system.
To view the status of the PBS queues on BioU through Galaxy, run the Galaxy tool "Check PBS Queues" listed under "BioU: Status check".
Sample MPI batch job
To run a batch job on BioU you submit a batch script to the scheduler. A job script consists of PBS directives, comments and executable commands. The last line of your batch script must end with a newline.
A sample job script to run an MPI program is
#!/bin/csh
#PBS -q parallel
#PBS -l nodes=1:ppn=16
#PBS -l walltime=5:00
#PBS -j oe
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/mympi .
#run my executable
mpirun ./mympi
The first line in the script cannot be a PBS directive. Any PBS directive in the first line is ignored. Here, the first line identifies which shell should be used for your batch job. If instead of the C-shell you are using the Bourne shell or one of its descendants and you are using the module command in your batch script, then you must include the -l option to your shell command.
The next four lines are PBS directives.
- #PBS -q parallel
This directive directs your job to the parallel queue. Your job will run on either biou2 or biou3. In either case biou3 will be your home node and when you cd to $SCRATCH you will be using biou3's $SCRATCH.
- #PBS -l nodes=1:ppn=16
This directive indicates how many cores you want to use. The value for nodes must be 1. The value for ppn is the number of cores you want to use. The maximum value of ppn is 16.
- #PBS -l walltime=5:00
This directive requests 5 minutes of walltime. Specify the time in the format HH:MM:SS. At most two digits can be used for minutes and seconds. Do not use leading zeroes in your walltime specification.
- #PBS -j oe
This directive combines your .o and .e output into one file, in this case your .o file. This makes your job easier to debug.
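If you generate walltime values from a script, the formatting rules for the walltime directive can be captured in a small helper. This is a minimal sketch with a hypothetical function name, `format_walltime`. It assumes that "no leading zeroes" means writing 5:00 rather than 00:05:00, as in the sample script, and for simplicity it only handles requests of at least one minute.

```python
def format_walltime(total_seconds):
    """Format a duration in seconds as H:MM:SS, or M:SS when under an hour."""
    total_seconds = int(total_seconds)
    if total_seconds < 60:
        raise ValueError("this sketch only handles requests of at least 60 seconds")
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        # The first field is left unpadded; minutes and seconds use two digits.
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"


if __name__ == "__main__":
    print(format_walltime(300))        # the 5-minute request from the sample
    print(format_walltime(48 * 3600))  # a 48-hour request
```

The first call reproduces the 5:00 value used in the sample directive.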
The remaining lines in the script are comments and command lines.
- set echo
This command causes your batch output to display each command next to its corresponding output. This makes your job easier to debug. If you are using the Bourne shell or one of its descendants, use set -x instead.
- Comment lines
The other lines in the sample script that begin with '#' are comment lines. The '#' for comments and PBS directives must be in column one of your scripts.
- mpirun ./mympi
This command launches your executable on BioU's compute nodes. You must use mpirun to run your MPI executable.
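Several of the rules described for the sample script are easy to violate by accident: the script must end with a newline, the first line cannot be a PBS directive, and '#' must be in column one. The following Python sketch checks a script's text against those three rules. It is an illustration only; the function name `check_job_script` is hypothetical, and PBS itself may enforce further rules not covered here.

```python
def check_job_script(text):
    """Return a list of problems found in the text of a PBS job script."""
    problems = []
    lines = text.split("\n")
    if not text.endswith("\n"):
        problems.append("last line does not end with a newline")
    if lines[0].lstrip().startswith("#PBS"):
        problems.append("first line is a PBS directive and will be ignored")
    for number, line in enumerate(lines, start=1):
        stripped = line.lstrip()
        # A '#' preceded by whitespace is not in column one.
        if stripped.startswith("#") and stripped != line:
            problems.append(f"line {number}: '#' is not in column one")
    return problems


if __name__ == "__main__":
    script = "#!/bin/csh\n#PBS -q serial\n#PBS -l nodes=1:ppn=1\n./myserial\n"
    print(check_job_script(script))
```

An empty list means none of the three problems was found.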
Sample serial batch job
A sample job script to run a serial program is
#!/bin/csh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=5:00
#PBS -j oe
#PBS -q serial
set echo
#move to my $SCRATCH directory
cd $SCRATCH
#copy executable to $SCRATCH
cp $HOME/myserial .
#run my executable
./myserial
Your serial program will run on one core, but it can use memory up to the physical memory limit of the biou1 node. It will have to share this memory with other serial jobs running on the machine while it is running.
After you create your batch script you submit it to PBS with the qsub command.
Your batch output--your .o and .e files--is returned to the directory from which you issued the qsub command after your job finishes.
You can also specify PBS directives as command-line options. Thus, you could omit the PBS directives from the above sample script for a parallel job and submit the script with the command
qsub -l nodes=1:ppn=16 -l walltime=5:00 -j oe -q parallel myscript.job
Command-line directives override directives in your scripts.
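When jobs are submitted from a script rather than typed by hand, the command-line form can be assembled programmatically. The sketch below builds the argument list for a single-node job and enforces the limits stated earlier (nodes fixed at 1, ppn at most 16, queue either serial or parallel). The function name `build_qsub_args` is hypothetical, and the sketch only constructs the command; it does not run qsub.

```python
MAX_PPN = 16  # largest ppn value this guide allows


def build_qsub_args(script, queue, ppn, walltime, join_output=True):
    """Return the qsub argument list for a single-node BioU job."""
    if queue not in ("serial", "parallel"):
        raise ValueError("queue must be 'serial' or 'parallel'")
    if not 1 <= ppn <= MAX_PPN:
        raise ValueError(f"ppn must be between 1 and {MAX_PPN}")
    args = ["qsub", "-l", f"nodes=1:ppn={ppn}", "-l", f"walltime={walltime}"]
    if join_output:
        # Combine the .o and .e output into one file.
        args += ["-j", "oe"]
    args += ["-q", queue, script]
    return args


if __name__ == "__main__":
    print(" ".join(build_qsub_args("myscript.job", "parallel", 16, "5:00")))
```

On BioU the resulting list could be passed to `subprocess.run`; building it as a list avoids shell quoting problems with script names.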
A form of interactive access is available on BioU by using the -I option to qsub. For example, the command
qsub -I -q serial -l nodes=1:ppn=1 -l walltime=5:00
requests interactive access to 1 core for 5 minutes in the serial queue. Your qsub -I request will wait until it can be satisfied. If you want to cancel your request you should type ^C.
When you get your shell prompt back your interactive job is ready to start. At this point any commands you enter will be run as if you had entered them in a batch script. Stdin, stdout, and stderr are connected to your terminal. To run an MPI program you must use the mpirun command just as you would in a batch script.
When you finish your interactive session type ^D. When you use qsub -I you are charged for the entire time you hold your processors whether you are computing or not. Thus, as soon as you are done executing commands you should type ^D.
Other qsub options
Besides those options mentioned above, there are several other options to qsub that may be useful. See man qsub for a complete list.
- -m a|b|e|n
- Defines the conditions under which a mail message will be sent about a job. If "a", mail is sent when the job is aborted by the system. If "b", mail is sent when the job begins execution. If "e", mail is sent when the job ends. If "n", no mail is sent. This is the default.
- -M userlist
- Specifies the users to receive mail about the job. Userlist is a comma-separated list of email addresses. If omitted, it defaults to the user submitting the job.
- -v variable_list
- This option exports those environment variables named in the variable_list to the environment of your batch job. The -V option, which exports all your environment variables, has been disabled on BioU.
- -r y|n
- Indicates whether or not a job should be automatically restarted if it fails due to a system problem. The default is to not restart the job. Note that a job which fails because of a problem in the job itself will not be restarted.
- -W group_list=charge_id
- Indicates to which charge_id you want a job to be charged. If you only have one grant on BioU you do not need to use this option; otherwise, you should charge each job to the appropriate grant.
You can see your valid charge_ids by typing groups at the BioU prompt. Typical output will look like
sy2be6n ec3l53p eb3267p jb3l60q
Your default charge_id is the first group in the list; in this example "sy2be6n". If you do not specify -W group_list for your job, this is the grant that will be charged.
- -W depend=dependency:jobid
- Specifies how the execution of this job depends on the status of other jobs. Some values for dependency are:
- after: this job can be scheduled after job jobid begins execution.
- afterok: this job can be scheduled after job jobid finishes successfully.
- afternotok: this job can be scheduled after job jobid finishes unsuccessfully.
- afterany: this job can be scheduled after job jobid finishes in any state.
- before: this job must begin execution before job jobid can be scheduled.
- beforeok: this job must finish successfully before job jobid begins.
- beforenotok: this job must finish unsuccessfully before job jobid begins.
- beforeany: this job must finish in any state before job jobid begins.
Specifying "before" dependencies requires that job jobid be submitted with -W depend=on:count. See the man page for details on this and other dependencies.
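The two -W forms described above can also be assembled in a script. The following sketch picks the default charge_id from the output of groups and builds the arguments for a charge_id or a dependency. All function names are hypothetical, the jobid "12345" is a made-up placeholder, and the sketch only builds argument lists in the forms shown in this section; consult man qsub before combining them.

```python
KNOWN_DEPENDENCIES = {
    "after", "afterok", "afternotok", "afterany",
    "before", "beforeok", "beforenotok", "beforeany",
}


def default_charge_id(groups_output):
    """Return the first group listed in the output of the groups command."""
    names = groups_output.split()
    if not names:
        raise ValueError("no groups found")
    return names[0]


def charge_args(charge_id):
    """Return the qsub arguments that charge a job to charge_id."""
    return ["-W", f"group_list={charge_id}"]


def depend_args(dependency, jobid):
    """Return the qsub arguments that make a job depend on jobid."""
    if dependency not in KNOWN_DEPENDENCIES:
        raise ValueError(f"unknown dependency type: {dependency}")
    return ["-W", f"depend={dependency}:{jobid}"]


if __name__ == "__main__":
    print(default_charge_id("sy2be6n ec3l53p eb3267p jb3l60q"))
    print(charge_args("ec3l53p"))
    print(depend_args("afterok", "12345"))  # placeholder jobid
```

With the sample groups output shown above, the default charge_id is sy2be6n.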
Monitoring and Killing Jobs
The qstat -a command displays the status of the queues. It shows running and queued jobs. For each job it shows the amount of walltime and the number of cores and processors requested. For running jobs it shows the amount of walltime the job has already used. The qstat -f command, which takes a jobid as an argument, provides more extensive information for a single job.
The qdel command is used to kill queued and running jobs. An example is the command
Software Packages
A complete list of software packages installed on BioU will be available soon.
You can run software either from the command line or via the Galaxy interface.
Running bioinformatics software from the command line
Most bioinformatics software is installed in the directory /packages/bin, with the exception of major bioinformatics software suites (including Phylip, Fasta, Hmmer3, and Meme). These suites are installed in subdirectories of /packages/bin. In addition, many of the compiled executables from codon have been placed in the directory /packages/biomed-codon/bin.
To access these programs and software suites, include the following directories in your BioU path:
/packages/bin
/packages/bin/phylip
/packages/bin/MPIphylip
/packages/bin/fasta
/packages/bin/hmmer3
/packages/bin/meme/bin
/packages/biomed-codon/bin
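If you launch these programs from a Python script, one way to make the directories listed above visible is to extend the PATH value before starting any subprocess. The sketch below appends each directory only if it is not already present. The function name `extend_path` is hypothetical, and this is an illustration rather than the setup that PSC provides.

```python
import os

# Directories listed in this guide.
BIOU_DIRS = (
    "/packages/bin",
    "/packages/bin/phylip",
    "/packages/bin/MPIphylip",
    "/packages/bin/fasta",
    "/packages/bin/hmmer3",
    "/packages/bin/meme/bin",
    "/packages/biomed-codon/bin",
)


def extend_path(current_path, extra_dirs=BIOU_DIRS):
    """Append each of extra_dirs to a colon-separated PATH unless already present."""
    parts = [p for p in current_path.split(":") if p]
    for directory in extra_dirs:
        if directory not in parts:
            parts.append(directory)
    return ":".join(parts)


if __name__ == "__main__":
    os.environ["PATH"] = extend_path(os.environ.get("PATH", ""))
    print(os.environ["PATH"])
```

The change affects only the running Python process and the programs it starts, not your login shell.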
As a convenience, csh users may wish to add the following line to their .login file. This will set up the path variable appropriately, and also define a few environment variables required by the programs:
Running bioinformatics software using Galaxy
No special setup is needed to run bioinformatics software using Galaxy. Simply log into Galaxy at https://biou.psc.edu/galaxy.
The Module Command
Many software packages require paths and other variables to be set before they will run, and these definitions must often be modified to change versions of a package. The module command makes this process easier. For use of the module command, including its use in batch jobs, see the module documentation.
Stay Informed
As a user of BioU, it is imperative that you stay informed of changes to the machine's environment. Refer to this document frequently. In addition, important system information is posted to the PSC's Web page of bboard posts.
You will also periodically receive email from PSC with information about BioU. To ensure that you receive this email, make sure your email forwarding is set properly by following the instructions for setting your email forwarding.
Acknowledgement in Publications
PSC requests that a copy of any publication (preprint or reprint) resulting from research done on BioU be sent to the PSC Allocations Coordinator. We also request that you include an acknowledgement of PSC in your publication.
Applicability of NIH's Public Access Policy
All users should be aware that the NIH Public Access Policy, which requires that authors of published peer-reviewed manuscripts deposit those manuscripts in PubMed Central within 12 months of publication, applies to publications arising from projects using BioU. For more information on the NIH Public Access Policy see http://publicaccess.nih.gov.
Reporting a Problem
You have two options for reporting problems on BioU.
- You can call the User Services Hotline at 412-268-6350 from 9:00 a.m. until 5:00 p.m., Eastern time, Monday through Friday.