Connecting to Bridges

 

Before the first time you connect to Bridges, you must create your PSC password.  Depending on your preferences, you may want to change your login shell once you are logged in. 

We take security very seriously! Be sure to read and comply with PSC policies on passwords, security guidelines, resource use, and privacy.

 

If you have questions at any time, you can send email to bridges@psc.edu.

This document will tell you how to create or change your PSC password and how to connect to Bridges.

 

Create or change your PSC password

If you do not already have an active PSC account, you must create a PSC password (also called a PSC Kerberos password) before you can connect to Bridges.  Your PSC password is the same on all PSC systems, so if you have an active account on another PSC system, you do not need to reset it before connecting to Bridges.

Your PSC password is separate from your XSEDE Portal password. Resetting one password does not change the other password. 

Setting your initial PSC password

To set your initial PSC password, use the web-based PSC password change utility.

See PSC password policies.

Changing your PSC password

There are two ways to change or reset your PSC password:

  • use the web-based PSC password change utility
  • use the kpasswd command when logged in to a PSC system

When you change your PSC password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.

 

Connect to Bridges

When you connect to Bridges, you are connecting to one of Bridges' login nodes.   The login nodes are used for managing files, submitting batch jobs and launching interactive sessions.  They are not suited for production computing.  

See the Running Jobs section of this User Guide for information on production computing on Bridges.

There are several methods you can use to connect to Bridges.

  • You can access Bridges through a web browser by using the OnDemand software.  You will still need to understand Bridges' partition structure and the options which specify job limits like time and memory use, but OnDemand provides a more modern, graphical interface to Bridges.

    See the OnDemand section of this User Guide for more information.

  • You can connect to a traditional command line interface by logging in via one of these:

    • ssh, using either XSEDE or PSC credentials.  If you are registered with XSEDE for DUO Multi-Factor Authentication (MFA), you can use this security feature in connecting to Bridges.

      See the XSEDE instructions to set up DUO for MFA.

    • gsissh, if you have the Globus toolkit installed
    • XSEDE Single Sign On, including using Multi-Factor authentication if you are an XSEDE user

     

    This document explains how to use ssh, gsissh or XSEDE Single Sign On to access Bridges.

  

SSH

You can use an ssh client from your local machine to connect to Bridges using either your PSC or XSEDE credentials. 

SSH is a program that enables secure logins over an insecure network.  It encrypts the data passing both ways so that if it is intercepted it cannot be read.

SSH is client-server software, which means that both the user's local computer and the remote computer must have it installed.  SSH server software is installed on all the PSC machines. You must install SSH client software on your local machine. 

Read more about ssh in the Getting Started with HPC document.

Once you have an ssh client installed, you can use either your PSC credentials or XSEDE credentials (optionally with DUO MFA) to connect to Bridges.    Note that you must have created your PSC password before you can use ssh to connect to Bridges.

Use ssh to connect to Bridges using XSEDE credentials and (optionally) DUO MFA:

  1. Using your ssh client, connect to hostname bridges.psc.xsede.org  or bridges.psc.edu  using port 2222.
    Either hostname will connect you to Bridges, but you must specify port 2222.
  2. Enter your XSEDE username and password when prompted.
  3. (Optional) If you are registered with XSEDE DUO, you will receive a prompt on your phone.  Once you have answered it, you will be logged in.
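For example, from the command line of an ssh client on your local machine, a connection using XSEDE credentials might look like this (substitute your own XSEDE username):

ssh -p 2222 XSEDE-username@bridges.psc.edu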

Use ssh to connect to Bridges using PSC credentials:

  1. Using your ssh client, connect to hostname bridges.psc.xsede.org  or bridges.psc.edu  using the default port (22).
    Either hostname will connect you to Bridges. You do not have to specify the port.
  2. Enter your PSC username and password when prompted.
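For example (substitute your own PSC username):

ssh username@bridges.psc.edu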

 

 Read more about using SSH to connect to PSC systems

 

gsissh

If you have installed the Globus toolkit you can use gsissh to connect to Bridges. Gsissh is a version of ssh which uses certificate authentication.   Use the command myproxy-logon to get a suitable certificate. The Globus toolkit includes a man page for myproxy-logon.

See more information on the Globus Toolkit website.

 

Public-private keys

You can also use public-private key pairs to connect to Bridges. To do so, you must first fill out this form to register your keys with PSC.

 

XSEDE single sign on

XSEDE users can use their XSEDE usernames and passwords in the XSEDE User Portal Single Sign On Login Hub (SSO Hub) to access bridges.psc.xsede.org  or bridges.psc.edu.    

You must use DUO Multi-Factor Authentication in the SSO Hub.

See the XSEDE instructions to set up DUO for Multi-Factor Authentication.

System Configuration

Bridges comprises four types of computational nodes: RSM, RSM-GPU, LSM and ESM nodes, described below.

Bridges' computational nodes supply 1.3018 Pf/s and 274 TiB RAM.  The Bridges system also includes more than 6PB of node-local storage and 10PB of shared storage in the Pylon file system.

In addition to its computational nodes, Bridges contains a number of login, database, web server and data transfer nodes.

RSM nodes

Number 752
CPUs 2 Intel Haswell (E5-2695 v3) CPUs; 14 cores/CPU; 2.3 - 3.3 GHz
RAM 128GB, DDR4-2133
Cache 35MB LLC
Node-local storage 2 HDDs, 4TB each
Server HPE Apollo 2000

RSM-GPU nodes

K80 GPU nodes
Number 16
GPUs 2 NVIDIA Tesla K80 (Kepler architecture)
CPUs 2 Intel Haswell (E5-2695 v3) CPUs; 14 cores/CPU; 2.3 - 3.3 GHz
RAM 128GB, DDR4-2133
Cache 35MB LLC
Node-local storage 2 HDDs, 4TB each
Server HPE Apollo 2000

P100 GPU nodes
Number 32
GPUs 2 NVIDIA Tesla P100 (Pascal architecture)
CPUs 2 Intel Broadwell (E5-2683 v4) CPUs; 16 cores/CPU; 2.1 - 3.0 GHz
RAM 128GB, DDR4-2400
Cache 40MB LLC
Node-local storage 2 HDDs, 4TB each
Server HPE Apollo 2000

LSM nodes

Number 8
CPUs 4 Intel Xeon E7-8860 v3 CPUs; 16 cores/CPU; 2.2 - 3.2 GHz
RAM 3TB, DDR4-2133
Cache 40MB LLC
Node-local storage 4 HDDs, 4TB each
Server HPE ProLiant DL580

Number 34
CPUs 4 Intel Xeon E7-8870 v4 CPUs; 20 cores/CPU; 2.1 - 3.0 GHz
RAM 3TB, DDR4-2400
Cache 50MB LLC
Node-local storage 4 HDDs, 4TB each
Server HPE ProLiant DL580

ESM nodes

Number 2
CPUs 16 Intel Xeon E7-8880 v3 CPUs; 18 cores/CPU; 2.3 - 3.1 GHz
RAM 12TB, DDR4-2133
Cache 45MB LLC
Node-local storage 16 HDDs, 4TB each
Server HPE Integrity Superdome X

Number 2
CPUs 16 Intel Xeon E7-8880 v4 CPUs; 22 cores/CPU; 2.2 - 3.3 GHz
RAM 12TB, DDR4-2400
Cache 55MB LLC
Node-local storage 16 HDDs, 4TB each
Server HPE Integrity Superdome X

Database, web server, data transfer, login nodes

CPUs 2 Intel Xeon E5 series CPUs; 14 cores/CPU; 2.3 - 3.3 GHz
RAM 128GB
Cache 35MB LLC
Node-local storage Database nodes have additional SSDs or HDDs
Server HPE ProLiant DL360s or HPE ProLiant DL380s

Account Administration

Charging

Bridges regular 

The  RSM nodes are allocated as "Bridges regular".  This does not include Bridges' GPU nodes.  Service Units are defined in terms of compute resources: 

1 SU = 1 core-hour
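For example, a job that uses all 28 cores of one RSM node for 2 hours is charged 28 x 2 = 56 SUs.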

Bridges large

 The LSM and ESM nodes are allocated as "Bridges large".  Service Units are defined in terms of memory requested:

1 SU = 1 TB-hour 
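For example, a job that requests 3TB of memory for 2 hours is charged 3 x 2 = 6 SUs.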

Bridges GPU

Bridges contains two kinds of GPU nodes: NVIDIA Tesla K80s and NVIDIA Tesla P100s. Because the two node types differ in performance, they are charged at different rates.

K80 nodes 

The K80 nodes hold 4 GPU units each; each GPU can be allocated separately.  Service units (SUs) are defined in terms of GPU-hours:

1 GPU-hour = 1 SU

Note that the use of an entire K80 GPU node for one hour would be charged 4 SUs.

P100 nodes

The P100 nodes hold 2 GPU units each, which can be allocated separately.  Service units (SUs) are defined in terms of GPU-hours:

1 GPU-hour = 2.5 SUs

Note that the use of an entire P100 node for one hour would be charged 5 SUs.

 

Managing multiple grants

If you have more than one grant, be sure to charge your usage to the correct one.  Usage is tracked by group name.

Find your group names

To find your group names, use the id command.

id -Gn

will list all the groups you belong to.

Find your current group

id -gn

will list the group associated with your current session.

Change your group for a login session

To change the group which will be charged for usage during a login session, use the newgrp command.

newgrp groupname

Until you log out (or issue another newgrp command), groupname is charged for all usage.  All files created during this time will belong to groupname, and their storage is charged against the quota for groupname.

On the next login, your default group is back in effect, and will be charged for usage.
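For example, a typical sequence for one login session is to check your current group, switch to the group for another grant, and confirm the change (groupname is a placeholder for one of your group names):

id -gn
newgrp groupname
id -gn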

Change your default group permanently

Your primary group is charged with all usage by default.  To change your primary group, the group to which your SLURM jobs are charged by default, use the change_primary_group command.  Type:

change_primary_group -l

to see all your groups.  Then type

change_primary_group groupname

to set groupname as your default group.

 

Charging for batch or interactive use

Batch jobs and interactive sessions are charged to your primary group by default.  To charge your usage to a different group, you must specify the appropriate group  with the -A groupname  option to the SLURM sbatch command.   See the Running Jobs section of this Guide for more information on batch jobs,  interactive sessions and SLURM.
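For example, a batch job could be charged to a group other than your primary group with a command like the following, where groupname is a placeholder for one of your charging groups and myscript.job is your batch script:

sbatch -A groupname myscript.job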

Please note that any files created during a job are owned by your primary group, no matter which group is charged for the job.

 

Tracking your usage

There are several methods you can use to track your Bridges usage. The xdusage command is available on Bridges. There is a man page for xdusage. The projects  command will also help you keep tabs on your usage.  It shows grant  information, including usage and the pylon directories associated with the grant.

Type:

projects

 

For more detailed accounting data you can use the Grant Management System.  You can also track your usage through the XSEDE User Portal. The xdusage and projects commands and the XSEDE Portal accurately reflect the impact of a grant renewal, but the Grant Management System currently does not.

Managing your XSEDE allocation

Most account management functions for your XSEDE grant are handled through the XSEDE User Portal.  You can search the Knowledge Base to get help with common questions.

File Spaces

There are several distinct file spaces available on Bridges, each serving a different function.

  • Home ($HOME), your home directory on Bridges
  • pylon5 ($SCRATCH),  a Lustre system for persistent file storage.  Pylon5 has replaced pylon1.
  • Node-local storage ($LOCAL), scratch storage on the local disk associated with a running job
  • Memory storage ($RAMDISK), scratch storage in the memory associated with a running job

Note that pylon2 was decommissioned on June 19, 2018.

File expiration

Three months after your grant expires all of your Bridges files associated with that grant will be deleted, no matter which file space they are in. You will be able to login during this 3-month period to transfer files, but you will not be able to run jobs or create new files.

File permissions

Access to files in any Bridges space is governed by Unix file permissions which you control.  If  your data has additional security or compliance requirements, please contact compliance@psc.edu.  

 

Home ($HOME)

This is your Bridges home directory. It is the usual location for your batch scripts, source code and parameter files. Its path is /home/username, where  username is your PSC userid. You can refer to your home directory with the environment variable $HOME. Your home directory is visible to all of Bridges's nodes.

Your home directory is backed up daily, although it is still a good idea to store copies of your important files in another location, such as the pylon5 file system or a local file system at your site. If you need to recover a home directory file from backup, send email to remarks@psc.edu. The recovery process takes 3 to 4 days.

$HOME quota

Your home directory has a 10GB quota. You can check your home directory usage using the quota command or the command du -sh. To improve the access speed to your home directory files you should stay as far below your home directory quota as you can.
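For example, to check your usage (the $HOME argument makes du report on your home directory no matter where you run it):

quota
du -sh $HOME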

Grant expiration

Three months after a grant expires, the files in your home directory associated with that grant will be deleted.

 

pylon5 ($SCRATCH)

The pylon5 file system is persistent storage, and can be used as working space for your running jobs. It provides fast access for data read or written by running jobs.  IO to pylon5 is much faster than to your home directory.

Pylon5 is a Lustre file system shared across all of Bridges' nodes.  It is available on Bridges compute nodes as $SCRATCH.

Files on pylon5 are not backed up, so you should store copies of important pylon5 files in another location.

 

pylon5 directories

The path of your pylon5 home directory is /pylon5/groupname/username, where groupname is the name for the PSC group associated with your grant.  Use the id command to find your group name.

The command id -Gn will list all the groups you belong to.
The command id -gn will list the group associated with your current session.

If you have more than one grant, you will have a pylon5 directory for each grant. Be sure to use the appropriate directory when working with multiple grants.

pylon5 quota

Your usage quota for each of your grants is the Pylon storage allocation you received when your proposal was approved.  If your total use in pylon5 exceeds this quota your access to the partitions on Bridges will be shut off until you are under quota.

Use the du -sh  or projects command to check your pylon5 usage. You can also check your usage on the XSEDE User Portal.
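For example, to check the usage of one of your pylon5 directories, substituting your own group and user names:

du -sh /pylon5/groupname/username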

If you have multiple grants, it is very important that you store your files in the correct pylon5 directory.

Grant expiration

Three months after a grant expires, the files in any pylon5 directories associated with that grant will be deleted.

  

Node-local ($LOCAL)

Each of Bridges's nodes has a local file system attached to it. This local file system is only visible to the node to which it is attached.  The local file system provides fast access to local storage.

This file space is available on all nodes as $LOCAL.

If your application performs a lot of small reads and writes, then you could benefit from using $LOCAL. Many genomics applications are of this type.

$LOCAL is only available when your job is running, and can only be used as working space for a running job. Once your job finishes your local files are inaccessible and deleted. To use local space, copy files to $LOCAL at the beginning of your job and back out to a persistent file space before your job ends.

If a node crashes all the $LOCAL files are lost. Therefore, you should checkpoint your $LOCAL files by copying them to pylon5 during long runs.

Multi-node jobs

If you are running a multi-node job the variable $LOCAL points to the local file space on the node that is running your rank 0 process.

You can use the srun command to copy files between $LOCAL on the nodes in a multi-node job.  See the MPI job script in the Running Jobs section of this User  Guide for details.

$LOCAL size

The maximum amount of local space varies by node type. The RSM (128GB) nodes have a maximum of 3.7TB.  The LSM (3TB) nodes have a maximum of 14TB and the  ESM (12TB) nodes have a maximum of 49TB.

To check on your local file space usage type:

du -sh

 There is  no charge for the use of $LOCAL.

Using $LOCAL

To use $LOCAL you must first copy your files to $LOCAL at the beginning of your script, before your executable runs. The following script is an example of how to do this:

RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync -aP $sourcedir $LOCAL/
    RC=$?
    let n=n+1
    sleep 10
done

Set $sourcedir to point to the directory that contains the files to be copied before you execute your program. This code will try at most 20 times to copy your files. If it succeeds, the loop will exit. If an invocation of rsync was unsuccessful, the loop will try again and pick up where it left off.

At the end of your job you must copy your results back from $LOCAL or they will be lost. The following script will do this.

mkdir $SCRATCH/results
RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync -aP $LOCAL/ $SCRATCH/results
    RC=$?
    let n=n+1
    sleep 10
done

This code fragment copies your files to a directory in your pylon5 file space named results, which you must have created previously with the mkdir command. Again it will loop at most 20 times and stop if it is successful.

 

 

Memory files ($RAMDISK)

You can also use the memory allocated for your job for IO rather than using disk space. This will offer the fastest IO on Bridges.

In a running job the environment variable $RAMDISK will refer to the memory associated with the nodes in use.

The amount of memory space available to you depends on the size of the memory on the nodes and the number of nodes you are using. You can only perform IO to the memory of nodes assigned to your job.

If you do not use all of the cores on a node, you are allocated memory in proportion to the number of cores you are using.  Note that you cannot use 100% of a node's memory for IO; some is needed for program and data usage.

$RAMDISK is only available to you while your job is running, and can only be used as working space for a running job. Once your job ends this space is inaccessible. To use memory files, copy files to $RAMDISK at the beginning of your job and back out to a permanent space before your job ends.  If your job terminates abnormally your memory files are lost.

 Within your job you can cd to $RAMDISK, copy files to and from it, and use it to open files.  Use the command du -sh to see how much space you are using.
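A minimal sketch of using $RAMDISK inside a job script, with hypothetical file and program names:

# move to the memory file space for this job
cd $RAMDISK
# copy a hypothetical input file from pylon5 into memory
cp $SCRATCH/input.dat .
# run a hypothetical program that reads and writes in $RAMDISK
$SCRATCH/myprogram input.dat output.dat
# copy results back to persistent storage before the job ends
cp output.dat $SCRATCH/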

If you are running a multi-node job the $RAMDISK variable points to the memory space on the node that is running your rank 0 process.

 

 

 

Transferring Files

Pylon2 decommissioned June 19

All of your pylon2 files should be moved to other spaces and deleted from pylon2. Instructions to move your files to pylon5 are below.

If you have questions or run into any issues in moving your files from pylon2 to pylon5, please let us know by emailing bridges@psc.edu.

Instructions for transferring files from pylon2 to pylon5

Pylon2 has been unmounted from the Bridges login nodes. This means that you cannot see or access any pylon2 directory from a login node.  You will get a "No such file or directory" error if you try.

You can use the rsync command or the Globus web application to transfer your files from pylon2 to pylon5.  We suggest you use rsync.

Remember to delete your pylon2 files once your transfers have finished.

rsync

The rsync command can be run on Bridges' compute nodes in an interactive session or a batch job, or by using ssh on one of Bridges' high-speed data transfer nodes.  An advantage of rsync is that if the transfer does not complete, you can rerun the rsync command, and rsync will copy only those files which have not already been transferred.

PSC has created a shell script that you can use to move files from pylon2 to pylon5.  The shell script can be used in an interactive session or in a batch job.

Note that rsync will overwrite files in the destination directory if a file of the same name with a more recent modified time exists in the source directory.    To prevent this, the examples below copy pylon2 files into a new subdirectory on pylon5.  Once transferred, please examine the files and move them to the directory where you want them.

In all the examples given here, change groupname, username and new-directory to be your charging group, userid and name of the new subdirectory (if you like) to store the files.

Shell script 

The PSC-provided shell script is /opt/packages/utilities/pylon2to5. It will:

  • Copy all files from your pylon2 home directory (/pylon2/groupname/username) to a subdirectory named "from_pylon2" under your pylon5 home directory (/pylon5/groupname/username/from_pylon2)
  • Loop until it succeeds for all files or is killed (e.g. due to timeout).  This could use a lot of SUs if failures persist.
  • Skip copying older files from pylon2 on top of newer files in pylon5 with the same name. This is default rsync behavior.

Interactive session

To start an interactive session, type

interact

To use the PSC-supplied shell script, when your interactive session begins type

/opt/packages/utilities/pylon2to5

If you prefer not to use the PSC-supplied shell script, when your session begins you can use a command like

rsync -av /pylon2/groupname/username/    /pylon5/groupname/username/new-directory

Batch job

To run the PSC-supplied shell script in a batch job, create a batch script with the following content:

#!/bin/bash
#SBATCH -c 4
#SBATCH -p RM-shared
## newgrp my_other_grant
date 
## The next line runs the shell script
/opt/packages/utilities/pylon2to5

To move files from a grant that is not your default, uncomment the newgrp command in the script by removing "##" and substitute the correct group name for "my_other_grant".

If you prefer not to use the PSC-supplied shell script, you can create a batch script  which runs rsync like the one shown here.  Change groupname, username and new-directory to be your charging group, userid and name of the new subdirectory (if you like) to store the files.

#!/bin/bash
#SBATCH -p RM-shared
rsync -av /pylon2/groupname/username/    /pylon5/groupname/username/new-directory

Submit your batch script by typing (where script-name is the name of your script file)

sbatch script-name

Data transfer node

Use Bridges' high-speed data transfer nodes to move your files from pylon2 to pylon5.  At the Bridges' prompt, type

ssh data.bridges.psc.edu "rsync -av /pylon2/groupname/username/    /pylon5/groupname/username/new-directory"

Globus

To use the Globus web application, visit www.globus.org.

  • Choose “PSC Bridges with XSEDE Authentication” as each endpoint (you will need to authenticate with your XSEDE login and password).
  • For the first path, choose the pylon2 directory that you wish to transfer files from. For example: /pylon2/chargeid/userid
  • For the second path, choose the appropriate pylon5 target directory: /pylon5/chargeid/userid

To find your chargeid on Bridges, use the projects command to see all of the allocations that you have access to.

  • At the bottom of the Globus transfer page, choose the Transfer Settings that you wish to use (e.g. “preserve source file modification times”) and transfer your files as you would through any other web application.


Paths for Bridges file spaces

For all file transfer methods other than cp, you must always use the full path for your Bridges files.  The start of the full paths for your Bridges directories are:

Home directory     /home/username

Pylon2 directory   /pylon2/groupname/username

Pylon5 directory   /pylon5/groupname/username

The command id -Gn will show all of your valid groupnames.  You  have a pylon2 and pylon5 directory for each grant you have.

 

Transfers into your Bridges home directory 

Your home directory quota is 10GB, so large files cannot be stored there; they should be copied into one of your pylon file spaces instead. Exceeding your home directory quota will prevent you from writing more data into your home directory and will adversely impact other operations you might want to perform.  

 

rsync

 

You can use the rsync command to copy files to and from Bridges. A sample rsync command to copy to a Bridges directory is

rsync -rltpDvp -e 'ssh -l joeuser' source_directory data.bridges.psc.edu:target_directory

 

Substitute your userid for 'joeuser'. Make sure you use the correct group name in your target directory. By default, rsync will not replace a newer file in the target directory with an older file of the same name from the source; it will overwrite files that are older in the target directory.

 

We recommend the rsync options -rltDvp. See the rsync man page for information on these options and others you might want to use. We also recommend the option

-oMACS=umac-64@openssh.com

If you use this option your transfer will use a faster data validation algorithm.
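For example, the recommended options can be combined with the sample command above; here the MAC option is assumed to be passed to the underlying ssh via -e (substitute your userid and directories):

rsync -rltDvp -e 'ssh -l joeuser -oMACS=umac-64@openssh.com' source_directory data.bridges.psc.edu:target_directory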

 

You may want to put your rsync command in a loop to ensure that it completes. A sample loop is

RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync ...
    RC=$?
    let n=n+1
    sleep 10
done

 

This loop will try your rsync command 20 times. If it succeeds it will exit. If an rsync invocation is unsuccessful the system will try again and pick up where it left off. It will copy only those files that have not already been transferred. You can put this loop, with your rsync command, into a batch script and run it with sbatch.

 

Globus

Globus can be used for any file transfer to Bridges. It tracks the progress of the transfer and retries when there is a failure; this makes it especially useful for transfers involving large files or many files. This includes transfers between the pylon2 and pylon5 filesystems and file transfers between the Data Supercell and pylon2 or pylon5.  

To use Globus to transfer files you must authenticate either via a Globus account or with InCommon credentials.

Globus account

You can set up a Globus account at the Globus site.

InCommon credentials

If you wish to use InCommon credentials to transfer files to/from Bridges, you must first provide your CI Login Certificate Subject information to PSC.  Follow these steps:

  1. Find your Certificate Subject string
    1. Navigate your web browser to https://cilogon.org/.
    2. Select your institution from the 'Select an Identity Provider' list.
    3. Click the 'Log On' button.  You will be taken to the web login page for your institution.
    4. Login with your username and password for your institution.
      • If your institution has an additional login requirement (e.g., Duo), authenticate to that as well.

      After successfully authenticating to your institution's web login interface, you will be returned to the CILogon webpage.  Note the boxed section near the top that lists a field named 'Certificate Subject'.

  2. Send your Certificate Subject string to PSC
    1. In the CILogon webpage, select and copy the Certificate Subject text. Take care to get the entire text string if it is broken up onto multiple lines.
    2. Send email to support@psc.edu.  Paste your Certificate Subject field into the message, asking that it be mapped to your PSC username.

Your CI Login Certificate Subject information will be added within one business day, and you will be able to begin transferring files to and from Bridges.

 

Globus endpoints

Once you have the proper authentication you can initiate file transfers from the Globus site.  A Globus transfer requires a Globus endpoint, a file path and a file name for both the source and destination.  The endpoints for Bridges are:

  • psc#bridges-xsede if you are using an XSEDE User Portal account for authentication
  • psc#bridges-cilogon if you are using InCommon for authentication

These endpoints are owned by psc@globusid.org. If you use DUO MFA for your XSEDE authentication, you do not need it here; DUO MFA cannot be used with Globus. You must always specify a full path for the Bridges file systems.  See Paths for Bridges file spaces for details.

 

Globus-url-copy

The globus-url-copy command can be used if you have access to Globus client software.  Both the globus-url-copy and myproxy-logon commands are available on Bridges, and can be used for file transfers internal to the PSC.

To use globus-url-copy you must have a current user proxy certificate.  The command grid-proxy-info will tell you if you have a current user proxy certificate and, if so, the remaining life of your certificate.

Use the myproxy-logon command to get a valid user proxy certificate if any one of these applies:

  • you get an error from the grid-proxy-info command
  • you do not have a current user proxy certificate
  • the remaining life of your certificate is not sufficient for your planned file transfer

When prompted for your MyProxy passphrase enter your XSEDE Portal password.

To use globus-url-copy for transfers to a machine you must know the Grid FTP server address.  The Grid FTP server address for Bridges is

gsiftp://gridftp.bridges.psc.edu

The use of globus-url-copy always requires full paths. See Paths for Bridges file spaces for details.

 

scp

To use scp for a file transfer you must specify a source and destination for your transfer.  The format for either source or destination is

username@machine-name:path/filename

For transfers involving Bridges,  username is your PSC username.  The machine-name should be given as data.bridges.psc.edu. This is the name for a high-speed data connector at PSC. We recommend using it for all file transfers using scp involving Bridges.  Using it prevents file transfers from disrupting interactive use on Bridges' login nodes.

File transfers using scp must specify full paths for Bridges file systems. See Paths for Bridges file spaces for details.

sftp

To use sftp, first connect to the remote machine:

    sftp username@machine-name

When  Bridges is the remote machine, use your PSC userid as  username. The Bridges machine-name should be specified as data.bridges.psc.edu. This is the name for a high-speed data connector at PSC.  We recommend using it for all file transfers using sftp involving Bridges.  Using it prevents file transfers from disrupting interactive use on Bridges' login nodes.

You will be prompted for your password on the remote machine. If Bridges is the remote machine enter your PSC password.

You can then enter sftp subcommands, like put to copy a file from the local system to the remote system, or get to copy a file from the remote system to the local system.  

To copy files into Bridges you must either cd to the proper directory or use full pathnames in your file transfer commands. See Paths for Bridges file spaces for details.

 

Two-factor Authentication

If you are required to use two-factor authentication (TFA) to access Bridges' filesystems, you must enroll in XSEDE DUO.  Once that is complete, use scp or sftp to transfer files to/from Bridges.

TFA users must use port 2222 and XSEDE Portal usernames and passwords.  The machine name for these transfers is data.bridges.psc.edu.

In the examples below, myfile is the local filename, XSEDE-username is your XSEDE Portal username and /path/to/file is the full path to the file on a Bridges filesystem. Note that -P ( capital P) is necessary.

scp

Transfer a file from a local machine to Bridges:

scp -P 2222 myfile XSEDE-username@data.bridges.psc.edu:/path/to/file

Transfer a file from Bridges to a local machine:

scp -P 2222 XSEDE-username@data.bridges.psc.edu:/path/to/file myfile

sftp

Use sftp interactively:

sftp -P 2222 XSEDE-username@data.bridges.psc.edu

Then use the put command to copy a file from the local machine to Bridges, or the get command to transfer a file from Bridges to the local machine.
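For example, a short interactive session might look like this, where myfile and /path/to/file are placeholders as above:

sftp> put myfile /path/to/file
sftp> get /path/to/file myfile
sftp> quit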

 

Graphical SSH client

If you are using a graphical SSH client, configure it to connect to data.bridges.psc.edu on port 2222/TCP. Log in using your XSEDE Portal username and password.

 

Transfer rates

PSC maintains a Web page at http://speedpage.psc.edu that lists average data transfer rates between all XSEDE resources.  If your data transfer rates are lower than these average rates or you believe that your file transfer performance is subpar, send email to bridges@psc.edu.  We will examine approaches for improving your file transfer performance.

Programming Environment

Bridges provides a rich programming environment for the development of applications.

C, C++ and Fortran 

Intel, Gnu and PGI compilers for C, C++ and Fortran are available on Bridges.  The compilers are:

        C       C++      Fortran
Intel   icc     icpc     ifort
Gnu     gcc     g++      gfortran
PGI     pgcc    pgc++    pgfortran

The Intel and Gnu compilers are loaded for you automatically.

To run the PGI compilers you must first issue the command

      module load pgi

There are man pages for each of the compilers.

 

OpenMP programming

 

To compile OpenMP programs you must add an option to your compile command:

  • Intel: -qopenmp; for example: icc -qopenmp myprog.c
  • Gnu: -fopenmp; for example: gcc -fopenmp myprog.c
  • PGI: -mp; for example: pgcc -mp myprog.c

 

Useful information about using OpenMP is available online.

MPI programming

 

Three types of MPI are supported on Bridges: MVAPICH2, OpenMPI and Intel MPI.

There are two steps to compile an MPI program:

  1. Load the correct module for the compiler and MPI type you want to use, unless you are using  Intel MPI.  The Intel MPI module is loaded for you on login.
  2. Issue the appropriate MPI wrapper command to compile your program

The three MPI types  may perform differently on different problems or in different programming environments.   If you are having trouble with one type of MPI, please try using another type.  Contact bridges@psc.edu for more help.

 

Compiler commands for MPI programs

Intel compilers

To use the Intel compilers, load the module shown and compile with the corresponding command:

  • Intel MPI: no module needed, it is loaded by default. Compile with mpiicc (C), mpiicpc (C++) or mpiifort (Fortran).
  • OpenMPI: load module mpi/intel_openmpi. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).
  • MVAPICH2: load module mpi/intel_mvapich. Compile with mpicc code.c -lifcore (C), mpicxx code.cpp -lifcore (C++) or mpifort code.f90 -lifcore (Fortran).
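For example, a minimal sketch of compiling a C MPI program with the Intel compilers and OpenMPI, where my_mpi_code.c is a hypothetical source file:

module load mpi/intel_openmpi
mpicc my_mpi_code.c -o my_mpi_code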

For proper Intel MPI behavior, you must set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT to 0. Otherwise the mpirun task placement settings you give will be ignored.

BASH:

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

CSH:

setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT 0

Gnu compilers

To use the Gnu compilers, load the module shown and compile with the corresponding command:

  • Intel MPI: no module needed, it is loaded by default. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).
  • OpenMPI: load module mpi/gcc_openmpi. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).
  • MVAPICH2: load module mpi/gcc_mvapich. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).

PGI compilers

To use the PGI compilers, load the module shown and compile with the corresponding command:

  • OpenMPI: load module mpi/pgi_openmpi. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).
  • MVAPICH2: load module mpi/pgi_mvapich. Compile with mpicc (C), mpicxx (C++) or mpifort (Fortran).

  

Other languages

Other languages, including Java, Python, R,  and MATLAB, are available.  See the software page for information.

 

Debugging and performance analysis

 

DDT

DDT is a debugging tool for C, C++ and Fortran 90 threaded and parallel codes.  It is client-server software.  Install the client on your local machine and then you can access the GUI on Bridges to debug your code.

See the DDT page for more information.

 

VTune

VTune is a performance analysis tool from Intel for serial, multithreaded and MPI applications.  Install the client on your local machine and then you can access the GUI on Bridges.  See the VTune page for more information.

 

Running Jobs

The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Bridges' compute nodes. All of your production computing must be done on Bridges' compute nodes; you cannot use Bridges' login nodes for your work.

Several partitions have been set up in SLURM to allocate resources efficiently.  Partitions can be considered job queues.  Different partitions control different types of Bridges' resources;  they are configured by the type of node and other job requirements.  You will choose the appropriate partition to run your jobs in based on the resources you need.

Regardless of which partition you use, you can work on Bridges in either  interactive mode - where you type commands and receive output back to your screen as the commands complete- or batch mode - where you first create a batch (or job) script which contains the commands to be run, then submit the job to be run as soon as resources are available.

This document covers Bridges' partitions, interactive sessions and batch jobs, including sample batch scripts and the sbatch command.

Partitions

Each SLURM partition manages a subset of Bridges' resources.  Each partition allocates resources to both interactive sessions and batch jobs that request resources from it.  There are seven partitions, organized by the type of resource they control:

  • RM, for jobs that will run on Bridges' RSM (128GB) nodes.
  • RM-shared, for jobs that will run on Bridges' RSM (128GB) nodes, but share a node with other jobs.
  • RM-small, for short jobs needing 2 full nodes or less, that will run on Bridges' RSM (128GB) nodes.  Nodes can be shared with other jobs.
  • GPU, for jobs that will run on Bridges' GPU nodes.
  • GPU-shared, for jobs that will run on Bridges' GPU nodes, but share a node with other jobs
  • GPU-small, for jobs that will use only one of Bridges' GPU nodes and 8 hours or less of wall time.
  • LM, for jobs that will run on Bridges' LSM and ESM (3TB and 12TB) nodes.

All the partitions use FIFO scheduling. If the top job in the partition will not fit on the machine, SLURM will skip that job and try to schedule the next job in the partition. The scheduler also follows policies to ensure that one user does not dominate the machine: there are limits to the number of nodes and cores a user can simultaneously use.

Use your allocation wisely:  To make the most of your allocation, use the shared partitions whenever possible.  Jobs in the RM and GPU partitions are charged for the use of all cores on a node.  Jobs in the RM-shared and GPU-shared partitions share nodes, and are only charged for the cores they are allocated. The RM partition is the default for the sbatch command, while RM-shared is the default for the interact command. The interact and sbatch commands are discussed below.

Know which partitions are open to you: Your Bridges allocations determine which partitions you can submit jobs to.

A "Bridges regular" allocation allows you to use the RM, RM-shared and RM-small partitions.
A "Bridges GPU" allocation allows you to use the GPU and GPU-shared partitions.
A "Bridges large" allocation allows you to use the LM partition.

If you have more than one grant, be sure to use the correct one when submitting jobs.  See the -A option to sbatch to change the grant that is charged for a job. 

Note that any files created by a job are owned by the grant in effect when the job is submitted, which is not necessarily the grant charged for the job.  See the newgrp command to change the grant currently in effect.

 

The following summarizes the resources available and the limits on Bridges' partitions.  More information on each partition follows.

RM
Node type: RSM, 128GB RAM, 28 cores, 8TB on-node storage
Nodes shared: No
Nodes: default 1; max 168 (if you need more than 168, contact bridges@psc.edu to make special arrangements)
Cores: default 28/node; max 28/node
GPUs: N/A
Walltime: default 30 min; max 48 hrs
Memory: 128GB/node

RM-shared
Node type: RSM, 128GB RAM, 28 cores, 8TB on-node storage
Nodes shared: Yes
Nodes: default 1; max 1
Cores: default 1; max 28
GPUs: N/A
Walltime: default 30 min; max 48 hrs
Memory: 4.5GB/core

RM-small
Node type: RSM, 128GB RAM, 28 cores, 8TB on-node storage
Nodes shared: Yes
Nodes: default 1; max 2
Cores: default 1; max 28/node
GPUs: N/A
Walltime: default 30 min; max 8 hrs
Memory: 4.5GB/core

GPU (P100 nodes)
Node type: 2 GPUs, 2 16-core CPUs, 8TB on-node storage
Nodes shared: No
Nodes: default 1; max 8 (there is a limit of 16 GPUs per job; because there are 2 GPUs on each P100 node, you can request at most 8 nodes)
Cores: default 32/node; max 32/node
GPUs: default 2 per node; max 2 per node
Walltime: default 30 min; max 48 hrs
Memory: 128GB/node

GPU (K80 nodes)
Node type: 128GB RAM, 4 GPUs, 2 14-core CPUs, 8TB on-node storage
Nodes shared: No
Nodes: default 1; max 4 (there is a limit of 16 GPUs per job; because there are 4 GPUs on each K80 node, you can request at most 4 nodes)
Cores: default 28/node; max 28/node
GPUs: default 4 per node; max 4 per node
Walltime: default 30 min; max 48 hrs
Memory: 128GB/node

GPU-shared (P100 nodes)
Node type: 2 GPUs, 2 16-core CPUs, 8TB on-node storage
Nodes shared: Yes
Nodes: default 1; max 1
Cores: default 16/GPU; max 16/GPU
GPUs: no default; max 2
Walltime: default 30 min; max 48 hrs
Memory: 7GB/GPU

GPU-shared (K80 nodes)
Node type: 4 GPUs, 2 14-core CPUs, 8TB on-node storage
Nodes shared: Yes
Nodes: default 1; max 1
Cores: default 7/GPU; max 7/GPU
GPUs: no default; max 4
Walltime: default 30 min; max 48 hrs
Memory: 7GB/GPU

GPU-small (P100 nodes)
Node type: 2 GPUs, 2 16-core CPUs, 8TB on-node storage
Nodes shared: No
Nodes: default 1; max 2 (because there are 2 GPUs on each P100 node, you can request at most 2 nodes)
Cores: default 32/node; max 32/node
GPUs: no default
Walltime: default 30 min; max 8 hrs
Memory: 128GB/node

LM
Node type: LSM nodes (3TB RAM, 16TB on-node storage) and ESM nodes (12TB RAM, 64TB on-node storage)
Nodes shared: Yes
Nodes: default 1; max 42 for 3TB nodes, 4 for 12TB nodes
Cores: jobs in LM are allocated 1 core per 48GB of memory requested
GPUs: N/A
Walltime: default 30 min; max 14 days
Memory: up to 12000GB

 

Partition summaries

 

RM partition

Jobs in the RM partition run on Bridges' RSM (128GB) nodes.  Jobs do not share nodes, and are allocated all 28 of the cores on each of the nodes assigned to them.  A job in the RM partition is charged for all 28 cores per node on its assigned nodes. 

RM jobs can use more than one node.  However, the memory space of  all the nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

The internode communication performance for jobs in the RM partition is best when using 42 or fewer nodes. 

When submitting a job to the RM partition, you should specify:

  • the number of  nodes
  • the walltime limit 

For information on requesting resources and submitting a job to the RM partition see the section below on the interact or the sbatch commands. 

 

RM-shared partition

Jobs in the RM-shared partition run on Bridges' RSM (128GB) nodes.  Jobs will share nodes, but not cores.   A job in the RM-shared partition will be charged only for the cores allocated to it, so it will use fewer SUs than a RM job.  It could also start running sooner.

RM-shared jobs are assigned memory in proportion to the number of requested cores.   They get the fraction of the node's total memory in proportion to the number of cores requested. If the job exceeds this amount of memory it will be killed.

When submitting a job to the RM-shared partition, you should specify:

  • the number of cores
  • the walltime limit

For information on requesting resources and submitting a job to the RM-shared partition see the section below on the interact or the sbatch commands.

 

RM-small partition

Jobs in the RM-small partition run on Bridges' RSM (128GB) nodes, but are limited to at most 2 full nodes and 8 hours.  Jobs can share nodes.  

Note that the memory space of  all the nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

When submitting a job to the RM-small partition, you should specify:

  • the number of nodes
  • the number of cores
  • the walltime limit

For information on requesting resources and submitting a job to the RM-small partition see the section below on the interact or the sbatch commands.

 

GPU partition

Jobs in the GPU partition use Bridges' GPU nodes.  Note that Bridges has 2 types of GPU nodes: K80s and P100s.  See the System Configuration section of this User Guide for the details of each type.

Jobs in the  GPU partition do not share nodes, so  jobs are allocated all the cores associated with the nodes assigned to them and all of the GPUs. Your job will be charged for all the cores associated with your assigned nodes.

However, the memory space across nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

When submitting a job to the GPU partition, you must specify the number of GPUs.

You should also specify:

  • the type of node you want, K80 or P100, with the --gres=gpu:type:n option to the interact or sbatch commands.  K80 is the default if no type is specified.  See the sbatch command options below for more details.
  • the number of nodes
  • the walltime limit 

For information on requesting resources and submitting a job to the GPU partition see the section below on the interact or the sbatch commands. 

 

GPU-shared partition

Jobs in the GPU-shared partition run on Bridges's GPU nodes.  Note that Bridges has 2 types of GPU nodes: K80s and P100s.  See the System Configuration section of this User Guide for the details of each type.

Jobs in the GPU-shared partition share nodes, but not cores. By sharing nodes your job will be charged less.  It could also start running sooner.

You will always run on (part of) one node in the GPU-shared partition.

Your jobs will be allocated memory in proportion to the number of requested GPUs. You get the fraction of the node's total memory in proportion to the fraction of GPUs you requested. If your job exceeds this amount of memory it will be killed.

When submitting a job to the GPU-shared partition, you must specify the number of GPUs.  

You should also specify:

  • the type of node you want, K80 or P100, with the --gres=gpu:type:n option to the interact or sbatch commands.  K80 is the default if no type is specified.  See the sbatch command options below for more details.
  • the walltime limit

For information on requesting resources and submitting a job to the GPU-shared partition see the section below on the interact or the sbatch commands.

 

GPU-small

Jobs in the GPU-small partition run on one of Bridges' P100 GPU nodes.  

Your jobs will be allocated memory in proportion to the number of requested GPUs. You get the fraction of the node's total memory in proportion to the fraction of GPUs you requested. If your job exceeds this amount of memory it will be killed.

When submitting a job to the GPU-small partition, you must specify the number of GPUs with the --gres=gpu:p100:n  option to the interact or sbatch command.  In this partition, n can be 1 or 2.  

You should also specify the walltime limit.

For information on requesting resources and submitting a job to the GPU-small partition see the section below on the interact or the sbatch commands.

 

LM partition

Jobs in the LM partition always share nodes. They never span nodes.

When submitting a job to the LM partition, you must specify

  • the amount of memory in GB  - any value up to 12000GB can be requested
  • the walltime limit  

The number of cores assigned to jobs in the LM partition is proportional to the amount of memory requested. For every 48 GB of memory you will be allocated 1 core.

SLURM will place jobs on either a 3TB or a 12TB node based on the memory request.  Jobs asking for 3000GB or less will run on a 3TB node.  If no 3TB nodes are available but a 12TB node is available, the job will run on a 12TB node.

For information on requesting resources and submitting a job to the LM partition see the section below on the interact or the sbatch commands.

 

 

File systems for running jobs

We recommend you use your pylon5 space for file manipulation during a job.  It is Bridges' fastest filesystem. See the section on File Spaces in the Bridges User Guide for more information.

Interactive sessions

You must  be allocated the use of one or more Bridges' compute nodes by SLURM to work interactively on Bridges.  You cannot use the Bridges login nodes for your work.

You can run an interactive session in any of the SLURM partitions.  You will need to specify which partition you want,  so that the proper resources are allocated for your use.

Resources are set aside for interactive use. If those resources are all in use, your request will wait until the resources you need are available. Using a shared partition (RM-shared, GPU-shared) will probably allow your job to start sooner.

The interact command

To start an interactive session, use the command interact.  The format is

interact -options

The simplest interact command is

 interact

This command will start an interactive job using the defaults for interact, which are:

Partition: RM-shared
Cores: 1
Time limit: 60 minutes

 

The simplest interact command to start a GPU job is

 interact -gpu

This command will start an interactive job on a P100 node in the GPU-shared partition with 1 GPU and for 60 minutes.

 

Once the interact command returns with a command prompt you can enter your commands. The shell will be your default shell. When you are finished with your job type CTRL-D.

Note:

  • Be sure to charge your job to the correct group if you have more than one grant. See the -A option below to change the charging group for an interact job. Information on how to determine your valid groups and change your default group is in the Account Administration section of the Bridges User Guide.
  • You are charged for your resource usage from the time the prompt appears until you type CTRL-D, so be sure to type CTRL-D as soon as you are done.   
  • The maximum time you can request is 8 hours. Inactive interact jobs are logged out after 30 minutes of idle time.

Options for interact 

If you want to run in a different partition, use more than one core or set a different time limit, you will need to use options to the interact command. 

The available options are:

  • -p partition
    Partition requested. Default: RM-shared

  • -t HH:MM:SS
    Walltime requested. The maximum time you can request is 8 hours. Default: 60:00 (1 hour)

  • -N n
    Number of nodes requested. Default: 1

  • --egress (note the "--" for this option)
    Allows your compute nodes to communicate with sites external to Bridges.

  • -A groupname
    Group to charge the job to. See the Account Administration section of this User Guide to find or change your default group. Note: Files created during a job will be owned by the group in effect when the job is submitted. This may be different than the group the job is charged to. See the discussion of the newgrp command in the Account Administration section of this User Guide to see how to change the group currently in effect. Default: your default group

  • -R reservation-name
    Reservation name, if you have one. Use of -R does not automatically set any other interact options. You still need to specify the other options (partition, walltime, number of nodes) to override the defaults for the interact command. If your reservation is not assigned to your default account, then you will need to use the -A option when you issue your interact command. No default.

  • --mem=nGB (note the "--" for this option)
    Amount of memory requested in GB. This option should only be used for the LM partition. No default.

  • --gres=gpu:type:n (note the "--" for this option)
    'type' is either p100 or k80; the default is k80. 'n' is the number of GPUs: valid choices are 1-4 when type=k80, and 1-2 when type=p100. No default.

  • -gpu
    Runs your job on 1 P100 GPU in the GPU-shared partition. No default.

  • --ntasks-per-node=n (note the "--" for this option)
    Number of cores to allocate per node. Default: 1

  • -h
    Help; lists all the available command options.

Sample interact commands

Run in the RM-shared partition using 4 cores 

interact --ntasks-per-node=4

Run in the LM partition and request 2TB of memory

interact -p LM --mem=2000GB

Run in the GPU-shared partition and ask for 2 P100 GPUs.

interact -p GPU-shared --gres=gpu:p100:2

If you want more complex control over your interactive job you can use the srun command instead of the interact command.

 See the srun man page.

 

Batch jobs

To run a batch job, you must first create a batch (or job) script, and then submit the script  using the sbatch command.  

A batch script is a file that consists of SBATCH directives, executable commands and comments.

SBATCH directives specify  your resource requests and other job options in your batch script.  You can also specify resource requests and options  on the sbatch command line.  Any options on the command line take precedence over those given in the batch script. The SBATCH directives must start in column 1 (that is, be the first text on a line, with no leading spaces) with '#SBATCH'.

Comments begin with a '#' character.

The first line of any batch script must indicate the shell to use for your batch job.  

 

Change your default shell

The change_shell command allows you to change your default shell.   This command is only available on the login nodes.

To see which shells are available, type

change_shell -l

To change your default shell, type

change_shell newshell

where newshell is one of the choices output by the change_shell -l command.   You must use the entire path output by change_shell -l, e.g. /usr/psc/shells/bash.  You must log out and back in again for the new shell to take effect.

Sample batch scripts

Some sample scripts are given here.  Note that:

Each script uses the bash shell, indicated by the first line '#!/bin/bash'.  Some Unix commands will differ if you use another shell.

For username and groupname you must substitute your username and your appropriate group.

 

Sample batch script for OpenMP job

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM
#SBATCH --ntasks-per-node 28
#SBATCH -t 5:00:00

# echo commands to stdout
set -x

# move to your appropriate pylon5 directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
cd /pylon5/groupname/username/path-to-directory

# run OpenMP program
export OMP_NUM_THREADS=28
./myopenmp

Notes:

        The --ntasks-per-node option indicates that you will use all 28 cores.

For groupname, username, and path-to-directory you must substitute your group, username, and appropriate directory path.

Sample batch script for MPI job

#!/bin/bash
#SBATCH -p RM
#SBATCH -t 5:00:00
#SBATCH -N 2
#SBATCH --ntasks-per-node 28
# echo commands to stdout
set -x

# move to your appropriate pylon5 directory
cd /pylon5/groupname/username/path-to-directory
#set variable so that task placement works as expected
export  I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

# copy input files to LOCAL file storage
srun -N $SLURM_NNODES --ntasks-per-node=1 \
  sh -c 'cp path-to-directory/input.${SLURM_PROCID} $LOCAL'

# run MPI program
mpirun -np $SLURM_NTASKS ./mympi

# copy output files to pylon5
srun -N $SLURM_NNODES --ntasks-per-node=1 \
  sh -c 'cp $LOCAL/output.* /pylon5/groupname/username/path-to-directory'

Notes:

The variable $SLURM_NTASKS gives the total number of cores requested in a job. In this example $SLURM_NTASKS will be 56 because  the -N option requested 2 nodes and the --ntasks-per-node option requested all 28 cores on each node.

The export command sets I_MPI_JOB_RESPECT_PROCESS_PLACEMENT so that your task placement settings are effective. Otherwise, the SLURM defaults are in effect.

The srun commands are used to copy files between pylon5 and the $LOCAL file systems on each of your nodes.

The first srun command assumes you have two files named input.0 and input.1 in your pylon5 file space. It will copy input.0 and input.1 to, respectively, the $LOCAL file systems on the first and second nodes allocated to your job.

The second srun command will copy files named output.* back from your $LOCAL file systems to your pylon5 file space before your job ends. In this command '*' functions as the usual Unix wildcard.

For groupname, username, and path-to-directory you must substitute your group, username, and appropriate directory path.

Sample batch script for hybrid OpenMP/MPI job

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=14
#SBATCH --time=00:10:00
#SBATCH --job-name=hybrid
cd $SLURM_SUBMIT_DIR

mpiifort -xHOST -O3 -qopenmp -mt_mpi hello_hybrid.f90 -o hello_hybrid.exe
# set variable so task placement works as expected
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
mpirun -print-rank-map -n $SLURM_NTASKS -genv \
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK -genv I_MPI_PIN_DOMAIN=omp \
./hello_hybrid.exe

Notes:

This example asks for 2 nodes, 4 MPI tasks and 14 OpenMP threads per MPI task.

The export command sets I_MPI_JOB_RESPECT_PROCESS_PLACEMENT so that your task placement settings are effective. Otherwise, the SLURM defaults are in effect.

Sample batch script for RM-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node 2

#echo commands to stdout
set -x

#move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
cd /pylon5/groupname/username/path-to-directory

# run OpenMP program
export OMP_NUM_THREADS=2
./myopenmp

Notes:

When using the RM-shared partition the number of nodes requested with the -N option must always be 1. The --ntasks-per-node option indicates how many cores you want.

For groupname, username, and path-to-directory you must substitute your group, username, and directory path.

Sample batch script for GPU partition

#!/bin/bash
#SBATCH -N 2
#SBATCH -p GPU
#SBATCH --ntasks-per-node 28
#SBATCH -t 5:00:00
#SBATCH --gres=gpu:p100:2

#echo commands to stdout
set -x

#move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
cd /pylon5/groupname/username/path-to-directory

# run GPU program
./mygpu

Notes:

The value of the --gres=gpu option indicates the type and number of GPUs you want.

For groupname, username and path-to-directory you must substitute your group, username and appropriate directory path.

Sample batch script for GPU-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH --ntasks-per-node 7
#SBATCH --gres=gpu:p100:1
#SBATCH -t 5:00:00

#echo commands to stdout
set -x

#move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
cd /pylon5/groupname/username/path-to-directory

#run GPU program
./mygpu

Notes:

The --gres=gpu option indicates the type and number of GPUs you want.

For groupname, username and path-to-directory you must substitute your group, username, and appropriate directory path.

Sample script for job array job

#!/bin/bash
#SBATCH -t 05:00:00
#SBATCH -p RM-shared
#SBATCH -N 1
#SBATCH --ntasks-per-node 5
#SBATCH --array=1-5

set -x

./myexecutable $SLURM_ARRAY_TASK_ID

Notes:

The above job will generate five jobs that will each run on a separate core on the same node. The value of the variable SLURM_ARRAY_TASK_ID is the array task index, which in this example ranges from 1 to 5. Good candidates for job arrays are jobs that can use this index alone to determine the processing path for each task. For more information about job arrays, see the sbatch man page and the online SLURM documentation.
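
One common pattern is to use the index to select an input file for each task. Below is a minimal sketch, assuming input files named input.1 through input.5 in the working directory; the file names and executable are illustrative.

#!/bin/bash
#SBATCH -t 05:00:00
#SBATCH -p RM-shared
#SBATCH -N 1
#SBATCH --ntasks-per-node 5
#SBATCH --array=1-5

# each array task reads its own input file and writes its own output file
./myexecutable input.${SLURM_ARRAY_TASK_ID} > output.${SLURM_ARRAY_TASK_ID}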

Sample batch script for bundling single-core jobs

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 05:00:00
#SBATCH --ntasks-per-node 14
 
echo SLURM NTASKS: $SLURM_NTASKS
i=0
while [ $i -lt $SLURM_NTASKS ]
do
numactl -C +$i ./run.sh &
let i=i+1
done
wait # IMPORTANT: wait for all to finish or get killed

Notes:

Bundling or packing multiple jobs in a single job can improve your turnaround and improve the performance of the SLURM scheduler.
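
The script run.sh stands for whatever serial work each bundled task should perform; it is not provided by Bridges. A minimal, hypothetical version might look like the following (myexecutable is illustrative).

#!/bin/bash
# run.sh - hypothetical serial workload for one bundled task
# give each invocation its own working directory so outputs do not collide
mkdir -p task_$$ && cd task_$$
../myexecutable > output.log 2>&1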

Sample batch script for bundling multi-core jobs

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 05:00:00
#SBATCH --ntasks-per-node 14
#SBATCH --cpus-per-task 2

echo SLURM NTASKS: $SLURM_NTASKS
i=0
while [ $i -lt $SLURM_NTASKS ]
do
numactl -C +$i ./run.sh &
let i=i+1
done
wait # IMPORTANT: wait for all to finish or get killed

Notes:

Bundling or packing multiple jobs in a single job can improve your turnaround and improve the performance of the SLURM scheduler.

 

The sbatch command

To submit a batch job,  use the sbatch command.  The format is

sbatch -options batch-script

The options to sbatch can either be in your batch script or on your sbatch command line.  Options in the command line override those in the batch script.
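
For example, if myscript.job contains the directive "#SBATCH -t 5:00:00", submitting it with the command below requests a 10-hour walltime for that run without editing the script:

sbatch -t 10:00:00 myscript.job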

Note:

  • Be sure to charge your job to the correct group if you have more than one grant. See the -A option below to change the group a batch job is charged to. Information on how to determine your valid groups and change your default group is in the Account administration section of the Bridges User Guide.
  • In some cases, the options for sbatch differ from the options for interact or srun.

 

Examples of the sbatch command 

RM partition

An example of a sbatch command to submit a job to the RM partition is

sbatch -p RM -t 5:00:00 -N 1 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

-N is the number of nodes requested

myscript.job is the name of your batch script

LM partition

Jobs submitted to the LM partition must request the amount of memory they need rather than the number of cores. Each core on the 3TB and 12TB nodes is associated with a fixed amount of memory, so the amount of memory you request determines the number of cores assigned to your job. The environment variable SLURM_NTASKS tells you the number of cores assigned to your job. Since there is no default memory value you must always include the --mem option for the LM partition.

A sample sbatch command for the LM partition is:

sbatch -p LM -t 10:00:00 --mem 2000GB myscript.job

where:

-p indicates the intended partition (LM)

-t is the walltime requested in the format HH:MM:SS

--mem is the amount of memory requested

myscript.job is the name of your batch script

Jobs in the LM partition share nodes, but they cannot span nodes. The memory space for an LM job is a single, integrated shared-memory space.

 

Useful sbatch options

For more information about these options and other useful sbatch options see the sbatch man page

Option          Description
-p partition Partition requested. Defaults to the RM partition.
-t HH:MM:SS Walltime requested in HH:MM:SS
-N n Number of nodes requested.
-A groupname

Group to charge the job to. If not specified, your default group is charged.  Find your default group

Note: Files created during a job will be owned by the group in effect when the job is submitted. This may be different than the group the job is charged to. See the discussion of the newgrp command in the Account Administration section of this User Guide to see how to change the group currently in effect.

--res reservation-name
Note the "--" for this option

Use the reservation that has been set up for you.  Use of --res does not automatically set any other options. You still need to specify the other options (partition, walltime, number of nodes) that you would in any sbatch command.  If your reservation is not assigned to your default account then you will need to use the -A option to sbatch to specify the account.

--mem=nGB
Note the "--" for this option

Memory in GB.  This option should only be used for the LM partition.
-C constraints

Specifies constraints which the nodes allocated to this job must satisfy.

An sbatch command can have only one -C option. Multiple constraints can be specified with "&". For example, -C LM&PH2 constrains the nodes to 3TB nodes with 20 cores and 38.5GB/core.   If multiple -C options are given (e.g., sbatch ..... -C LM -C EGRESS), only the last applies; the -C LM option would be ignored in this example.

Some valid constraints are:

EGRESS
Allows your compute nodes to communicate with sites external to Bridges
LM
Ensures that a job in the LM partition uses only the 3TB nodes. This option is required for any jobs in the LM partition which use /pylon5.
PH1
Ensures that the job will run on LM nodes which have 16 cores and 48GB/core
PH2
Ensures that the job will run on LM nodes which have 20 cores and 38.5GB/core
PERF
Turns on performance profiling. For use with performance profiling software like VTune, TAU

 See the discussion of the -C option in the sbatch man page for more information.

--gres=gpu:type:n
Note the "--" for this option

Specifies the type and number of GPUs requested. 'type' is either p100 or k80. The default is k80.

'n' is the number of requested GPUs. Valid choices are 1-4 when type is k80, and 1-2 when type is p100.

--ntasks-per-node=n
Note the "--" for this option

Request n cores be allocated per node. 

--mail-type=type
Note the "--" for this option

Send email when job events occur, where type can be BEGIN, END, FAIL or ALL.  

--mail-user=user
Note the "--" for this option

User to send email to, as specified by --mail-type. Default is the user who submits the job. 
-d=dependency-list Set up dependencies between jobs, where dependency-list can be:
after:job_id[:jobid...]
This job can begin execution after the specified jobs have begun execution.
afterany:job_id[:jobid...]
This job can begin execution after the specified jobs have terminated.
aftercorr:job_id[:jobid...]
A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).
afternotok:job_id[:jobid...]
This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
afterok:job_id[:jobid...]
This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
singleton
This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.
--no-requeue
Note the "--" for this option
Specifies that your job will not be requeued under any circumstances. If your job is running on a node that fails, it will not be restarted.

--time-min=HH:MM:SS
Note the "--" for this option.

Specifies a minimum walltime for your job in HH:MM:SS format.

SLURM considers the walltime requested when deciding which job to start next. Free slots on the machine are defined by the number of nodes and how long those nodes are free until they will be needed by another job. By specifying a minimum walltime you allow the scheduler to reduce your walltime request to your specified minimum time when deciding whether to schedule your job. This could allow your job to start sooner.

If you use this option your actual walltime assignment can vary between your minimum time and the time you specified with the -t option. If your job hits its actual walltime limit, it will be killed. When you use this option you should checkpoint your job frequently to save the results obtained to that point.

--switches=1
--switches=1@HH:MM:SS
Note the "--" for this option

Requests that the nodes your job runs on all be on one switch, which is a hardware grouping of 42 nodes.

If you are asking for more than 1 and fewer than 42 nodes, your job will run more efficiently if it runs on one switch.  Normally switches are shared across jobs, so using the switches option means your job may wait longer in the queue before it starts.

The optional time parameter gives a maximum time that your job will wait for a switch to be available. If it has waited this maximum time, the request for your job to be run on a switch will be cancelled.

-h Help, lists all the available command options
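
As an illustration of combining several of these options, the command below (with illustrative group, email address and script names) submits a two-node RM job, charges it to a specific group, and requests email when the job ends:

sbatch -p RM -t 12:00:00 -N 2 -A groupname --mail-type=END --mail-user=username@example.edu myscript.job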

 

Other SLURM commands

 

sinfo

The sinfo command displays information about the state of Bridges' nodes. The nodes can have several states:

alloc Allocated to a job
down Down
drain Not available for scheduling
idle Free
resv Reserved

 

squeue

The squeue command displays information about the jobs in the partitions. Some useful options are:

-j jobid Displays the information for the specified jobid
-u username Restricts information to jobs belonging to the specified username
-p partition Restricts information to the specified partition
-l (long) Displays information including:  time requested, time used, number of requested nodes, the nodes on which a job is running, job state and the reason why a job is waiting to run.

 See the man page for squeue for more options, for a discussion of the codes for job state and for why a job is waiting to run.
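
For example, to see detailed information about your own jobs in the RM partition, you could combine these options (substitute your username):

squeue -u username -p RM -l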

 

scancel

The scancel command is used to kill a job in a partition, whether it is running or still waiting to run.  Specify the jobid for the job you want to kill.  For example,

scancel 12345

kills job # 12345.

 

sacct

The sacct command can be used to display detailed information about jobs. It is especially useful in investigating why one of your jobs failed. The general format of the command is

    sacct -X -j XXXXXX -S MMDDYY --format parameter1,parameter2, ...

 

For 'XXXXXX' substitute the jobid of the job you are investigating. The date given for the -S option is the date at which sacct begins searching for information about your job.

The --format option determines what information to display about a job. Useful parameters are JobID, Partition, Account, ExitCode, State, Start, End, Elapsed, NodeList, NNodes, MaxRSS and AllocCPUs. The ExitCode and State parameters are especially useful in determining why a job failed. NNodes displays how many nodes your job used, while AllocCPUs displays how many cores your job used. MaxRSS displays how much memory your job used. The commas between the parameters in the --format option cannot be followed by spaces.
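
For example, the following command (with a hypothetical jobid and start date) displays a summary that is often enough to see why a job failed:

sacct -X -j 123456 -S 063020 --format=JobID,Partition,State,ExitCode,Elapsed,NNodes,AllocCPUs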

See the man page for sacct for more information about the sacct command.

 

Monitoring memory usage

It can be useful to find the memory usage of your jobs. For example, you may want to find out if memory usage was a reason a job failed or if you need to move your job from the 3-TB nodes to the 12-TB nodes.

There are two cases: you can determine a job's memory usage while it is still running or after it has finished.

If your job is still running, which you can determine with the squeue command, you can issue the command

srun --jobid=XXXXXX top -b -n 1 | grep userid

For 'XXXXXX' substitute the jobid of your job. For 'userid' substitute your userid. The RES field in the top output shows the actual amount of memory used by a process. The top man page can be used to identify the fields of top output.

The other method to use for a running job is to issue the command

sstat -j XXXXXX.batch --format=JobID,MaxRss

For 'XXXXXX' substitute your jobid.

If your job has finished there are two methods to use to find out its memory usage. If you are checking within a day or two after your job has finished you can issue the command

sacct -j XXXXXX --format=JobID,MaxRss

If this command no longer shows a value for MaxRss then you must use the command

job_info XXXXXX | grep max_rss

Again, substitute your jobid for 'XXXXXX' in both of these commands.

There are man pages for top, srun, sstat and sacct if you need more information.

 

More help

There are man pages for all the SLURM commands. SLURM also has extensive online documentation.

 

OnDemand

The OnDemand interface allows you to conduct your research on Bridges through a web browser.   You can manage files - create, edit and move them - submit and track jobs, see job output, check the status of the queues, run a Jupyter notebook through JupyterHub and more, without logging in to Bridges via traditional interfaces.

OnDemand was created by the Ohio Supercomputer Center (OSC).  This document provides an outline of how to use OnDemand on Bridges. For more help, check the extensive documentation for OnDemand created by OSC, including many video tutorials, or email bridges@psc.edu. 

This document covers these topics:

Start OnDemand

To connect to Bridges via OnDemand, point your browser to https://ondemand.bridges.psc.edu.

  • You will be prompted for a username and password.  Enter your PSC username and password.
  • The OnDemand Dashboard will open.  From this page, you can use the menus across the top of the page to manage files and submit jobs to Bridges.

To end your OnDemand session, choose Log Out at the top right of the Dashboard window and close your browser.

 

Manage files

To create, edit or move files, click on the Files menu from the Dashboard window.  A dropdown menu will appear, listing all your file spaces on Bridges: your home directory and the pylon directories for each of your Bridges' grants. 

Choosing one of the file spaces opens the File Explorer in a new browser tab.  The files in the selected directory are listed.  No matter which directory you are in, your home directory is displayed in a panel on the left.

There are two sets of buttons in the File Explorer.

Buttons on the top left, just below the name of the current directory, allow you to View, Edit, Rename, Download, Copy or Paste (after you have moved to a different directory) a file, or you can toggle the file selection with (Un)Select All.

Buttons on the top of the window on the right perform these functions:

Go To Navigate to another directory or file system
Open in Terminal Open a terminal window on Bridges in a new browser tab
New File Create a new empty file
New Dir Create a new subdirectory
Upload Copy a file from your local machine to Bridges
Show Dotfiles Toggle the display of dotfiles
Show Owner/Mode Toggle the display of owner and permission settings

 

Create and edit jobs

 You can create new job scripts and edit existing scripts, and submit those scripts to Bridges through OnDemand.

From the top menus in the Dashboard window, choose Jobs > Job Composer.  A Job Composer window will open.

There are two tabs at the top: Jobs and Templates.

In the Jobs tab, a listing of your jobs is given. 

Create a new job script

To create a new job script:

  1. Select a template to begin with
  2. Edit the job script
  3. Edit the job options 

Select a template

  1. Go to the Jobs tab in the Jobs Composer window. You have been given a default template, named Simple Sequential Job.
  2. To create a new job script,  click the blue New Job > From Default Template button in the upper left. You will see a green message at the top of the window, "Job was successfully created".

At the right of the Jobs window, you will see the Job Details, including the location of the script and the script name (by default, main_job.sh).  Under that, you will see the contents of the job script in a section titled Submit Script.

Edit the job script

Edit the job script so that it has the commands and workflow that you need.

If you do not want the default settings for a job, you must include options to change them in the job script.  For example, you may need more time or more than one node.  For the GPU partitions, you must specify the type and number of GPUs you want.  For the LM partition, you must specify how much memory you need.  Use an SBATCH directive in the job script to set these options.
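
For example, directives like the following could be added near the top of the script. Each line is an independent example with illustrative values; a real script would include only the directives relevant to its partition.

#SBATCH -t 10:00:00              # request 10 hours of walltime
#SBATCH -N 2                     # request 2 nodes
#SBATCH --gres=gpu:p100:2        # GPU partitions: type and number of GPUs
#SBATCH --mem=3000GB             # LM partition: amount of memory needed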

You can edit the script in several ways.

  • Click the blue Edit Files button at the top of the Jobs tab in the Jobs Composer window
  • In the Jobs tab in the Jobs Composer window, find the Submit Script section at the bottom right.  Click the blue Open Editor button.

After you save the file, the editor window remains open, but if you return to the Jobs Composer window, you will see that the content of  your script has changed.

Edit the job options

In the Jobs tab in the Jobs Composer window, click the blue Job Options button.  The options for the selected job such as name, the job script to run, and the account it will be charged to are displayed and can be edited.  Click Save or Cancel to return to the job listing.

 

Submit jobs to Bridges

Select a job in the Jobs tab in the Jobs Composer window. Click the green Submit button to submit the selected job.  A message at the top of the window shows whether the job submission was successful or not.  If it is not, you can edit the job script or options and resubmit.  When the job submits successfully, the status of the job in the Jobs Composer window will change to Queued or Running.  When  the job completes, the status will change to Completed.

 

JupyterHub and IJulia

You can run Jupyter notebooks, including IJulia notebooks, through OnDemand's JupyterHub.  You must do some setup before you can run IJulia; see the Julia document for information.

  1. Select Interactive Apps >> Jupyter Notebooks from the top menu in the Dashboard window. 
  2. In the screen that opens, specify the timelimit, number of nodes, and partition to use.  You can also designate the account to charge this usage to if you have more than one grant on Bridges.

    If you will use the LM or one of the GPU partitions, you must add a flag in the Extra Args field for the amount of memory or the number and type of GPUs you want:

    --mem=numberGB

    --gres=gpu:type:number 

    See the Running jobs section of this User Guide for more information on Bridges' partitions and the options available.

  3. Click the blue Launch button to start your JupyterHub session.  You may have to wait in the queue for resources to be available.
  4. When your session starts, click the blue Connect to Jupyter button.  The Dashboard window now displays information about your JupyterHub session including which node it is running on, when it began, and how much time remains.

    A new window running JupyterHub also opens.  Note the three tabs: Files, Running and Clusters.

    Files

    By default you are in the Files tab, and it displays the contents of your Bridges home directory.  You can navigate through your home directory tree.   

    Running

    Under the Running tab, you will see listed any notebooks or terminal sessions that you are currently running.

  5. Now you can start a Jupyter or IJulia notebook:
    1. To start a Jupyter notebook, in the Files tab, click on its name.  A new window running the notebook opens.
    2. To start IJulia, in the Files tab, click on the New button at the top right of the file listing. Choose IJulia from the drop down.

 

Errors

If you get an "Internal Server Error" when starting a JupyterHub session, you may be over your home directory quota. Check the Details section of the error for a line like:

#<ActionView::Template::Error: Disk quota exceeded @ dir_s_mkdir - /home/joeuser/ondemand/data/sys/dashboard/batch_connect/sys/jupyter_app...............

You can confirm that you are over quota by opening a Bridges shell access window and typing 

du -sh

This command shows the amount of storage in your home directory.  Home directory quotas are 10GB. If du -sh shows you are near 10GB, you should delete or move some files out of your home directory.  You can do this in OnDemand in the File Explorer window or in a shell access window.  

When you are under quota, you can try starting a JupyterHub session again.

 

Stopping your JupyterHub session

In the Dashboard window, click the red Delete button. 

 

RStudio

You can run RStudio through OnDemand. 

  1. Select Interactive Apps > RStudio Server from the top menu in the Dashboard window.
  2. In the screen that opens, specify the timelimit, number of nodes, and partition to use.  You can also designate the account to charge this usage to if you have more than one grant on Bridges.

    If you will use the LM or one of the GPU partitions, you must add a flag in the Extra Args field for the amount of memory or the number and type of GPUs you want:

    --mem=numberGB

    --gres=gpu:type:number 

    See the Running jobs section of this User Guide for more information on Bridges' partitions and the options available.

  3. Click the blue Launch button to start your RStudio session.  You may have to wait in the queue for resources to be available.
  4.  When your session starts, click the blue Connect to RStudio Server button.  A new window opens with the RStudio interface.  

Errors

If you exceed the timelimit you requested when setting up your RStudio session, you will see this error:

Error: Status code 503 returned

To continue using RStudio, go to Interactive Apps > RStudio from the top menu in the Dashboard window and start a new session.

 

Stopping your RStudio session

To end your RStudio session, either select File > Quit Session or click the red icon in the upper right of your RStudio window.  NOTE that this only closes your RStudio session; it does not close your interactive Bridges session. You are still being charged for the Bridges session.  If you like, you can start another RStudio session.

To end your interactive Bridges session so that you are no longer being charged, return to the Dashboard window and click the red Delete button. 

 

Shell access

You can get shell access to Bridges by choosing Clusters > >_Bridges Shell Access from the top menus in the Dashboard window.  In the window that opens, you are logged in to one of Bridges' login nodes as if you used ssh to connect to Bridges.  

 

Miscellaneous

Accessing Bridges documentation

In the Dashboard window, under the Help menu, choose Online Documentation to be taken to the Bridges User Guide.

 

Change your PSC password

In the Dashboard window, under the Help menu, choose Change HPC Password to be taken to the PSC password change utility.

 

Using Bridges' GPU nodes

The NVIDIA Tesla K80 and P100 GPUs on Bridges provide substantial, complementary computational power for deep learning, simulations and other applications.

A standard NVIDIA accelerator environment is installed on  Bridges' GPU nodes.  If you have programmed using GPUs before, you should find this familiar.   Please contact bridges@psc.edu for more help.

GPU Nodes

There are two types of GPU nodes on Bridges: 16 nodes with NVIDIA K80 GPUs and 32 nodes with NVIDIA P100 GPUs.  

K80 nodes:  The 16 K80 GPU nodes each contain 2 NVIDIA K80 GPU cards, and each card contains two GPUs that can be individually scheduled. Ideally, the GPUs are shared in a single application to ensure that the expected amount of on-board memory is available and that the GPUs are used to their maximum capacity. This makes the K80 GPU nodes optimal for applications that scale effectively to 2, 4 or more GPUs.  Some examples are GROMACS, NAMD and VASP.  Applications using a multiple of 4 K80 GPUs will maximize system throughput.

The nodes are HPE Apollo 2000s, each with 2 Intel Xeon E5-2695 v3 CPUs (14 cores per CPU) and 128GB RAM.
 Details

P100 nodes: The 32 P100 GPU nodes contain 2 NVIDIA P100 GPU cards, and each card holds one very powerful GPU, optimally suited for single-GPU applications that require maximum acceleration.  The most prominent example of this is deep learning training using frameworks that do not use multiple GPUs.

The nodes are HPE Apollo 2000s, each with 2 Intel Xeon E5-2683 v4 CPUs (16 cores per CPU) and 128GB RAM.
 Details

File Systems

The /home and pylon5 file systems are available on all of these nodes.  See the File Spaces section of the User Guide for more information on these file systems.

Compiling and Running jobs

Use the GPU partition, either in batch or interactively, to compile your code and run your jobs.  See the Running Jobs section of the User Guide for more information on Bridges' partitions and how to run jobs. 

 

CUDA

To use CUDA, first you must load the CUDA module.  To see all versions of CUDA that are available, type:

module avail cuda

Then choose the version that you need and load the module for it.

module load cuda

loads the default CUDA.   To load a different version, use the full module name.

module load cuda/8.0

 CUDA 8 codes should run on both types of Bridges' GPU nodes with no issues.  CUDA 7 should only be used on the  K80 GPUs (Phase 1).  Performance may suffer with CUDA 7 on the P100 nodes (Phase 2).
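
Once a CUDA module is loaded, the nvcc compiler driver is available. A minimal compile sketch, assuming your source file is named mycode.cu (the -arch flag targeting the P100s is optional):

module load cuda/8.0
nvcc -O2 -arch=sm_60 -o mycode mycode.cu    # sm_60 targets the P100 GPUs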

 

OpenACC

Our primary GPU programming environment is OpenACC.

The PGI compilers are available on all GPU nodes. To set up the appropriate environment for the PGI compilers, use the  module  command:

module load pgi

Read more about the module command at PSC.  

If you will be using these compilers often, it will be useful to add this command to your shell initialization script.

There are many options available with these compilers. See the online man pages (“man pgf90”,”man pgcc”,”man pgCC”) for detailed information.  You may find these basic OpenACC options a good place to start:

pgcc -acc yourcode.c
pgf90 -acc yourcode.f90

 

P100 node users  should add the “-ta=tesla,cuda8.0” option to the compile command, for example:

pgcc -acc -ta=tesla,cuda8.0 yourcode.c
 

Adding the “-Minfo=accel” flag to the compile command (whether pgf90, pgcc or pgCC) will provide useful feedback regarding compiler errors or success with your OpenACC commands.

pgf90 -acc -Minfo=accel yourcode.f90

Hybrid MPI/GPU Jobs

To run a hybrid MPI/GPU job use the following commands for compiling your program:

module load cuda
module load mpi/pgi_openmpi
mpicc -acc yourcode.c

When you execute your program you must first issue the above two module load commands.
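
For example, inside a batch job in the GPU partition you might run the resulting executable like this (a.out stands for whatever executable mpicc produced above):

module load cuda
module load mpi/pgi_openmpi
mpirun -np $SLURM_NTASKS ./a.out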

Profiling and Debugging

The environment variables PGI_ACC_TIME, PGI_ACC_NOTIFY and PGI_ACC_DEBUG can provide profiling and debugging information for your job. Specific commands depend on the shell you are using.
 Unix shells

Send email to bridges@psc.edu to request that additional CUDA-oriented debugging tools be installed.

 

Performance profiling

Enable runtime GPU performance profiling:
  Bash shell:  export PGI_ACC_TIME=1
  C shell:     setenv PGI_ACC_TIME 1

Debugging

Basic debugging (for data transfer information, set PGI_ACC_NOTIFY to 3):
  Bash shell:  export PGI_ACC_NOTIFY=1
  C shell:     setenv PGI_ACC_NOTIFY 1

More detailed debugging:
  Bash shell:  export PGI_ACC_DEBUG=1
  C shell:     setenv PGI_ACC_DEBUG 1

Hadoop and Spark

If you want to run Hadoop or Spark on Bridges, you should note that when you apply for your account.

File systems

/home

The /home file system, which contains your home directory, is available on all Bridges' Hadoop nodes.

HDFS

The Hadoop filesystem, HDFS, is available from all  Hadoop nodes. There is no explicit quota for the HDFS, but it uses your $SCRATCH disk space. Please delete any files you don't need when your job has ended. 

Files must reside in HDFS to be used in Hadoop jobs. Putting files into HDFS requires these steps:

  1. Transfer the files  to the namenode  with scp or sftp
  2. Format them for ingestion into HDFS
  3. Use the hadoop fs -put command to copy the files into HDFS.  This command distributes your data files across the cluster's datanodes. 

The hadoop fs command should be in your command path by default.

Documentation for the hadoop fs command lists other options. These options can be used to list your files in HDFS, delete HDFS files, copy files out of HDFS and other file operations.
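
For example (the file names and HDFS paths below are hypothetical):

hadoop fs -put mydata.txt /user/username/mydata.txt     # copy a local file into HDFS
hadoop fs -ls /user/username                            # list your files in HDFS
hadoop fs -get /user/username/results.txt .             # copy a file out of HDFS
hadoop fs -rm /user/username/mydata.txt                 # delete an HDFS file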

To request the installation of data ingestion tools on the Hadoop cluster send email to bridges@psc.edu.

Accessing the Hadoop/Spark cluster

 To start using Hadoop and Spark with Yarn and HDFS on Bridges, connect to the login node and issue the following commands:

interact -N 3 # you will need to wait until resources are allocated to you before continuing
module load hadoop
start-hadoop.sh

Your cluster will be set up and you'll be able to run hadoop and spark jobs. The cluster requires a minimum of three nodes (-N 3). Larger jobs may require a reservation.  Please contact bridges@psc.edu if you would like to use more than 8 nodes or run for longer than 8 hours.

Please note that when your job ends your HDFS will be unavailable so be sure to retrieve any data you need before your job finishes.

Web interfaces are currently not available for interactive jobs but can be made available for reservations.

 

Spark

The Spark data framework is available on Bridges. Spark, built on the HDFS filesystem,  extends the Hadoop MapReduce paradigm in several directions. It supports a wider variety of workflows than MapReduce. Most importantly, it allows you to process some or all of your data in memory if you choose. This enables very fast parallel processing of your data.

Python, Java and Scala are available for Spark applications. The pyspark interpreter is especially effective for interactive, exploratory tasks in Spark. To use Spark you must first load your data into Spark's highly efficient file structure called  Resilient Distributed Dataset (RDD).

Extensive online documentation is available at the  Spark web site. If you have questions about or encounter problems using Spark, send email to bridges@psc.edu.

Spark example using Yarn

Here is an example command to run a Spark job using yarn.  This example calculates pi using 10 iterations.

spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 10

To view the full output:

yarn logs -applicationId yarnapplicationId

where yarnapplicationId is the yarn applicationId assigned by the cluster.

 

A simple Hadoop example

This section demonstrates how to run a MapReduce Java program on the Hadoop cluster. This is the standard paradigm for Hadoop jobs. If you want to run jobs using another framework or in other languages besides Java send email to bridges@psc.edu for assistance.

Follow these steps to run a job on the Hadoop cluster. All the commands listed below should be in your command path by default. The variable HADOOP_HOME should be set for you also.

  1. Compile your Java MapReduce program with a command similar to:
    hadoop com.sun.tools.javac.Main WordCount WordCount.java

    where:

    • WordCount is the name of the output directory where you want your class file to be put
    • WordCount.java is the name of your source file
  2. Create a jar file out of your class file with a command similar to:
    jar -cvf WordCount.jar -C WordCount/ .

    where:

    • WordCount.jar is the name of your output jar file
    • WordCount is the name of the directory which contains your class file
  3. Launch your Hadoop job with the hadoop command

    Once you have your jar file you can run the  hadoop command to launch your Hadoop job. Your hadoop command will be similar to

    hadoop jar WordCount.jar org.myorg.WordCount /datasets/compleat.txt $MYOUTPUT

    where:

    • WordCount.jar is the name of your jar file
    • org.myorg.WordCount specifies the folder hierarchy inside your jar file. Substitute the appropriate hierarchy for your jar file.
    • /datasets/compleat.txt is the path to your input file in the HDFS file system. This file must already exist in HDFS.
    • $MYOUTPUT is the path to your output file, which will be saved in the HDFS file system. You must set this variable to the output file path before you issue the hadoop command.
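
    For example, you might set the output path (a hypothetical HDFS location) before launching the job:

    export MYOUTPUT=/user/username/wordcount-output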

After you issue the hadoop command, your job is controlled by the Hadoop scheduler to run on the datanodes. The scheduler is currently a strictly FIFO scheduler. If your job turnaround is not meeting your needs, send email to bridges@psc.edu.

When your job finishes, the hadoop command will end and you will be returned to the system prompt.

Other Hadoop technologies

An entire ecosystem of technologies has grown up around Hadoop, such as HBase and Hive.  To request the installation of a different package send email to bridges@psc.edu.

 

 

Containers

Containers are stand-alone packages holding the software needed to create a very specific computing environment.

 

Do I need a container?

If you need a very specialized computing environment, you can create a container as your work space on Bridges. Currently, Singularity is the only type of container supported on  Bridges. Docker is not supported.

However, in most cases, Bridges has all the software you will need.  Before creating a container for your work, check the extensive list of software that has been installed on Bridges.   While logged in to Bridges, you can also get a list of installed packages by typing

module avail

If you need a package that is not available on Bridges, you can request that it be installed by emailing bridges@psc.edu.  You can also install software packages in your own file spaces and, in some cases, we can provide assistance if you encounter difficulties.

 

How would I use a container on Bridges?

Singularity is the only container software supported on Bridges.  You can create a Singularity container, copy it to Bridges and then execute your container on Bridges, where it can use Bridges's compute nodes and filesystems. In your container you can use any software required by your application: a different version of CentOS,  a different Unix operating system, any software in any specific version needed. You can set up your Singularity container without any intervention from PSC staff.

A Singularity container is a single file. Thus, it can easily be copied to Bridges and also to other systems, so you can ensure that you are running a reproducible environment no matter where or when you compute. 
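
A minimal sketch of this workflow, assuming a container image named mycontainer.sif built on your own machine and a program myprogram inside it (the module name is an assumption; check module avail singularity on Bridges):

# on your local machine: copy the image to your pylon5 space on Bridges
scp mycontainer.sif username@bridges.psc.edu:/pylon5/groupname/username/

# on a Bridges compute node
module load singularity                          # assumed module name
singularity exec mycontainer.sif ./myprogram     # run a program inside the container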

See the PSC documentation on Singularity for more details on its usage on Bridges.

Virtual Machines 

A Virtual Machine (VM) is a portion of a physical machine that is partitioned off through software so that it acts as an independent physical machine. 

You should indicate that you want a VM when you apply for time on Bridges.  

When you have an active Bridges' grant, use the VM Request form to request a VM. This form requests information about the software and hardware resources you need for your VM and your reason for requesting a VM. Your request will be evaluated by PSC staff for its suitability. You will be contacted within one business day about your request.

 

Why use a VM?

If you need a persistent environment you need to use a virtual machine (VM). Examples of a need for a persistent environment are a Web server with a database backend or just a persistent database.

If you can use a Singularity container rather than a VM you should use the container. You can set up your Singularity container yourself without any intervention by PSC staff. Also, a VM is charged for usage the entire time the VM is set up, whether or not it is being actively used. Containers are only charged for the time during which they are executing, because they are not persistent computing environments. A Singularity environment only exists while you are executing it. Thus, VMs are much more expensive in terms of SUs used than Singularity containers.

A VM provides you with control over your environment, while still giving you access to the computing power, memory capacity and file spaces of Bridges.

Common uses of VMs include hosting database and web servers.  These servers can be restricted just to you or you can open them up to outside user communities to share your work. You can also connect your database and web servers and other  processing components in a complex workflow.

VMs provide several other benefits. Since the computing power behind the VM is a supercomputer, sufficient resources are available to support multiple users.  Since each VM acts like an independent machine, user security is heightened. No outside users can violate the security of your independent VM. However, you can allow other users to access your VM if you choose. 

A VM can be customized to meet your requirements.  PSC will set up the VM and give you access to your database and web server at a level that matches your requirements.

To discuss whether a VM would be appropriate for your research project send email to bridges@psc.edu.

Downtime

VMs are affected by system downtime, and will not be available during an outage.  Scheduled downtimes are announced in advance.

Data backups

It is your responsibility to back up any important data to another location outside of the VM. PSC will make infrequent snapshots of VMs for recovery from system failure, but cannot be responsible for managing your data.

Grant expiration

When your grant expires, your VM will be suspended.  You have a 3-month grace period to request via email to bridges@psc.edu that it be reactivated so that you can move data from the VM. Three months after your grant expires, the VM will be removed.  Please notify bridges@psc.edu if you need help moving your data during the grace period. 

 

 

 

Data Collections

A community dataset space allows Bridges users from different grants to share data in a common space.  Bridges hosts both public and private datasets, providing rapid access for individuals, collaborations and communities with appropriate protections.

Data collections are stored on pylon5, Bridges' persistent file system.  The space they use counts toward the Bridges storage allocation for the grant hosting them.

If you would like to host a data collection on Bridges, let us know what you need by completing the Community Dataset Request form. If your data collection has security or compliance requirements, please contact compliance@psc.edu.

Request a Community Dataset

 

Publicly available datasets

Some data collections are available to anyone with a Bridges' account.  They include:

Natural Language Toolkit (NLTK) Data

NLTK comes with many corpora, toy grammars, trained models, etc. A complete list of the available data is posted at: http://nltk.org/nltk_data/

Available on Bridges at /pylon5/datasets/community/nltk

 

MNIST

Dataset of handwritten digits used to train image processing systems.  

Available on Bridges at /pylon5/datasets/community/mnist

 

Genomics Data

Several genomics datasets are publicly available. 

BLAST
The BLAST databases can be accessed through the environment variable BLASTDB after loading the BLAST module.
RepBase
Repbase is the most commonly used database of repetitive DNA elements. You must register with RepBase at http://www.girinst.org and send proof of registration to genomics@psc.edu in order to use the Repbase database.
Other genomics datasets
Other available datasets are typically used with a particular genomics package.  These include: 
Barrnap /pylon5/datasets/community/genomics/barrnap 
CheckM /pylon5/datasets/community/genomics/checkm
Dammit /pylon5/datasets/community/genomics/dammit
Dammit uniref90 /pylon5/datasets/community/genomics/dammit_uniref90
Homer /pylon5/datasets/community/genomics/homer
Long Ranger /pylon5/datasets/community/genomics/longranger
MetaPhlAn2 /pylon5/datasets/community/genomics/metaphlan2
Prokka /pylon5/datasets/community/genomics/prokka

Gateways

Bridges hosts a number of gateways - web-based, domain-specific user interfaces to applications, functionality and resources that allow users to focus on their research rather than programming and submitting jobs.  Gateways  provide intuitive, easy-to-use interfaces to complex functionality and data-intensive workflows.

Gateways can manage large numbers of jobs and provide collaborative features, security constraints and provenance tracking, so that you can concentrate on your analyses instead of on the mechanics of accomplishing them.

 

Among the gateways implemented on Bridges are:

Galaxy, an open source, web-based platform for data intensive biomedical research.  

Researchers preparing de novo transcriptome assemblies via the popular Galaxy platform for data-intensive analysis have transparent access to Bridges, without the need to obtain their own XSEDE allocation. Bridges is ideal for rapid assembly of massive RNA sequence data.

A high-performance Trinity tool has been installed on the public Galaxy Main instance at usegalaxy.org. All Trinity jobs in workflows run from usegalaxy.org will execute transparently on large memory nodes on Bridges. These tools are free to use for open scientific research.

Trinity jobs can be run on Bridges by going to https://usegalaxy.org.

For more general information on Galaxy, see https://galaxyproject.org/.

 

SEAGrid,  the Science and Engineering Applications Grid, provides access for researchers to scientific applications across a wide variety of computing resources.  SEAGrid also helps with creating input data, producing visualizations and archiving simulation data.  

For more information on SEAGrid, see https://seagrid.org/home.

 

The Causal Web Portal, from the Center for Causal Discovery, offers easy-to-use software for causal discovery from large and complex biomedical datasets, applying Bayesian and constraint-based algorithms. It includes a web application as well as APIs and a command line version.

For more information about the Causal Web Portal, see http://www.ccd.pitt.edu/tools/

To access the Causal Web Portal on Bridges, see https://ccd2.vm.bridges.psc.edu/ccd/login

Reporting a Problem

To report a problem on Bridges, please email bridges@psc.edu.  Please report only one problem per email; it will help us to track and solve any issues more quickly and efficiently.

Be sure to include

  • the JobID
  • the error message you received
  • the date and time the job ran
  • any other pertinent information 
  • a screen shot of the error or the output file showing the error, if possible

Acknowledgement in Publications

All publications, copyrighted or not, resulting from an allocation of computing time on Bridges should include an acknowledgement. Please acknowledge both the funding source that supported your access to PSC and the specific PSC resources that you used.

Please also acknowledge support provided by XSEDE's ECSS program and/or PSC staff when appropriate.

Proper acknowledgment is critical for our ability to solicit continued funding to support these projects and next generation hardware.

For suggested text and citations, see:

XSEDE supported research on Bridges

We ask that you use the following text:

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

Please include these citations:

Towns, J., Cockerill, T., Dahan, M., Foster, I., Gaither, K., Grimshaw, A., Hazlewood, V., Lathrop, S., Lifka, D., Peterson, G.D., Roskies, R., Scott, J.R. and Wilkens-Diehr, N. 2014. XSEDE: Accelerating Scientific Discovery. Computing in Science & Engineering. 16(5):62-74. http://doi.ieeecomputersociety.org/10.1109/MCSE.2014.80.

Nystrom, N. A., Levine, M. J., Roskies, R. Z., and Scott, J. R. 2015. Bridges: A Uniquely Flexible HPC Resource for New Communities and Data Analytics. In Proceedings of the 2015 Annual Conference on Extreme Science and Engineering Discovery Environment (St. Louis, MO, July 26-30, 2015). XSEDE15. ACM, New York, NY, USA. http://dx.doi.org/10.1145/2792745.2792775.

   

BibTeX:

@inproceedings{Nystrom:2015:BUF:2792745.2792775, 
author = {Nystrom, Nicholas A. and Levine, Michael J. and Roskies, Ralph Z. and Scott, J. Ray}, 
title = {Bridges: A Uniquely Flexible HPC Resource for New Communities and Data Analytics}, 
booktitle = {Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure}, 
series = {XSEDE '15}, 
year = {2015}, 
isbn = {978-1-4503-3720-5}, 
location = {St. Louis, Missouri}, 
pages = {30:1--30:8}, 
articleno = {30}, 
numpages = {8}, 
url = {http://doi.acm.org/10.1145/2792745.2792775}, 
doi = {10.1145/2792745.2792775}, 
acmid = {2792775}, 
publisher = {ACM}, 
address = {New York, NY, USA}, 
keywords = {GPU, HPC, Hadoop, applications, architecture, big data, data analytics, database, research, usability},
 }
EndNote:

%0 Conference Paper
%1 2792775
%A Nicholas A. Nystrom
%A Michael J. Levine
%A Ralph Z. Roskies
%A J. Ray Scott 
%T Bridges: a uniquely flexible HPC resource for new communities and data analytics
%B Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
%@ 978-1-4503-3720-5
%C St. Louis, Missouri
%P 1-8
%D 2015
%R 10.1145/2792745.2792775
%I ACM 

Additional Support

Please also acknowledge support provided through XSEDE's Extended Collaborative Support Services (ECSS) and/or by PSC staff.

 

Other research on Bridges

For research on Bridges supported by programs other than XSEDE, such as PRCI, we ask that you use the following text:

This work used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

Please include this citation:

Nystrom, N. A., Levine, M. J., Roskies, R. Z., and Scott, J. R. 2015. Bridges: A Uniquely Flexible HPC Resource for New Communities and Data Analytics. In Proceedings of the 2015 Annual Conference on Extreme Science and Engineering Discovery Environment (St. Louis, MO, July 26-30, 2015). XSEDE15. ACM, New York, NY, USA. http://dx.doi.org/10.1145/2792745.2792775.

   

BibTeX:

@inproceedings{Nystrom:2015:BUF:2792745.2792775, 
author = {Nystrom, Nicholas A. and Levine, Michael J. and Roskies, Ralph Z. and Scott, J. Ray}, 
title = {Bridges: A Uniquely Flexible HPC Resource for New Communities and Data Analytics}, 
booktitle = {Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure}, 
series = {XSEDE '15}, 
year = {2015}, 
isbn = {978-1-4503-3720-5}, 
location = {St. Louis, Missouri}, 
pages = {30:1--30:8}, 
articleno = {30}, 
numpages = {8}, 
url = {http://doi.acm.org/10.1145/2792745.2792775}, 
doi = {10.1145/2792745.2792775}, 
acmid = {2792775}, 
publisher = {ACM}, 
address = {New York, NY, USA}, 
keywords = {GPU, HPC, Hadoop, applications, architecture, big data, data analytics, database, research, usability},
 }
EndNote:

%0 Conference Paper
%1 2792775
%A Nicholas A. Nystrom
%A Michael J. Levine
%A Ralph Z. Roskies
%A J. Ray Scott 
%T Bridges: a uniquely flexible HPC resource for new communities and data analytics
%B Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
%@ 978-1-4503-3720-5
%C St. Louis, Missouri
%P 1-8
%D 2015
%R 10.1145/2792745.2792775
%I ACM 

Additional support

Please also acknowledge any support provided by PSC staff.

ECSS Support

To acknowledge support provided through XSEDE's Extended Collaborative Support Services (ECSS), please use this text:

We thank [consultant name(s)] for [his/her/their] assistance with [describe tasks such as porting, optimization, visualization, etc.], which was made possible through the XSEDE Extended Collaborative Support Service (ECSS) program.

Please include this citation:

Wilkins-Diehr, N and S Sanielevici, J Alameda, J Cazes, L Crosby, M Pierce, R Roskies. High Performance Computer Applications 6th International Conference, ISUM 2015, Mexico City, Mexico, March 9-13, 2015, Revised Selected Papers Gitler, Isidoro, Klapp, Jaime (Eds.) Springer International Publishing. ISBN 978-3-319-32243-8, 3-13, 2016. 10.1007/978-3-319-32243-8.

PSC Support

If PSC staff contributed substantially to software development, optimization, or other aspects of the research, they should be considered as coauthors.

When PSC staff contributions do not warrant coauthorship, please acknowledge their support with the following text:

We thank [consultant name(s)] for [his/her/their] assistance with [describe tasks such as porting, optimization, visualization, etc.]