Bridges-2 User Guide

If you are migrating from Bridges, please see the Bridges to Bridges-2 Migration Guide for tips on moving your work to Bridges-2.

 

We take security very seriously. Please take a minute now to read PSC policies on passwords, security guidelines, resource use, and privacy. You are expected to comply with these policies at all times when using PSC systems. If you have questions at any time, you can send email to help@psc.edu.

Are you new to HPC?

If you are new to high performance computing, please read Getting Started with HPC before you begin your research on Bridges-2. It explains HPC concepts which may be unfamiliar. You can also check the Introduction to Unix or the Glossary for quick definitions of terms that may be new to you.

We hope that that information along with the Bridges-2 User Guide will have you diving into your work on Bridges-2. But if you have any questions, don’t hesitate to email us for help at help@psc.edu.

Questions?

PSC User Services is here to help you get your research started and keep it on track. If you have questions at any time, you can send email to help@psc.edu.

Set your PSC password

Before you can connect to Bridges-2, you must have a PSC password.

If you have an active account on any other PSC system

PSC usernames and passwords are the same across all PSC systems. You will use the same username and password on Bridges-2 as for your other PSC account.

If you do not have an active account on any other PSC system:

You must create a PSC password. Go to the web-based PSC password change utility at apr.psc.edu to set your PSC password.

PSC password policies

Computer security depends heavily on maintaining secrecy of passwords.

PSC uses Kerberos authentication on all its production systems, and your PSC password (also known as your Kerberos password) is the same on all PSC machines.

Set your initial PSC password

When you receive a PSC account, go to the web-based PSC password change utility to set your password.  For security, you should use a unique password for your PSC account, not one that you use for other sites.

Change your PSC password

Changing your password changes it on all PSC systems. To change your Kerberos password, use the web-based PSC password change utility.

Please note that changing your password on the XSEDE Portal does not change it on PSC systems and will not prevent your PSC password from expiring.

PSC password requirements

Your password must:

  • be at least eight characters long
  • contain characters from at least three of the following groups:
    • lower-case letters
    • upper-case letters
    • digits
    • special characters, excluding apostrophes (‘) and quotes (“)
  • be different from the last three PSC passwords you have used
  • be changed at least once per year

Password safety

Under NO circumstances does PSC reveal any passwords over the telephone, FAX them to any location, send them through email, set them to a requested string, or perform any other action that could reveal a password.

If someone claiming to represent PSC contacts you and requests information that in any manner would reveal a password, be assured that the request is invalid and do NOT comply.

 

System Configuration

Bridges-2 is designed for converged HPC + AI + Data. Its custom topology is optimized for data-centric HPC, AI, and HPDA (High Performance Data Analytics). An extremely flexible software environment along with community data collections and BDaaS (Big Data as a Service) provide the tools necessary for modern pioneering research. The data management system, Ocean, consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.

Compute nodes

Bridges-2 has three types of compute nodes: “Regular Memory”, “Extreme Memory”, and GPU.

Regular Memory nodes

Regular Memory (RM) nodes provide extremely powerful general-purpose computing, pre- and post-processing, AI inferencing, and machine learning and data analytics. Most RM nodes contain 256GB of RAM, but 16 of them have 512GB.

RM nodes              256GB nodes                               512GB nodes
Number                488                                       16
CPU                   2 AMD EPYC 7742 CPUs;                     2 AMD EPYC 7742 CPUs;
                      64 cores per CPU, 128 cores per node;     64 cores per CPU, 128 cores per node;
                      2.25-3.40 GHz                             2.25-3.40 GHz
RAM                   256GB                                     512GB
Cache                 256MB L3, 8 memory channels               256MB L3, 8 memory channels
Node-local storage    3.84TB NVMe SSD                           3.84TB NVMe SSD
Network               Mellanox ConnectX-6 HDR InfiniBand        Mellanox ConnectX-6 HDR InfiniBand
                      200Gb/s adapter                           200Gb/s adapter

 

Extreme memory nodes

Extreme Memory (EM) nodes provide 4TB of shared memory for statistics, graph analytics, genome sequence assembly, and other applications requiring a large amount of memory for which distributed-memory implementations are not available.

EM nodes
Number 4
CPU 4 Intel Xeon Platinum 8260M “Cascade Lake” CPUs
24 cores per CPU, 96 cores per node
2.40-3.90 GHz
RAM 4TB, DDR4-2933
Cache 37.75MB LLC, 6 memory channels
Node-local storage 7.68TB NVMe SSD
Network Mellanox ConnectX-6-HDR Infiniband 200Gb/s Adapter

 

GPU nodes

Bridges-2’s GPU nodes provide exceptional performance and scalability for deep learning and accelerated computing, with 40,960 CUDA cores and 5,120 tensor cores per node.  Recently, Bridges’ GPU-AI resources have been migrated to Bridges-2, adding nine more V100 GPU nodes.

GPU nodes             V100-32GB nodes                           V100-16GB nodes
Number                24                                        9
GPUs per node         8 NVIDIA Tesla V100-32GB SXM2             8 NVIDIA V100-16GB
GPU performance       1 Pf/s tensor
CPUs                  2 Intel Xeon Gold 6248                    2 Intel Xeon Gold 6148 CPUs;
                      “Cascade Lake” CPUs;                      20 cores per CPU, 40 cores per node;
                      20 cores per CPU, 40 cores per node;      2.4-3.7 GHz
                      2.50-3.90 GHz
RAM                   512GB, DDR4-2933                          192GB, DDR4-2666
Interconnect          NVLink                                    PCIe
Cache                 27.5MB LLC, 6 memory channels
Node-local storage    7.68TB NVMe SSD                           4 NVMe SSDs, 2TB each (total 8TB)
Network               2 Mellanox ConnectX-6 HDR InfiniBand 200Gb/s adapters

 

Data Management

Data management on Bridges-2 is accomplished through a unified, high-performance filesystem for active project data, archive, and resilience, named Ocean.

Ocean consists of two tiers, disk and tape, transparently managed as a single, highly usable namespace.

Ocean’s disk subsystem, for active project data, is a high-performance, internally resilient Lustre parallel filesystem with 15PB of usable capacity, configured to deliver up to 129GB/s and 142GB/s of read and write bandwidth, respectively.

Ocean’s tape subsystem, for archive and additional resilience, is a high-performance tape library with 7.2PB of uncompressed capacity, configured to deliver 50TB/hour. Data compression occurs in hardware, transparently, with no performance overhead.

Connecting to Bridges-2

Bridges-2 contains two broad categories of nodes: compute nodes, which handle the production research computing, and login nodes, which are used for managing files, submitting batch jobs and launching interactive sessions. Login nodes are not suited for production computing.

When you connect to Bridges-2, you are connecting to a Bridges-2 login node. You can connect to Bridges-2 via a web browser or through a command line interface.

See the Running Jobs section of this User Guide for information on production computing on Bridges-2.

Connect in a web browser

You can access Bridges-2 through a web browser by using the OnDemand software. You will still need to understand Bridges-2’s partition structure and the options which specify job limits, like time and memory use, but OnDemand provides a more modern, graphical interface to Bridges-2.

See the OnDemand section for more information.

Connect to a command line interface

You can connect to a traditional command line interface by logging in via one of these:

  • ssh, using either XSEDE or PSC credentials. If you are registered with XSEDE for DUO Multi-Factor Authentication (MFA), you can use this security feature in connecting to Bridges-2. See the XSEDE instructions to set up DUO for MFA.
  • XSEDE Single Sign On, including using Multi-Factor authentication if you are an XSEDE user

SSH

You can use an ssh client from your local machine to connect to Bridges-2 using either your PSC or XSEDE credentials.

SSH is a program that enables secure logins over an insecure network. It encrypts the data passing both ways so that if it is intercepted it cannot be read.

SSH is client-server software, which means that both the user’s local computer and the remote computer must have it installed. SSH server software is installed on all the PSC machines. You must install SSH client software on your local machine.

Free ssh clients for  Macs, Windows machines and many versions of Unix are available. Popular ssh clients (GUI) include PuTTY for Windows and Cyberduck for Macs. A command line version of ssh is installed on Macs  by default; if you prefer that, you can use it in the Terminal application. You can also check with your university to see if there is an ssh client that they recommend.

Once you have an ssh client installed, you can use either your PSC credentials or XSEDE credentials (optionally with DUO MFA) to connect to Bridges-2. Note that you must have created your PSC password before you can use ssh to connect to Bridges-2.

Use ssh to connect to Bridges-2 using XSEDE credentials and (optionally) DUO MFA:
  1. Using your ssh client, use your XSEDE credentials and connect to hostname bridges2.psc.edu  using port 2222.
    ssh -p 2222 xsede-username@bridges2.psc.edu
  2. Enter your XSEDE password when prompted.
  3. (Optional) If you are registered with XSEDE DUO, you will receive a prompt on your device.  Once you have approved it, you will be logged in.
Use ssh to connect to Bridges-2 using PSC credentials:
  1. Using your ssh client, connect to hostname bridges2.psc.xsede.org  or bridges2.psc.edu  using the default port (22).
    Either hostname will connect you to Bridges-2. You do not have to specify the port.
  2. Enter your PSC username and password when prompted.
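For example, a minimal command line, where psc-username is a placeholder for your PSC username:

    ssh psc-username@bridges2.psc.edu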

Read more about using SSH to connect to PSC systems

Public-private keys

You can also use public-private key pairs to connect to Bridges-2. To do so, you must first fill out this form to register your keys with PSC.

XSEDE single sign on

XSEDE users can use their XSEDE usernames and passwords in the XSEDE User Portal Single Sign On Login Hub (SSO Hub) to access bridges2.psc.xsede.org  or bridges2.psc.edu.

You must use DUO Multi-Factor Authentication in the SSO Hub.

See the XSEDE instructions to set up DUO for Multi-Factor Authentication.

Account Administration

 

Changing your PSC password

There are two ways to change or reset your PSC password: the web-based PSC password change utility, or the kpasswd command on any PSC system.

When you change your PSC password, whether you do it via the online utility or via the kpasswd command on one PSC system, you change it on all PSC systems.

See PSC password policies.

Remember that your PSC password is separate from your XSEDE Portal password. Resetting one password does not change the other password.

 

The projects command

The projects command will help you monitor your allocation on Bridges-2. You can determine what Bridges-2 resources you have been allocated, your remaining balance, your account id (used to track usage), and more. The output below shows that this user has an allocation on Bridges-2 Regular Memory and Bridges-2 GPU resources for computing and Bridges-2 Ocean for file storage.

[userid@login018 ~]$ projects
Your default charging project charge id  is abcd1234.  If you would like to change
the default charging project  use the command change_primary_group ~charge_id~. 
Use the charge id listed below for the project you would like to make the default 
in place of ~charge_id~

Project: ABCD1234
     PI: Cy Entist
  Title: World Renowned Research

      Resource: Bridges 2 GPU 
    Allocation: 10000.00
       Balance: 8872.00
      End Date: 2021-07-15
  Award Active: Yes
   User Active: Yes
     Charge ID: abcd1234
   Directories:
       HOME /jet/home/userid 
      Resource: Bridges 2 Regular Memory
    Allocation: 23000.00
       Balance: 197450.00
      End Date: 2021-07-15
  Award Active: Yes
   User Active: Yes
     Charge ID: abcd1234
   Directories:
       HOME /jet/home/userid

      Resource: Bridges 2 Ocean Storage
    Allocation: 200000.00
       Balance: 90735.36
      End Date: 2021-07-15
  Award Active: Yes
   User Active: Yes
     Charge ID: abcd1234
   Directories:
       HOME /jet/home/userid
       STORAGE /ocean/projects/abcd1234
       STORAGE /ocean/projects/abcd1234/userid
       STORAGE /ocean/projects/abcd1234/shared

Accounting for Bridges-2 use

Accounting for Bridges-2 use varies with the type of node used, which is determined by the type of allocation you have: “Bridges-2 Regular Memory”, for Bridges-2’s RM (256GB and 512GB) nodes; “Bridges-2 Extreme Memory”, for Bridges-2’s 4TB nodes; and “Bridges-2 GPU” and “Bridges-2 GPU-AI”, for Bridges-2’s V100 GPU nodes.

For all allocations and all node types, usage is defined in terms of “Service Units” or SUs.  The definition of an SU varies with the type of node being used.

Bridges-2 Regular Memory

The  RM nodes are allocated as “Bridges-2 Regular Memory”.  This does not include Bridges-2’s GPU nodes.  Each RM node has 128 cores, each of which can be allocated separately. Service Units (SUs) are defined in terms of “core-hours”: the use of one core for 1 hour.

1 core-hour = 1 SU

Because the RM nodes each hold 128 cores, if you use one entire RM node for one hour, 128 SUs will be deducted from your allocation.

128 cores x 1 hour = 128 core-hours = 128 SUs

If you don’t need all 128 cores, you can use just part of an RM node by submitting to the RM-shared partition. See more about the partitions on Bridges-2 below.

Using the RM-shared partition, if you use 2 cores on a node for 30 minutes, 1 SU will be deducted from your allocation.

2 cores x 0.5 hours = 1 core-hour = 1 SU
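For example, a minimal sbatch request matching this scenario (2 cores on one node in RM-shared for 30 minutes), where myscript.job is a placeholder for your batch script:

sbatch -p RM-shared --ntasks-per-node=2 -t 30:00 myscript.job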

Bridges-2 Extreme Memory

The 4TB nodes on Bridges-2 are allocated as “Bridges-2 Extreme Memory”.  Accounting is done by the cores requested for the job. Service Units (SUs) are defined in terms of “core-hours”: the use of 1 core for one hour.

1 core-hour = 1 SU

If your job requests one node (96 cores) and runs for 1 hour, 96 SUs will be deducted from your allocation.

1 node x 96 cores/node x 1 hour = 96 core-hours = 96 SUs

If your job requests 3 nodes and runs for 6 hours, 1728 SUs will be deducted from your allocation.

3 nodes x 96 cores/node x 6 hours = 1728 core-hours = 1728 SUs

Bridges-2 GPU

Bridges-2 Service Units (SUs) for GPU nodes are defined in terms of “gpu-hours”: the use of one GPU Unit for one hour.

These nodes hold 8 GPU units each, each of which can be allocated separately.  Service Units (SUs) are defined in terms of GPU-hours.

1 GPU-hour = 1 SU

If you use an entire V100 GPU node for one hour, 8 SUs will be deducted from your allocation.

8 GPU units/node x 1 node x 1 hour = 8 gpu-hours = 8 SUs

If you don’t need all 8 GPUs, you can use just part of a GPU node by submitting to the GPU-shared partition. See more about the partitions on Bridges-2 below.

If you use the GPU-shared partition and use 4 GPU units for 48 hours, 192 SUs will be deducted from your allocation.

4 GPU units x 48 hours = 192 gpu-hours = 192 SUs
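For example, a minimal sbatch request matching this scenario (4 GPUs in GPU-shared for 48 hours), where myscript.job is a placeholder for your batch script and v100-32 is one of the valid GPU types described later in this Guide:

sbatch -p GPU-shared --gres=gpu:v100-32:4 -t 48:00:00 myscript.job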

Accounting for file space

Every Bridges-2 grant has a storage allocation associated with it on the Bridges-2 file system, Ocean.  There are no SUs deducted from your allocation for the space you use, but if you exceed your storage quota, you will not be able to submit jobs to Bridges-2.

Each grant has a Unix group associated with it. Every file is “owned” by a Unix group, and that file ownership determines which grant is charged for the file space.  See “Managing multiple grants” for a further explanation of Unix groups, and how to manage file ownership if you have more than one grant.
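For example, you can check which Unix group owns a file with the standard Unix ls -l command, and change it with chgrp; here mydata.csv and abcd1234 are placeholders for one of your files and one of your Unix groups:

ls -l mydata.csv
# the fourth field of the output is the owning Unix group
chgrp abcd1234 mydata.csv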

You can check your Ocean usage with the projects command.

[userid@bridges2-login ~]$ projects
....
      Resource: Bridges 2 Ocean Storage
    Allocation: 30000.00
       Balance: 18945.36
      End Date: 2021-07-15
  Award Active: Yes
   User Active: Yes
     Charge ID: abcd1234
   Directories:
       HOME /jet/home/userid
       STORAGE /ocean/projects/instalrs
       STORAGE /ocean/projects/instalrs/userid
       STORAGE /ocean/projects/instalrs/shared
....

Managing multiple grants

If you have multiple grants on Bridges-2, you should ensure that the work you do for each grant is assigned correctly to that grant. The files created under or associated with that grant should belong to it, to make them easier to find and use by others on the same grant.

There are two ids associated with each grant for these purposes: a SLURM account id and a Unix group id.

SLURM account ids determine which grant your Bridges-2 (computational) use is deducted from.

Unix group ids determine which Ocean allocation the storage space for files is deducted from, and who owns and can access files or directories.

For a given grant, the SLURM account id and the Unix group id are identical strings.

One of your grants has been designated as your default grant, and the account id and Unix group id associated with that grant are your default account id and default Unix group id.

When a Bridges-2 job runs, any SUs it uses are deducted from the default grant.  Any files created by that job are owned by the default Unix group.

Find your default account id and Unix group

To find your SLURM account ids, use the projects command.  It will display all the grants you belong to.  It will also list your default SLURM account id (called charge id in the projects output) at the top. Your default Unix group id is an identical string.

In this example, the user has two grants with SLURM account ids account-1 and account-2.  The default account id is account-2.

[userid@bridges2-login ~]$ projects 
Your default charging project charge id  is account-2.  If you would like to change the default charging project use the 
command change_primary_group ~charge_id~. Use the charge id listed below for the project you would like to make 
the default in place of ~charge_id~  
Project: AAA000000A      
     PI: My Principal Investigator
  Title: Important Research

       Resource: BRIDGES GPU
     Allocation: 37,830.00
        Balance: 17,457.19
       End Date: 2030-07-15
   Award Active: Yes
    User Active: Yes
      Charge ID: account-1
    Directories:
        HOME  /jet/home/userid
. . .

 Project: AAA111111A
      PI: My Other PI
   Title: Another Important Research Project 
       Resource: BRIDGES REGULAR MEMORY
     Allocation: 57,500.00 
        Balance: 12,474.99
       End Date: 2019-06-15
   Award Active: Yes
    User Active: Yes 
      Charge ID: account-2
    *** Default charging project ***
    Directories:
        HOME /jet/home/userid
........

Use a secondary (non-default) grant

To  use a grant other than your default grant on Bridges-2, you must specify the appropriate account id  with the -A option to the SLURM sbatch command.   See the Running Jobs section of this Guide for more information on batch jobs, interactive sessions and SLURM.

NOTE that using the -A option does not change your default Unix group. Any files created during a job are owned by your default Unix group, no matter which account id is used for the job, and the space they use will be deducted from the Ocean allocation for the default Unix group.
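For example, a minimal batch submission charged to a secondary grant, where other-account is a placeholder for that grant's account id and myscript.job is a placeholder batch script:

sbatch -A other-account -p RM -N 1 -t 1:00:00 myscript.job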

Change your Unix group for a login session

To temporarily change your Unix group, use the newgrp command. Any files created subsequently during this login session will be owned by the new group you have specified.  Their storage will be deducted from the Ocean allocation of the new group. After logging out of the session, your default Unix group will be in effect again.

newgrp unix-group

NOTE that the newgrp command has no effect on the account id in effect.  Any Bridges-2 usage will be deducted from the default account id or the one specified with the -A option to sbatch.
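For example, where abcd1234 is a placeholder for one of your Unix groups:

newgrp abcd1234
# verify the group now in effect
id -gn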

Change your default account id and Unix group permanently

You can permanently change your default account id and your default Unix group id with the change_primary_group command.  Type:

change_primary_group -l

to see all your groups.  Then type

change_primary_group account-id

to set account-id as your default.

Your default account id changes immediately.  Bridges-2 usage by any batch jobs or interactive sessions following this command is deducted from the new account by default.

Your default Unix group does not change immediately.  It takes about an hour for the change to take effect.  You must log out and log back in after that window for the new Unix group to be the default.

Tracking your usage

There are several ways to track your Bridges-2 usage: the xdusage command, the projects command, and the Grant Management System.

The xdusage command displays project and user account usage information for XSEDE projects. Type man xdusage on Bridges-2 for information.

The projects  command shows information on all Bridges-2 grants, including usage and the Ocean directories associated with the grant.

For more detailed accounting data you can use the Grant Management System.   You can also track your usage through the XSEDE User Portal. The xdusage and projects commands and the XSEDE Portal accurately reflect the impact of a Grant Renewal but the Grant Management System currently does not.

Managing your XSEDE allocation

Most account management functions for your XSEDE grant are handled through the XSEDE User Portal. You can search the XSEDE User Portal for help with common questions.

Changing your default shell

The change_shell command allows you to change your default shell. This command is only available on the login nodes.

To see which shells are available, type

change_shell -l

To change your default shell, type

change_shell newshell

where newshell is one of the choices output by the change_shell -l command. You must log out and back in again for the new shell to take effect.

PSC account policies

The policies documented here are evaluated regularly to assure adequate and responsible administration of PSC systems for users. As such, they are subject to change at any time.

File Retention

PSC provides storage resources for long-term storage and file management.

Files in a PSC storage system are retained for 3 months after the affiliated grant has expired.

Requesting a Refund

When appropriate, PSC provides refunds for jobs that failed due to circumstances beyond your control.

To request a refund, contact a PSC consultant or email help@psc.edu. In the case of batch jobs, we require the standard error and output files produced by the job. These contain information needed in order to refund the job.

File Spaces

There are several distinct file spaces available on Bridges-2, each serving a different function.

  • $HOME, your home directory on Bridges-2
  • $PROJECT, persistent file storage on Ocean. $PROJECT is a larger space than $HOME.
  • $LOCAL, Scratch storage on local disk on the node running a job
  • $RAMDISK, Scratch storage in the local memory associated with a running job

File expiration

Three months after your grant expires all of your Bridges-2 files associated with that grant will be deleted, no matter which file space they are in. You will be able to login during this 3-month period to transfer files, but you will not be able to run jobs or create new files.

File permissions

Access to files in any Bridges-2 space is governed by Unix file permissions. If  your data has additional security or compliance requirements, please contact compliance@psc.edu.

Unix file permissions

For detailed information on Unix file protections, see the man page for the chmod (change mode) command.

To share files with your group, give the group read and execute access for each directory from your top-level directory down to the directory that contains the files you want to share.

chmod g+rx directory-name

Then give the group read and execute access to each file you want to share.

chmod g+rx filename

To give the group the ability to edit or change a file, add write access to the group:

chmod g+rwx filename

Access Control Lists

If you want more fine-grained control than Unix file permissions allow—for example, if you want to give only certain members of a group access to a file, but not all members—then you need to use Access Control Lists (ACLs). Suppose, for example, that you want to give janeuser read access to a file in a directory, but no one else in the group.

Use the setfacl (set file acl) command to give janeuser read and execute access on the directory:

setfacl -m user:janeuser:rx directory-name

for each directory from your top-level directory down to the directory that contains the file you want to share with janeuser. Then give janeuser access to a specific file with

setfacl -m user:janeuser:r filename

User janeuser will now be able to read this file, but no one else in the group will have access to it.

To see what ACLs are set on a file, use the getfacl (get file acl) command.

There are man pages for chmod, setfacl and getfacl.

$HOME

This is your Bridges-2 home directory. It is the usual location for your batch scripts, source code and parameter files. Its path is /jet/home/username, where  username is your PSC username. You can refer to your home directory with the environment variable $HOME. Your home directory is visible to all of Bridges-2’s nodes.

Your home directory is backed up daily, although it is still a good idea to store copies of your important  files in another location, such as the Ocean file system or on a local file system at your site. If you need to recover a home directory file from backup send email to help@psc.edu. The process of recovery will take 3 to 4 days.

$HOME quota

Your home directory has a 25GB quota. You can check your home directory usage using the my_quotas command. To improve the access speed to your home directory files you should stay as far below your home directory quota as you can.

Grant expiration

Three months after a grant expires, the files in your home directory associated with that grant will be deleted.

$PROJECT

The path of your $PROJECT directory on Ocean is /ocean/projects/groupname/username, where groupname is the Unix group id associated with your grant. Use the id command to find your group name.

The command id -Gn will list all the Unix groups you belong to.

The command id -gn will list the Unix group associated with your current session.

If you have more than one grant, you will have a $PROJECT directory for each grant. Be sure to use the appropriate directory when working with multiple grants.

$PROJECT quota

Your usage quota for each of your grants is the Ocean storage allocation you received when your proposal was approved. If your total use in Ocean exceeds this quota you won’t be able to run jobs on Bridges-2 until you are under quota again.

Use the my_quotas  or projects command to check your Ocean usage. You can also check your usage on the XSEDE User Portal.

If you have multiple grants, it is very important that you store your files in the correct $PROJECT directory.

Grant expiration

Three months after a grant expires, the files in any Ocean directories associated with that grant will be deleted.

$LOCAL

Each of Bridges-2’s nodes has a local file system attached to it. This local file system is only visible to the node to which it is attached, and provides fast access to local storage.

In a running job, this file space is available as $LOCAL.

If your application performs a lot of small reads and writes, then you could benefit from using this space.

Node-local storage is only available when your job is running, and can only be used as working space for a running job. Once your job finishes, any files written to $LOCAL are inaccessible and deleted. To use local space, copy files to it at the beginning of your job and back out to a persistent file space before your job ends.

If a node crashes, all node-local files are lost. You should checkpoint these files by copying them to Ocean during long runs.

$LOCAL size

The maximum amount of local space varies by node type.

To check on your local file space usage type:

du -sh

No Service Units accrue for the use of $LOCAL.

Using $LOCAL

To use $LOCAL you must first copy your files to $LOCAL at the beginning of your script, before your executable runs. The following script is an example of how to do this:

RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync -aP $sourcedir $LOCAL/
    RC=$?
    let n=n+1
    sleep 10
done

Set $sourcedir to point to the directory that contains the files to be copied before you call your executable. This code will try at most 20 times to copy your files. If it succeeds, the loop will exit. If an invocation of rsync was unsuccessful, the loop will try again and pick up where it left off.

At the end of your job you must copy your results back from $LOCAL or they will be lost. The following script will do this.

mkdir $PROJECT/results
RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync -aP $LOCAL/ $PROJECT/results
    RC=$?
    let n=n+1
    sleep 10
done

This code fragment copies your files to a directory in your Ocean file space named results, which is created with the mkdir command at the top of the script. It will loop at most 20 times and stop when the copy succeeds.

$RAMDISK

You can use the memory allocated for your job for IO rather than using disk space. In a running job, the environment variable $RAMDISK will refer to the memory associated with the nodes in use.

The amount of memory space available to you depends on the size of the memory on the nodes and the number of nodes you are using. You can only perform IO to the memory of nodes assigned to your job.

If you do not use all of the cores on a node, you are allocated memory in proportion to the number of cores you are using. Note that you cannot use 100% of a node’s memory for IO; some is needed for program and data usage.

This space is only available to you while your job is running, and can only be used as working space for a running job. Once your job ends this space is inaccessible and files there are deleted. To use $RAMDISK, copy files to it at the beginning of your job and back out to a permanent space before your job ends. If your job terminates abnormally, files in $RAMDISK are lost.

Within your job you can cd to $RAMDISK, copy files to and from it, and use it to open files.  Use the command du -sh to see how much space you are using.

If you are running a multi-node job the $RAMDISK variable points to the memory space on the node that is running your rank 0 process.
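As a minimal sketch, a job script fragment that stages data into $RAMDISK and copies results back might look like the following; the input path, output file and executable name are placeholders:

# copy input files into node memory
cp $PROJECT/input/* $RAMDISK/
cd $RAMDISK
# run your program from $RAMDISK
./my_program
# copy results back to persistent storage before the job ends
cp my_output.dat $PROJECT/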

 

Transferring Files

Several methods are available to transfer files into and from Bridges-2.

 

Paths for Bridges-2 file spaces

To copy files into any of your Bridges-2 spaces, you need to know the path to that space on Bridges-2. The full paths for your Bridges-2 directories begin with:

Home directory     /jet/home/username

Ocean directory   /ocean/projects/groupname/username

To find your groupname, use the command id -Gn. All of your valid groupnames will be listed. You have an Ocean directory for each grant you have.

 

Transfers into your Bridges-2 home directory

Your home directory quota is 25GB. More space is available in your $PROJECT file space in Ocean. Exceeding your home directory quota will prevent you from writing more data into your home directory and will adversely impact other operations you might want to perform.

 

Commands to transfer files

You can use rsync, scp, sftp or Globus to copy files to and from Bridges-2.

rsync

You can use the rsync command to copy files to and from Bridges-2. A sample rsync command to copy to a Bridges-2 directory is

rsync -rltDvp -e 'ssh -l username' source_directory data.bridges2.psc.edu:target_directory

Substitute your username for 'username'. Make sure you use the correct groupname in your target directory. Note that by default rsync will overwrite an existing file in the target directory with a source file of the same name if the two differ; see the rsync man page for options that change this behavior.

We recommend the rsync options -rltDvp. See the rsync man page for information on these options and other options you might want to use. We also recommend the option

-oMACS=umac-64@openssh.com

If you use this option, your transfer will use a faster data validation algorithm.

You may want to put your rsync command in a loop to ensure that it completes. A sample loop is

RC=1
n=0
while [[ $RC -ne 0 && $n -lt 20 ]]; do
    rsync source-file target-file
    RC=$?
    let n=n+1
    sleep 10
done

This loop will try your rsync command 20 times. If it succeeds it will exit. If an rsync invocation is unsuccessful the system will try again and pick up where it left off. It will copy only those files that have not already been transferred. You can put this loop, with your rsync command, into a batch script and run it with sbatch.

scp

To use scp for a file transfer you must specify a source and destination for your transfer. The format for either source or destination is

username@machine-name:path/filename

For transfers involving Bridges-2,  username is your PSC username. The machine-name should be given as data.bridges2.psc.edu. This is the name for a high-speed data connector at PSC. We recommend using it for all file transfers using scp involving Bridges-2. Using it prevents file transfers from disrupting interactive use on Bridges-2’s login nodes.

File transfers using scp must specify full paths for Bridges-2 file systems. See Paths for Bridges-2 file spaces for details.
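For example, to copy a local file to your Ocean directory (groupname and username are placeholders for your own values):

scp myfile username@data.bridges2.psc.edu:/ocean/projects/groupname/username/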

sftp

To use sftp, first connect to the remote machine:

sftp username@machine-name

When  Bridges-2 is the remote machine, use your PSC userid as  username. The Bridges-2 machine-name should be specified as data.bridges2.psc.edu. This is the name for a high-speed data connector at PSC.  We recommend using it for all file transfers using sftp involving Bridges-2. Using it prevents file transfers from disrupting interactive use on Bridges-2’s login nodes.

You will be prompted for your password on the remote machine. If Bridges-2 is the remote machine, enter your PSC password.

You can then enter sftp subcommands, like put to copy a file from the local system to the remote system, or get to copy a file from the remote system to the local system.

To copy files into Bridges-2, you must either cd to the proper directory or use full pathnames in your file transfer commands. See Paths for Bridges-2 file spaces for details.
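A minimal example session follows; the username, directory and file names are placeholders:

sftp username@data.bridges2.psc.edu
sftp> cd /ocean/projects/groupname/username
sftp> put myfile
sftp> get results.dat
sftp> quit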

Transferring files using Two-Factor Authentication

If you are required to use Two-Factor Authentication (TFA) to access Bridges-2’s filesystems, you must enroll in XSEDE DUO. Once that is complete, use scp or sftp to transfer files to and from Bridges-2.

TFA users must use port 2222 and XSEDE Portal usernames and passwords, not PSC usernames and passwords. The machine name for these transfers is data.bridges2.psc.edu.

In the examples below, myfile is the local filename, XSEDE-username is your XSEDE Portal username and /path/to/file is the full path to the file on a Bridges-2 filesystem. Note that -P (capital P) is necessary.

scp with TFA

Transfer a file from a local machine to Bridges-2:

scp -P 2222 myfile XSEDE-username@data.bridges2.psc.edu:/path/to/file

Transfer a file from Bridges-2 to a local machine:

scp -P 2222 XSEDE-username@data.bridges2.psc.edu:/path/to/file myfile

sftp

Interactive sftp with TFA
sftp -P 2222 XSEDE-username@data.bridges2.psc.edu

Then use the put command to copy a file from the local machine to Bridges-2, or the get command to transfer a file from Bridges-2 to the local machine.

Graphical SSH client

If you are using a graphical SSH client, configure it to connect to data.bridges2.psc.edu on port 2222/TCP. Login using your XSEDE Portal username and password.

Globus

Globus can be used for any file transfer to Bridges-2. It tracks the progress of the transfer and retries when there is a failure; this makes it especially useful for transfers involving large files or many files.

To use Globus to transfer files you must authenticate either via a Globus account or with InCommon credentials.

To use a Globus account for file transfer, set up a Globus account at the Globus site.

To use InCommon credentials to transfer files to/from Bridges-2, you must first provide your CILogin Certificate Subject information to PSC. Follow these steps:

  1. Find your Certificate Subject string
    1. Navigate your web browser to https://cilogon.org/.
    2. Select your institution from the ‘Select an Identity Provider’ list.
    3. Click the ‘Log On’ button.  You will be taken to the web login page for your institution.
    4. Login with your username and password for your institution.
      • If your institution has an additional login requirement (e.g., Duo), authenticate to that as well.
    5. After successfully authenticating to your institution’s web login interface, you will be returned to the CILogon webpage.
    6. Click on the Certificate Information drop down link to find the ‘Certificate Subject’.
  2. Send your Certificate Subject string to PSC
    1. In the CILogon webpage, select and copy the Certificate Subject text. Take care to get the entire text string if it is broken up onto multiple lines.
    2. Send email to support@psc.edu.  Paste your Certificate Subject field into the message, asking that it be mapped to your PSC username.

Your CILogin Certificate Subject information will be added within one business day, and you will be able to begin transferring files to and from Bridges-2.

Globus endpoints

Once you have the proper authentication you can initiate file transfers from the Globus site. A Globus transfer requires a Globus endpoint, a file path and a file name for both the source and destination. The endpoints for Bridges-2 are:

  • psc#bridges2-xsede if you are using an XSEDE User Portal account for authentication
  • psc#bridges2-cilogon if you are using InCommon for authentication

These endpoints are owned by psc@globusid.org. If you use DUO MFA for your XSEDE authentication, note that it is not used with Globus. You must always specify a full path for the Bridges-2 file systems. See Paths for Bridges-2 file spaces for details.

Programming Environment

Bridges-2 provides a rich programming environment for the development of applications.

C, C++ and Fortran

AMD (AOCC), Intel, Gnu and NVIDIA HPC compilers for C, C++ and Fortran are available on Bridges-2.  Be sure to load the module for the compiler set that you want to use.  Once the module is loaded, you will have access to the compiler commands:

Compiler set    Module name    C        C++       Fortran
AMD             aocc           clang    clang++   flang
Intel           intel          icc      icpc      ifort
Gnu             gcc            gcc      g++       gfortran
NVIDIA          nvhpc          nvc      nvc++     nvfortran

 

There are man pages for each of the compilers.
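For example, a minimal sketch of loading a compiler module and building a serial C program, where myprog.c is a placeholder source file:

module load gcc
gcc -O2 -o myprog myprog.c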


OpenMP programming

To compile OpenMP programs you must add an option to your compile command:

Compiler    Option       Example
Intel       -qopenmp     icc -qopenmp myprog.c
Gnu         -fopenmp     gcc -fopenmp myprog.c
NVIDIA      -mp          nvc -mp myprog.c


MPI programming

Three types of MPI are supported on Bridges-2: MVAPICH2, OpenMPI and Intel MPI. The three MPI types  may perform differently on different problems or in different programming environments. If you are having trouble with one type of MPI, please try using another type. Contact help@psc.edu for more help.

To compile an MPI program, you must:

  • load the module for the compiler that you want
  • load the module for the MPI type you want to use – be sure to choose one that uses the compiler that you are using.   The module name will distinguish between compilers.
  • issue the appropriate MPI wrapper command to compile your program

To run your previously compiled MPI program, you must load the same MPI module that was used in compiling.

To see what MPI versions are available, type module avail mpi  or module avail mvapich2. Note that the module names include the MPI family and version (“openmpi/4.0.2”),  followed by the associated compiler and version (“intel20.4”).  (Modules for other software installed with MPI are also shown.)
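For example, a minimal sketch of building an MPI program with OpenMPI and the Intel compilers; the module version shown is only an example (check module avail for the versions actually installed) and hello_mpi.c is a placeholder source file:

module load intel
module load openmpi/4.0.2-intel20.4
mpicc -o hello_mpi hello_mpi.c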

 

Wrapper commands

 

To use the Intel compilers:

Intel MPI
  Load an intel module plus: mpi-intel
  Compile with: mpiicc (C), mpiicpc (C++), mpiifort (Fortran); note the "ii" in each wrapper command

OpenMPI
  Load an intel module plus: openmpi/version-intelversion
  Compile with: mpicc (C), mpicxx (C++), mpifort (Fortran)

MVAPICH2
  Load an intel module plus: mvapich2/version-intelversion
  Compile with: mpicc code.c -lifcore (C), mpicxx code.cpp -lifcore (C++), mpifort code.f90 -lifcore (Fortran)

 

To use the Gnu compilers:

OpenMPI
  Load a gcc module plus: openmpi/version-gccversion
  Compile with: mpicc (C), mpicxx (C++), mpifort (Fortran)

MVAPICH2
  Load a gcc module plus: mvapich2/version-gccversion
  Compile with: mpicc (C), mpicxx (C++), mpifort (Fortran)

 

To use the NVIDIA compilers:

OpenMPI
  Load an nvhpc module plus: openmpi/version-nvhpcversion
  Compile with: mpicc (C), mpicxx (C++), mpifort (Fortran)

MVAPICH2
  Not available

Custom task placement with Intel MPI

If you wish to specify custom task placement with Intel MPI (this is not recommended),  you must set the environment variable I_MPI_JOB_RESPECT_PROCESS_PLACEMENT to 0. Otherwise the mpirun task placement settings you give will be ignored. The command to do this is:

For the BASH shell:

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

For the CSH shell:

setenv I_MPI_JOB_RESPECT_PROCESS_PLACEMENT 0


Other languages

Other languages, including Java, Python, R,  and MATLAB, are available. See the software page for information.

Debugging and performance analysis

DDT is a debugging tool for C, C++ and Fortran 90 threaded and parallel codes. It is client-server software. Install the client on your local machine and then you can access the GUI on Bridges-2 to debug your code.

See the DDT page for more information.

Software

Bridges-2 has a broad collection of applications installed. See the list of software installed on Bridges-2.

Additional software may be installed by request. If you feel that you need particular software for your research, please send a request to help@psc.edu.

Running Jobs

All production computing must be done on Bridges-2's compute nodes, NOT on Bridges-2's login nodes. The SLURM scheduler (Simple Linux Utility for Resource Management) manages and allocates all of Bridges-2's compute nodes. Several partitions, or job queues, have been set up in SLURM to allocate resources efficiently.

To run a job on Bridges-2, you need to decide how you want to run: interactively, in batch, or through OnDemand;  and where to run - that is, which partitions you are allowed to use.

What are the different ways to run a job?

You can run jobs in Bridges-2 in several ways:

  • interactive sessions - where you type commands and receive output back to your screen as the commands complete
  • batch mode - where you first create a batch (or job) script which contains the commands to be run, then submit the job to be run as soon as resources are available
  • through OnDemand - a browser interface that allows you to run interactively, or create, edit and submit batch jobs. It also provides a graphical interface to tools like RStudio, Jupyter notebooks, and IJulia. More information about OnDemand is in the OnDemand section of this user guide.

Regardless of which way you choose to run your jobs, you will always need to choose a partition to run them in.

Which partitions can I use?

Different partitions control different types of Bridges-2's resources; they are configured by the type of node they control, along with other job requirements like how many nodes or how much time or memory is needed.  Your access to the partitions is based on the type of Bridges-2 allocation that you have: "Bridges-2 Regular Memory", "Bridges-2 Extreme Memory",  “Bridges-2 GPU", or "Bridges-2 GPU-AI". You may have more than one type of allocation; in that case, you will have access to more than one set of partitions.

You can see which of Bridges-2's resources that you have been allocated with the projects command. See section "The projects command" in the Account Administration section of this User Guide for more information.

Interactive sessions

You can do your production work interactively on Bridges-2, typing commands on the command line, and getting responses back in real time.  But you must  be allocated the use of one or more Bridges-2's compute nodes by SLURM to work interactively on Bridges-2.  You cannot use Bridges-2's login nodes for your work.

You can run an interactive session in any of the RM or GPU partitions.  You will need to specify which partition you want, so that the proper resources are allocated for your use.  You cannot run an interactive session in the EM partition. 

If all of the resources set aside for interactive use are in use, your request will wait until the resources you need are available. Using a shared partition (RM-shared, GPU-shared) will probably allow your job to start sooner.

The interact command

To start an interactive session, use the command interact.  The format is:

interact -options

The simplest interact command is

interact

This command will start an interactive job using the defaults for interact, which are

Partition: RM-shared

Cores: 1

Time limit: 60 minutes

Once the interact command returns with a command prompt you can enter your commands. The shell will be your default shell. When you are finished with your job, type CTRL-D.

[user@bridges2-loginr01 ~]$ interact
A command prompt will appear when your session begins
"Ctrl+d" or "exit" will end your session
[user@r004 ~]

Notes:

  • Be sure to use the correct account id for your job if you have more than one grant. See "Managing multiple grants".
  • Service Units (SU) accrue for your resource usage from the time the prompt appears until you type CTRL-D, so be sure to type CTRL-D as soon as you are done.
  • The maximum time you can request is 8 hours. Inactive interact jobs are logged out after 30 minutes of idle time.
  • By default, interact uses the RM-shared partition.  Use the -p option for interact to use a different partition.

Options for interact

If you want to run in a different partition, use more than one core or set a different time limit, you will need to use options to the interact command.   Available options are given below.

 

-p partition
    Partition requested. Default: RM-shared
-t HH:MM:SS
    Walltime requested. The maximum time you can request is 8 hours. Default: 60:00 (1 hour)
-N n
    Number of nodes requested. Default: 1
--ntasks-per-node=n
    Number of cores to allocate per node. Note the "--" for this option. Default: 1
-n NTasks
    Number of tasks spread over all nodes. Default: N/A
--gres=gpu:type:n
    Specifies the type and number of GPUs requested. Note the "--" for this option.
    Valid choices for 'type' are "v100-16" and "v100-32"; see the GPU partitions section of this User Guide for an explanation of the GPU types. Valid choices for 'n' are 1-8.
    Default: N/A
-A account-id
    SLURM account id for the job. See "Managing multiple grants" in the Account Administration section of this User Guide to find or change your default account id.
    Note: Files created during a job will be owned by the Unix group in effect when the job is submitted. This may be different than the account id for the job. See the discussion of the newgrp command in the Account Administration section of this User Guide to see how to change the Unix group currently in effect.
    Default: your default account id
-R reservation-name
    Reservation name, if you have one.
    Use of -R does not automatically set any other interact options. You still need to specify the other options (partition, walltime, number of nodes) to override the defaults for the interact command. If your reservation is not assigned to your default account, then you will need to use the -A option when you issue your interact command.
    Default: N/A
-h
    Help; lists all the available command options.


 

Batch jobs

Instead of working interactively on Bridges-2, you can instead run in batch. This means you will

  • create a file called a batch or job script
  • submit that script to a partition (queue) using the sbatch command
  • wait for the job's turn in the queue
  • if you like, check on the job's progress as it waits in the partition and as it is running
  • check the output file for results or any errors when it finishes

A simple example

This section outlines an example which submits a simple batch job. More detail on batch scripts, the sbatch command and its options follow.

Create a batch script

Use any editor you like to create your batch scripts. A simple batch script named hello.job which runs a "hello world" command is given here. Comments, which begin with '#', explain what each line does.

The first line of any batch script must indicate the shell to use for your batch job.

#!/bin/bash
# use the bash shell
set -x 
# echo each command to standard out before running it
date
# run the Unix 'date' command
echo "Hello world, from Bridges-2!"
# run the Unix 'echo' command

 

Submit the batch script to a partition

Use the sbatch command to submit the hello.job script.

[joeuser@login005 ~]$ sbatch hello.job
Submitted batch job 7408623

Note the jobid that is echoed back to you when the job is submitted.  Here it is 7408623.

Check on the job progress

You can check on the job's progress in the partition by using the squeue command. By default you will get a list of all running and queued jobs. Use the -u option with your username to see only your jobs.  See the squeue command for details.

[joeuser@login005 ~]$ squeue -u joeuser
 JOBID   PARTITION NAME     USER    ST TIME NODES NODELIST(REASON)
 7408623 RM        hello.jo joeuser PD 0:08 1     r7320:00

The status "PD" (pending) in the output here shows that job 7408623 is waiting in the queue.  See more about the squeue command below.

When the job is done, squeue will no longer show it:

[joeuser@login005 ~]$ squeue -u joeuser
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

 

Check the output file when the job is done

By default, the standard output and error from a job are saved in a file with the name slurm-jobid.out, in the directory that the job was submitted from.

[joeuser@login005 ~]$ more slurm-7408623.out
+ date
Sun Jan 19 10:27:06 EST 2020
+ echo 'Hello world, from Bridges-2!'
Hello world, from Bridges-2!
[joeuser@login005 ~]$

 

The sbatch command

To submit a batch job, use the sbatch command.  The format is

sbatch -options batch-script

The options to sbatch can either be in your batch script or on the sbatch command line.  Options in the command line override those in the batch script.

Note:

  • Be sure to use the correct account id if you have more than one grant. Please see the -A option for sbatch to change the SLURM account id for a job. Information on how to determine your valid account ids and change your default account id is in the Account Administration section of this User Guide.
  • In some cases, the options for sbatch differ from the options for interact or srun.
  • By default, sbatch submits jobs to the RM partition.  Use the -p option for sbatch to direct your job to a different partition

Options to the sbatch command

For more information about these options and other useful sbatch options see the sbatch man page.

-p partition
    Partition requested. Default: RM
-t HH:MM:SS
    Walltime requested in HH:MM:SS. Default: 30 minutes
-N n
    Number of nodes requested. Default: 1
-n n
    Number of cores requested in total. Default: none
--ntasks-per-node=n
    Request n cores be allocated per node. Note the "--" for this option. Default: 1
-o filename
    Save standard out and error in filename. This file will be written to the directory that the job was submitted from. Default: slurm-jobid.out
--gres=gpu:type:n
    Specifies the number of GPUs requested. Note the "--" for this option.
    'type' specifies the type of GPU you are requesting. Valid types are "v100-16" and "v100-32"; see the GPU partitions section of this User Guide for information on the GPU types. 'n' is the number of GPUs; valid choices are 1-8.
    Default: N/A
-A account-id
    SLURM account id for the job. If not specified, your default account id is used. See "Managing multiple grants" in the Account Administration section of this User Guide to find your default SLURM account id.
    Note: Files created during a job will be owned by the Unix group in effect when the job is submitted. This may be different than the account id used by the job. See the discussion of the newgrp command in the Account Administration section of this User Guide to see how to change the Unix group currently in effect.
    Default: your default account id
-C constraints
    Specifies constraints which the nodes allocated to this job must satisfy.
    Valid constraints are:
        PERF    Turns on performance profiling. For use with performance profiling software like VTune and TAU.
    See the discussion of the -C option in the sbatch man page for more information. Default: N/A
--res reservation-name
    Use the reservation that has been set up for you. Note the "--" for this option. Use of --res does not automatically set any other options. You still need to specify the other options (partition, walltime, number of nodes) that you would in any sbatch command. If your reservation is not assigned to your default account, then you will need to use the -A option to sbatch to specify the account. Default: N/A
--mail-type=type
    Send email when job events occur, where type can be BEGIN, END, FAIL or ALL. Note the "--" for this option. Default: N/A
--mail-user=username
    User to send email to for the events specified by --mail-type. Note the "--" for this option. Default: the user who submits the job
-d=dependency-list
    Set up dependencies between jobs, where dependency-list can be:
        after:job_id[:jobid...]
            This job can begin execution after the specified jobs have begun execution.
        afterany:job_id[:jobid...]
            This job can begin execution after the specified jobs have terminated.
        aftercorr:job_id[:jobid...]
            A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).
        afternotok:job_id[:jobid...]
            This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc.).
        afterok:job_id[:jobid...]
            This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
        singleton
            This job can begin execution after any previously launched jobs sharing the same job name and user have terminated.
    Default: N/A
--no-requeue
    Specifies that your job will not be requeued under any circumstances. If your job is running on a node that fails, it will not be restarted. Note the "--" for this option. Default: N/A
--time-min=HH:MM:SS
    Specifies a minimum walltime for your job in HH:MM:SS format. Note the "--" for this option.
    SLURM considers the walltime requested when deciding which job to start next. Free slots on the machine are defined by the number of nodes and how long those nodes are free until they will be needed by another job. By specifying a minimum walltime you allow the scheduler to reduce your walltime request to your specified minimum time when deciding whether to schedule your job. This could allow your job to start sooner.
    If you use this option, your actual walltime assignment can vary between your minimum time and the time you specified with the -t option. If your job hits its actual walltime limit, it will be killed. When you use this option you should checkpoint your job frequently to save the results obtained to that point.
    Default: N/A
-h
    Help; lists all the available command options.


 

Managing multiple grants

If you have more than one grant, be sure to use the correct SLURM account id and Unix group when running jobs.

See "Managing multiple grants" in the Account Administration section of this User Guide to see how to find your account ids and Unix groups and determine or change your defaults.

Permanently change your default SLURM account id and Unix group

See the change_primary_group command in the "Managing multiple grants" in the Account Administration section of this User Guide to permanently change your default SLURM account id and Unix group.

Temporarily change your SLURM account id or Unix group

See the -A option to the sbatch or interact commands to set the SLURM account id for a specific job.

The newgrp command will change your Unix group for that login session only. Note that any files created by a job are owned by the Unix group in effect when the job is submitted, which is not necessarily the same as the account id used for the job.  See the newgrp command in the Account Administration section of this User Guide to see how to change the Unix group currently in effect.

 Bridges-2 partitions

Each SLURM partition manages a subset of Bridges-2's resources.  Each partition allocates resources to interactive sessions, batch jobs, and OnDemand sessions that request resources from it.

Not all partitions may be open to you. Your Bridges-2 allocations determine which partitions you can submit jobs to.

A "Bridges-2 Regular Memory" allocation allows you to use Bridges-2's RM (256 and 512GB) nodes.   The RM, RM-shared and RM-512 partitions handle jobs for these nodes.

A "Bridges-2 Extreme Memory" allocation allows you to use  Bridges-2’s 4TB EM nodes.  The EM partition handles jobs for these nodes.

A "Bridges-2 GPU" or "Bridges-2 GPU-AI" allocation allows you to use Bridges-2's GPU nodes. The GPU and GPU-shared partitions handle jobs for these nodes.

All the partitions use FIFO scheduling. If the top job in the partition will not fit, SLURM will try to schedule the next job in the partition. The scheduler follows policies to ensure that one user does not dominate the machine. There are also limits to the number of nodes and cores a user can simultaneously use. Scheduling policies are always under review to ensure best turnaround for users.

RM, RM-shared and RM-512 partitions

Use the appropriate account id for your jobs: If you have more than one Bridges-2 grant, be sure to use the correct SLURM account id for each job.  See “Managing multiple grants”.

For information on requesting resources and submitting  jobs see the discussion of the interact or sbatch commands.

Jobs in the RM and RM-shared partitions run on Bridges-2's 256GB RM nodes. Jobs in the RM-512 partition run on Bridges-2's 512GB RM nodes.

  • Jobs in the RM partition use one or more full nodes. However, the memory space of  all the nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.
  • Jobs in the RM-shared partition use only part of one node. Because SUs are calculated using how many cores are used, using only part of a node will result in a smaller SU charge.
  • Jobs in the RM-512 partition can use one or more full 512GB nodes. These nodes cannot be shared.
RM partition

When submitting a job to the RM partition, you can request:

  • the number of  nodes
  • the walltime limit

If you do not specify the number of nodes or time limit, you will get the defaults.  See the summary table for the RM partition below for the defaults.

Jobs in the RM partition are charged for all 128 cores on every node they use. For a job using one node, that is 128 SUs per hour. If you do not need 128 cores, you can use the RM-shared partition to request only the number of cores that you need. This will reduce the SU charges and your job may begin earlier.

Sample interact command for the RM partition

An example of an interact command for the RM partition, requesting the use of 2 nodes for 30 minutes is

interact -p RM -N 2 -t 30:00

where:

-p indicates the intended partition

-N is the number of nodes requested

-t is the walltime requested in the format HH:MM:SS

Sample sbatch command for the RM partition

An example of a sbatch command to submit a job to the RM partition, requesting one node for 5 hours is

sbatch -p RM -t 5:00:00 -N 1 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

-N is the number of nodes requested

myscript.job is the name of your batch script

 

Sample job script for the RM partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=128

# type 'man sbatch' for more information and options
# this job will ask for 1 full RM node (128 cores) for 5 hours
# this job would potentially charge 640 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out
RM-shared partition

When submitting a job to the RM-shared partition, you can request:

  • the number of  cores
  • the walltime limit

If you do not specify the number of cores or time limit, you will get the defaults.  See the summary table for the RM-shared partition below for the defaults.

 

Sample interact command for the RM-shared partition

An example of an interact command for the RM-shared partition, requesting the use of 64 cores for 30 minutes is

interact -p RM-shared --ntasks-per-node=64 -t 30:00

where:

-p indicates the intended partition

--ntasks-per-node is the number of cores requested

-t is the walltime requested in the format HH:MM:SS

Sample sbatch command for the RM-shared partition

An example of a sbatch command to submit a job to the RM-shared partition, requesting 32 cores for 5 hours is

sbatch -p RM-shared -t 5:00:00 --ntasks-per-node=32 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

--ntasks-per-node is the number of cores requested

myscript.job is the name of your batch script

 

Sample job script for the RM-shared partition
#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=64

# type 'man sbatch' for more information and options
# this job will ask for 64 cores in RM-shared and 5 hours of runtime
# this job would potentially charge 320 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

 

RM-512 partition

When submitting a job to the RM-512 partition, you can request:

  • the number of  nodes
  • the walltime limit

If you do not specify the number of nodes or time limit, you will get the defaults.  See the summary table for the RM partitions below for the defaults.

Sample interact command for the RM-512 partition

An example of an interact command for the RM-512 partition, requesting the use of 2 nodes for 45 minutes is

interact -p RM-512 -N 2 -t 45:00

where:

-p indicates the intended partition

-N is the number of nodes requested

-t is the walltime requested in the format HH:MM:SS

Sample sbatch command for the RM-512 partition

An example of a sbatch command to submit a job to the RM-512 partition, requesting one node for 2 1/2 hours is

sbatch -p RM-512 -t 2:30:00 -N 1 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

-N is the number of nodes requested

myscript.job is the name of your batch script

 

Sample job script for the RM-512 partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-512
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=128

# type 'man sbatch' for more information and options
# this job will ask for 1 full RM 512GB node (128 cores) for 5 hours
# this job would potentially charge 640 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

Summary of partitions for Bridges-2 RM nodes

                     RM         RM-shared   RM-512
Node RAM             256GB      256GB       512GB
Node count default   1          NA          1
Node count max       50         NA          2
Core count default   128        64          128
Core count max       6400       64          256
Walltime default     1 hour     1 hour      1 hour
Walltime max         48 hours   48 hours    48 hours

Partitions for “Bridges-2 Extreme Memory” allocations

The EM partition should be used for “Bridges-2 Extreme Memory” allocations.

Use the appropriate account id for your jobs: If you have more than one Bridges-2 grant, be sure to use the correct SLURM account id for each job.  See “Managing multiple grants”.

For information on requesting resources and submitting  jobs see the discussion of the interact or sbatch commands.

Jobs in the EM partition run on Bridges-2’s EM  nodes, with 4TB of memory.

Jobs in the EM partition must use at least 24 cores.

EM jobs can use more than one node. However, the memory space of  all the nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

When submitting a job to the EM partition, you can request:

  • the number of  cores
  • the walltime limit

Your job will be allocated memory in proportion to the number of cores you request. Be sure to request enough cores to be allocated the memory that your job needs. Memory is allocated at about 1TB per 24 cores. As an example, if your job needs 2TB of memory, you should request 48 cores.
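
Continuing that example, a hedged sbatch sketch (myscript.job is a placeholder) that requests 48 cores, and therefore roughly 2TB of memory, for five hours:

sbatch -p EM -t 5:00:00 --ntasks-per-node=48 myscript.job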

If you do not specify the number of cores or time limit, you will get the defaults.  See the summary table for the EM partition below for the defaults.

Sample sbatch command for the EM partition

An example of a sbatch command to submit a job to the EM partition, requesting an entire node for 5 hours is

sbatch -p EM -t 5:00:00 --ntasks-per-node=96 myscript.job

where:

-p indicates the intended partition

-t is the walltime requested in the format HH:MM:SS

--ntasks-per-node is the number of cores requested per node

myscript.job is the name of your batch script

Sample job script for the EM partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p EM
#SBATCH -t 5:00:00
#SBATCH -n 96

# type 'man sbatch' for more information and options
# this job will ask for 1 full EM node (96 cores) and 5 hours of runtime
# this job would potentially charge 480 EM SUs

# echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

Summary of the EM partition

EM partition
Node               96 cores and 4TB RAM per node
Core default       None
Core min           24
Core max           96 (1 node)
Walltime default   1 hour
Walltime max       120 hours (5 days)
Memory             1TB per 24 cores (~40GB per core)

GPU and GPU-shared partitions

Jobs in the GPU and GPU-shared partitions run on the GPU nodes and are available for Bridges-2 GPU and GPU-AI allocations.

For information on requesting resources and submitting  jobs see the interact or sbatch commands.

Use the appropriate account id for your jobs: If you have more than one Bridges-2 grant, be sure to use the correct SLURM account id for each job. See “Managing multiple grants”.

Jobs in the GPU partition can use more than one node. Jobs in the GPU partition do not share nodes, so jobs are allocated all the cores and all of the GPUs associated with the nodes assigned to them. Your job will incur SU costs for all of the cores on your assigned nodes. The memory space across nodes is not integrated. The cores within a node access a shared memory space, but cores in different nodes do not.

Jobs in the GPU-shared partition use only part of one node. Because SUs are calculated using how many GPUs are used, using only part of a node will result in a smaller SU charge.

GPU types

Bridges-2 has two types of GPU nodes, called "v100-32" and "v100-16".

    • There are 24 v100-32 nodes, each with eight V100 GPUs that have 32GB of GPU memory apiece. These nodes have 512GB RAM per node.
    • There are 9 v100-16 nodes, each with eight V100 GPUs that have 16GB of GPU memory apiece. These nodes have 192GB RAM per node.

All node types can be used in every GPU partition.

The GPU partition

The GPU partition is for jobs that will use one or more entire GPU nodes.

When submitting a job to the GPU partition, you must use one of these options to specify the number of GPUs you want, where n is the number of GPUs per node you are requesting. For the GPU partition, n must always be 8, because you will get the entire node.

  • for an interactive session, use --gres=gpu:type:n
  • for a batch job, use --gpus=type:n

where type is one of "v100-16" or "v100-32".

You can also request

  • the number of nodes
  • the walltime limit

See the sbatch command options for more details on the --gpus option. You should also specify the walltime limit.

Sample interact command for the GPU partition

An interact command to start a GPU job on 2 GPU v100-32 nodes for 30 minutes is

interact -p GPU --gres=gpu:v100-32:8 -N 2 -t 30:00

where:

-p indicates the intended partition
--gres=gpu:v100-32:8 requests the use of 8 GPUs on each v100-32 node
-N 2 requests the use of 2 nodes
-t is the walltime requested in the format HH:MM:SS

 

Sample sbatch command for the GPU partition

A sample sbatch command to submit a job to the GPU partition to use 1 full GPU v100-16 node (all 8 gpus) for 5 hours is

sbatch -p GPU -N 1 --gpus=v100-16:8 -t 5:00:00 jobname

where:

-p indicates the intended partition
-N 1 requests one v100-16 GPU node
--gpus=v100-16:8  requests the use of all 8 GPUs on a v100-16 node
-t is the walltime requested in the format HH:MM:SS
jobname is the name of your batch script

 

Sample job script for the GPU partition

 

#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU
#SBATCH -t 5:00:00
#SBATCH --gpus=v100-32:8

#type 'man sbatch' for more information and options
#this job will ask for 1 full v100-32 GPU node(8 V100 GPUs) for 5 hours
#this job would potentially charge 40 GPU SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

#run pre-compiled program which is already in your project space

./gpua.out

The GPU-shared partition

The GPU-shared partition is for jobs that will use part of one GPU node. You can request at most 4 GPUs from one node in the GPU-shared partition.

When submitting a job to the GPU-shared partition, you must use one of these options to specify the type of GPU node and the total number of GPUs you want, where type indicates what kind of node you want, and n is the number of GPUS you are requesting.

  • for an interactive session, use --gres=gpu:type:n
  • for a batch job, use --gpus=type:n

You can also request

  • the walltime limit

See the GPU partitions section of this User Guide for information on the types of GPU nodes on Bridges-2.

Sample interact command for the GPU-shared partition

An interact command to start a GPU-shared job using 4 v100-32 GPUs for 30 minutes is

interact -p GPU-shared --gres=gpu:v100-32:4 -t 30:00

where:

-p indicates the intended partition
--gres=gpu:v100-32:4  requests the use of 4 GPUs on a v100-32 GPU node
-t is the walltime requested in the format HH:MM:SS

 

Sample sbatch command for the GPU-shared partition

A sample sbatch command to submit a job to the GPU-shared partition to use 2 v100-16 GPUs for 2 hours is

sbatch -p GPU-shared --gpus=v100-16:2 -t 2:00:00 myscript.job

where:

-p indicates the intended partition
--gpus=v100-16:2  requests the use of 2 GPUs on a v100-16 node
-t is the walltime requested in the format HH:MM:SS
myscript.job is the name of your batch script

 

Sample job script for the GPU-shared partition
#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH -t 5:00:00
#SBATCH --gpus=v100-32:4

#type 'man sbatch' for more information and options
#this job will ask for 4 V100 GPUs on a v100-32 node in GPU-shared for 5 hours
#this job would potentially charge 20 GPU SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

#run pre-compiled program which is already in your project space

./gpua.out

Summary of partitions for GPU nodes

                          GPU        GPU-shared
Default number of nodes   1          NA
Max nodes/job             8          NA
Default number of GPUs    8          1
Max GPUs/job              64         4
Default runtime           1 hour     1 hour
Max runtime               48 hours   48 hours

Node, partition, and job status information

sinfo

The sinfo command displays information about the state of Bridges-2's nodes. The nodes can have several states:

alloc Allocated to a job
down Down
drain Not available for scheduling
idle Free
resv Reserved
More information

squeue

The squeue command displays information about the jobs in the partitions. Some useful options are:

-j jobid Displays the information for the specified jobid
-u username Restricts information to jobs belonging to the specified username
-p partition Restricts information to the specified partition
-l (long) Displays information including:  time requested, time used, number of requested nodes, the nodes on which a job is running, job state and the reason why a job is waiting to run.
More information
  • squeue man page for a discussion of the codes for job state, for why a job is waiting to run, and more options.

scancel

The scancel command is used to kill a job in a partition, whether it is running or still waiting to run.  Specify the jobid for the job you want to kill.  For example,

scancel 12345

kills job # 12345.

More information

sacct

The sacct command can be used to display detailed information about jobs. It is especially useful in investigating why one of your jobs failed. The general format of the command is:

sacct -X -j nnnnnn -S MMDDYY --format parameter1,parameter2, ...
  • For 'nnnnnn' substitute the jobid of the job you are investigating.
  • The date given for the -S option is the date at which sacct begins searching for information about your job.
  • The commas between the parameters in the --format option cannot be followed by spaces.

The --format option determines what information to display about a job. Useful parameters are

  • JobID
  • Partition
  • Account - the account id
  • ExitCode - useful in determining why a job failed
  • State - useful in determining why a job failed
  • Start, End, Elapsed - start, end and elapsed time of the job
  • NodeList - list of nodes used in the job
  • NNodes - how many nodes the job was allocated
  • MaxRSS - how much memory the job used
  • AllocCPUs - how many cores the job was allocated
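
As a hedged illustration (the jobid 1234567 and the start date are placeholders), a command that reports most of the parameters listed above for a single job:

sacct -X -j 1234567 -S 010121 --format=JobID,Partition,Account,State,ExitCode,Start,End,Elapsed,NNodes,AllocCPUs,NodeList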
 More information

job_info

The job_info command provides information on completed jobs.  It will display cores and memory allocated and SUs charged for the job.  Options to job_info can be used to get additional information, like the exit code, number of nodes allocated, and more.

Options for job_info are:

--slurm  adds all SLURM information for the job (at the job level), as sacct output
--steps  adds all SLURM information for the job and all job steps (this can be a LOT of output)

[joeuser@br012 ~]$ /opt/packages/allocations/bin/job_info 5149_24
CoresAllocated: 96
EndTime: 2021-01-06T14:32:00.000Z
GPUsAllocated: 0
JobId: 5149_24
MaxTaskMemory_MB: 1552505.0
MemoryAllocated_MB: 4128000
Project: abc123
StartTime: 2021-01-06T13:07:14.000Z
State: COMPLETED
SuCharged: 0.0
SuUsed: 135.627
Username: joeuser

Using the --slurm option provides this output in addition:

[joeuser@br012 ~]$ /opt/packages/allocations/bin/job_info --slurm 5149_24

*** Slurm SACCT data ***
Account: abc123
AllocCPUS: 96
AllocNodes: 1
AllocTRES: billing=96,cpu=96,mem=4128000M,node=1
AssocID: 234
CPUTime: 5-15:37:36
CPUTimeRAW: 488256
Cluster: bridges2
DBIndex: 10092
DerivedExitCode: 0:0
Elapsed: 01:24:46
ElapsedRaw: 5086
Eligible: 2021-01-06T02:27:34
End: 2021-01-06T14:32:00
ExitCode: 0:0
Flags: SchedMain
GID: 15312
Group: abc123
JobID: 5149_24
JobIDRaw: 5196
JobName: run_velveth_gcc10.2.0_96threads_ocean.sbatch
NCPUS: 96
NNodes: 1
NodeList: e002
Partition: EM
Priority: 4294900776
QOS: lm
QOSRAW: 4
ReqCPUS: 96
ReqMem: 4128000Mn
ReqNodes: 1
ReqTRES: billing=96,cpu=96,node=1
Reserved: 10:39:40
ResvCPU: 42-15:28:00
ResvCPURAW: 3684480
Start: 2021-01-06T13:07:14
State: COMPLETED
Submit: 2021-01-06T02:27:33
Suspended: 00:00:00
SystemCPU: 52:13.643
Timelimit: 06:00:00
TimelimitRaw: 360
TotalCPU: 3-15:06:51
UID: 19178
User: joeuser
UserCPU: 3-14:14:37
WCKeyID: 0
WorkDir: /ocean/projects/abc123/joeuser/velvet

 

Monitoring memory usage

It can be useful to find the memory usage of your jobs. For example, you may want to find out if memory usage was a reason a job failed.

You can determine a job's memory usage whether it is still running or has finished. To determine if your job is still running, use the squeue command.

squeue -j nnnnnn -O state

where nnnnnn is the jobid.

For running jobs: srun and top or sstat

You can use the srun and top commands to determine the amount of memory being used.

srun --jobid=nnnnnn top -b -n 1 | grep userid

For nnnnnn substitute the jobid of your job. For 'userid' substitute your userid. The RES field in the output from top shows the actual amount of memory used by a process. The top man page can be used to identify the fields in the output of the top command.

  • See the man pages for srun and top for more information.

You can also use the sstat command to determine the amount of memory being used in a running job

sstat -j nnnnnn.batch --format=JobID,MaxRss

where nnnnnn is your jobid.

More information

See the man page for sstat for more information.

For jobs that are finished: sacct or job_info

If you are checking within a day or two after your job has finished you can issue the command

sacct -j nnnnnn --format=JobID,MaxRss

If this command no longer shows a value for MaxRss, use the job_info command

job_info nnnnnn | grep max_rss

Substitute your jobid for nnnnnn in both of these commands.

More information

Sample Batch Scripts

Both sample batch scripts for some popular software packages and sample batch scripts for general use on Bridges-2 are available.

For more information on how to run a job on Bridges-2, what partitions are available, and how to submit a job, see the Running Jobs section of this user guide.

Sample batch scripts for popular software packages

Sample scripts for some popular software packages are available on Bridges-2 in the directory /opt/packages/examples.  There is a subdirectory for each package, which includes the script along with input data that is required and typical output.

See the documentation for a particular package for more information on using it and how to test any sample scripts that may be available.

Sample batch scripts for common types of jobs

Sample Bridges-2 batch scripts for common job types are given below.

Note that in each sample script:

  • The bash shell is used, indicated by the first line '#!/bin/bash'.  If you use a different shell, some Unix commands will be different.
  • For username and groupname you must substitute your username and your appropriate Unix group.

Sample scripts are available below for jobs in the RM, RM-shared, RM-512, EM, GPU and GPU-shared partitions.

Sample batch script for a job in the RM partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=128

# type 'man sbatch' for more information and options
# this job will ask for 1 full RM node (128 cores) for 5 hours
# this job would potentially charge 640 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

Sample script for a job in the RM-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-shared
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=64

# type 'man sbatch' for more information and options
# this job will ask for 64 cores in RM-shared and 5 hours of runtime
# this job would potentially charge 320 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

 

Sample batch script for a job in the RM-512 partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM-512
#SBATCH -t 5:00:00
#SBATCH --ntasks-per-node=128

# type 'man sbatch' for more information and options
# this job will ask for 1 full RM 512GB node (128 cores) for 5 hours
# this job would potentially charge 640 RM SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

Sample batch script for a job in the EM partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p EM
#SBATCH -t 5:00:00
#SBATCH -n 96

# type 'man sbatch' for more information and options
# this job will ask for 1 full EM node (96 cores) and 5 hours of runtime
# this job would potentially charge 480 EM SUs

# echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

# run a pre-compiled program which is already in your project space

./a.out

Sample batch script for a job in the GPU partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU
#SBATCH -t 5:00:00
#SBATCH --gpus=v100-32:8

#type 'man sbatch' for more information and options
#this job will ask for 1 full v100-32 GPU node(8 V100 GPUs) for 5 hours
#this job would potentially charge 40 GPU SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

#run pre-compiled program which is already in your project space

./gpua.out

Sample batch script for a job in the GPU-shared partition

#!/bin/bash
#SBATCH -N 1
#SBATCH -p GPU-shared
#SBATCH -t 5:00:00
#SBATCH --gpus=v100-32:4

#type 'man sbatch' for more information and options
#this job will ask for 4 V100 GPUs on a v100-32 node in GPU-shared for 5 hours
#this job would potentially charge 20 GPU SUs

#echo commands to stdout
set -x

# move to working directory
# this job assumes:
# - all input data is stored in this directory
# - all output should be stored in this directory
# - please note that groupname should be replaced by your groupname
# - username should be replaced by your username
# - path-to-directory should be replaced by the path to your directory where the executable is

cd /ocean/projects/groupname/username/path-to-directory

#run pre-compiled program which is already in your project space

./gpua.out

OnDemand

The OnDemand interface allows you to conduct your research on Bridges-2 through a web browser. You can manage files – create, edit and move them – submit and track jobs, see job output, check the status of the queues, run a Jupyter notebook through JupyterHub and more, without logging in to Bridges-2 via traditional interfaces.

OnDemand was created by the Ohio Supercomputer Center (OSC). In addition to this document, you can check the extensive documentation for OnDemand created by OSC, including many video tutorials, or email help@psc.edu.

Information about using OnDemand to connect to Bridges-2 is coming soon.

Start OnDemand

To connect to Bridges-2 via OnDemand, point your browser to https://ondemand.bridges2.psc.edu.

  • You will be prompted for a username and password.  Enter your PSC username and password.
  • The OnDemand Dashboard will open.  From this page, you can use the menus across the top of the page to manage files and submit jobs to Bridges-2.

To end your OnDemand session, choose Log Out at the top right of the Dashboard window and close your browser.

 

Manage files

To create, edit or move files, click on the Files menu from the Dashboard window. A dropdown menu will appear, listing all your file spaces on Bridges-2: your home directory and the Ocean directories for each of your Bridges-2 grants.

Choosing one of the file spaces opens the File Explorer in a new browser tab. The files in the selected directory are listed.  No matter which directory you are in, your home directory is displayed in a panel on the left.

There are two sets of buttons in the File Explorer.

Buttons on the top left just below the name of the current directory allow you to View, Edit, Rename, Download, Copy or Paste (after you have moved to a different directory) a file, or you can toggle the file selection with (Un)Select All.

 

Buttons on the top of the window on the right perform these functions:

Go To Navigate to another directory or file system
Open in Terminal Open a terminal window on Bridges-2 in a new browser tab
New File Creates a new empty file
New Dir Create a new subdirectory
Upload Copies a file from your local machine to Bridges-2
Show Dotfiles Toggles the display of dotfiles
Show Owner/Mode Toggles the display of owner and permission settings

 

 

Create and edit jobs

You can create new job scripts, edit existing scripts, and submit those scripts to Bridges-2 through OnDemand.

From the top menus in the Dashboard window, choose Jobs > Job Composer. A Job Composer window will open.

There are two tabs at the top: Jobs and Templates.

In the Jobs tab, a listing of your previous jobs is given.

 

Create a new job script

To create a new job script:

  1. Select a template to begin with
  2. Edit the job script
  3. Edit the job options

Select a template

  1. Go to the Jobs tab in the Jobs Composer window. You have been given a default template, named Simple Sequential Job.
  2. To create a new job script,  click the blue New Job > From Default Template button in the upper left. You will see a green message at the top of the window, “Job was successfully created”.

At the right of the Jobs window, you will see the Job Details, including the location of the script and the script name (by default, main_job.sh). Under that, you will see the contents of the job script in a section titled Submit Script.

Edit the job script

Edit the job script so that it has the commands and workflow that you need.

If you do not want the default settings for a job, you must include options to change them in the job script. For example, you may need more time or more than one node. For the GPU partitions, you must specify the number of GPUs per node that you want. Use an SBATCH directive in the job script to set these options.
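
For example, a hedged sketch of directives you might add near the top of the script (the values are placeholders; adjust them for your partition and allocation):

#SBATCH -p RM
#SBATCH -N 2
#SBATCH -t 8:00:00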

There are two ways to edit the job script: using the Edit Files button or the Open Editor button. First, go to the Jobs tab in the Jobs Composer window. Then either:

  • click the blue Edit Files button at the top of the window, or
  • find the Submit Script section at the bottom right and click the blue Open Editor button.

In either case, an Editor window opens. Make the changes you want and click the blue Save button.

After you save the file, the editor window remains open, but if you return to the Jobs Composer window, you will see that the content of  your script has changed.

Edit the job options

In the Jobs tab in the Jobs Composer window, click the blue Job Options button.

The options for the selected job such as name, the job script to run, and the account to run it under are displayed and can be edited. Click Reset to revert any changes you have made. Click Save or Back to return to the job listing (respectively saving or discarding your edits).

Submit jobs to Bridges-2

Select a job in the Jobs tab in the Jobs Composer window. Click the green Submit button to submit the selected job. A message at the top of the window shows whether the job submission was successful or not.  If it is not, you can edit the job script or options and resubmit. When the job submits successfully, the status of the job in the Jobs Composer window will change to Queued or Running. When  the job completes, the status will change to Completed.


GPU nodes

Bridges-2’s GPU nodes provide substantial, complementary computational power for deep learning, simulations and other applications.

A standard NVIDIA accelerator environment is installed on  Bridges-2’s GPU nodes. If you have programmed using GPUs before, you should find this familiar. Please contact help@psc.edu for more help.

The GPU nodes on Bridges-2 are available to those with a Bridges-2 “GPU” or “GPU-AI” allocation. You can see which of Bridges-2’s resources that you have been allocated with the projects command. See “The projects command” section in the Account Administration section of this User Guide for more information.

Hardware description

See the System configuration section of this User Guide for hardware details for all GPU node types.  Note that soon after Bridges is decommissioned, its GPU-AI resources will be migrated to Bridges-2.  Watch for an announcement.

 

File Systems

The $HOME (/jet/home) and Ocean file systems are available on all of these nodes.  See the File Spaces section of this User Guide for more information on these file systems.

Compiling and Running Jobs

After your codes are compiled, use the GPU partition, either in batch or interactively, to run your jobs. See the Running Jobs section of this User Guide for more information on Bridges-2’s partitions and how to run jobs.

CUDA

More information on using CUDA on Bridges-2 can be found in the CUDA document.

To use CUDA, first you must load the CUDA module. To see all versions of CUDA that are available, type:

module avail cuda

Then choose the version that you need and load the module for it.

module load cuda

loads the default CUDA.   To load a different version, use the full module name.

module load cuda/8.0
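
Once a CUDA module is loaded, the nvcc compiler is on your path. A minimal hedged sketch, assuming a CUDA source file of your own named vector_add.cu:

module load cuda
nvcc -o vector_add vector_add.cu
./vector_add      # run this step on a GPU node, for example in an interactive GPU-shared session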

OpenACC

Our primary GPU programming environment is OpenACC.

The NVIDIA compilers are available on all GPU nodes. To set up the appropriate environment for the NVIDIA compilers, use the module command:

module load nvhpc

Read more about the module command at PSC.

If you will be using these compilers often, it will be useful to add this command to your shell initialization script.

There are many options available with these compilers. See the online NVIDIA documentation for detailed information.  You may find these basic OpenACC options a good place to start:

nvc -acc yourcode.c
nvfortran -acc yourcode.f90

Adding the “-Minfo=accel” flag to the compile command (whether nvfortran, nvc or nvc++) will provide useful feedback regarding compiler errors or success with your OpenACC commands.

nvfortran -acc -Minfo=accel yourcode.f90

Hybrid MPI/GPU Jobs

To run a hybrid MPI/GPU job, use the following commands to compile your program. Use module spider cuda and module spider openmpi to see which module versions are available.

module load cuda
module load openmpi/version-nvhpc-version  
mpicc -acc yourcode.c

When you execute your program you must first issue the above two module load commands.
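
A hedged sketch of the corresponding run step inside a GPU batch job (the module version string, the rank count, and a.out are placeholders):

module load cuda
module load openmpi/version-nvhpc-version
mpirun -n 8 ./a.out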

Profiling and Debugging

For CUDA codes, use the command line profiler nvprof. See the CUDA document for more information.

For OpenACC codes, the environment variables NV_ACC_TIME, NV_ACC_NOTIFY and NV_ACC_DEBUG can provide profiling and debugging information for your job. Specific commands depend on the shell you are using.

                                                Bash shell                C shell
Performance profiling
  Enable runtime GPU performance profiling      export NV_ACC_TIME=1      setenv NV_ACC_TIME 1
Debugging
  Basic debugging (for data transfer
  information, set NV_ACC_NOTIFY to 3)          export NV_ACC_NOTIFY=1    setenv NV_ACC_NOTIFY 1
  More detailed debugging                       export NV_ACC_DEBUG=1     setenv NV_ACC_DEBUG 1
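
For example, a hedged bash sketch that enables the profile summary for the executable used in the sample GPU scripts above:

export NV_ACC_TIME=1
./gpua.out      # an accelerator time summary is printed when the program exits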

Containers

Containers are stand-alone packages holding the software needed to create a very specific computing environment. If you need a very specialized environment, you can create  your own container or use one that is already installed on Bridges-2. Singularity is the only type of container supported on  Bridges-2.

Creating a container

Singularity is the only container software supported on Bridges-2. You can create a Singularity container, copy it to Bridges-2 and then execute your container on Bridges-2, where it can use Bridges-2’s compute nodes and filesystems. In your container you can use any software required by your application: a different version of CentOS,  a different Unix operating system, any software in any specific version needed. You can install your Singularity container without any intervention from PSC staff.

See the PSC documentation on Singularity for more details on producing your own container and Singularity use on Bridges-2.
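
A hedged sketch of that workflow (the image name, definition file, paths, and command are placeholders; building an image typically requires root or --fakeroot privileges on your local machine):

# on your local machine
sudo singularity build mycontainer.sif mycontainer.def

# copy the image to your Bridges-2 project space
scp mycontainer.sif username@bridges2.psc.edu:/ocean/projects/groupname/username/

# on Bridges-2, inside a batch or interactive session
singularity exec /ocean/projects/groupname/username/mycontainer.sif mycommand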

However, Bridges-2 may have all the software you will need.  Before creating a container for your work, check the extensive list of software that has been installed on Bridges-2.   While logged in to Bridges-2, you can also get a list of installed packages by typing

module avail

If you need a package that is not available on Bridges-2 you can request that it be installed by emailing help@psc.edu.  You can also install software packages in your own file spaces and, in some cases, we can provide assistance if you encounter difficulties.

Publicly available containers on Bridges-2

We have installed many containers from the NVIDIA GPU Cloud (NGC) on Bridges-2. These containers are fully optimized, GPU-accelerated environments for AI, machine learning and HPC. They can only be used on the Bridges-2 GPU nodes.

These include containers for:

  • Caffe and Caffe2
  • Microsoft Cognitive Toolkit
  • DIGITS
  • Inference Server
  • MATLAB
  • MXNet
  • PyTorch
  • Tensorflow
  • TensorRT
  • Theano
  • Torch

See the PSC documentation on Singularity for more details on Singularity use on Bridges-2.
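
A hedged sketch of running software from a GPU-enabled container on a GPU node, assuming an image in your project space named pytorch.sif (the image path and script name are placeholders; see the PSC Singularity documentation for the location of the installed NGC images):

singularity exec --nv /ocean/projects/groupname/username/pytorch.sif python my_training_script.py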

Public Datasets

A community dataset space allows Bridges-2’s users from different grants to share data in a common space. Bridges-2 hosts both community (public) and private datasets, providing rapid access for individuals, collaborations and communities with appropriate protections.

These datasets are available to anyone with a Bridges-2 account:

 

2019nCoVR: 2019 Novel Coronavirus Resource

The 2019 Novel Coronavirus Resource concerns the outbreak of novel coronavirus in Wuhan, China since December 2019. For more details about the statistics, metadata, publications, and visualizations of the data, please visit https://bigd.big.ac.cn/nco.

Available on Bridges-2 at /ocean/datasets/community/genomics/2019nCoVR.

COCO

COCO (Common Objects in Context) is a large scale image dataset designed for object detection, segmentation, person keypoints detection, stuff segmentation, and caption generation. Please visit http://cocodataset.org/ for more information on COCO, including details about the data, paper, and tutorials.

Available on Bridges-2 at /ocean/datasets/community/COCO.

PREVENT-AD

The PREVENT-AD (Pre-symptomatic Evaluation of Experimental or Novel Treatments for Alzheimer Disease) cohort is composed of cognitively healthy participants over 55 years old, at risk of developing Alzheimer Disease (AD) as their parents and/or siblings were/are affected by the disease. These ‘at-risk’ participants have been followed for a naturalistic study of the presymptomatic phase of AD since 2011 using multimodal measurements of various disease indicators. Two clinical trials intended to test pharmaco-preventive agents have also been conducted. The PREVENT-AD research group is now releasing data openly with the intention to contribute to the community’s growing understanding of AD pathogenesis.

Available on Bridges-2 at /ocean/datasets/community/prevent_ad.

ImageNet

ImageNet is an image dataset organized according to WordNet hierarchy. See the ImageNet website for complete information.

Available on Bridges-2 at /ocean/datasets/community/imagenet.

Natural Language Toolkit (NLTK) Data

NLTK comes with many corpora, toy grammars, trained models, etc. A complete list of the available data is posted at: http://nltk.org/nltk_data/.

Available on Bridges-2 at /ocean/datasets/community/nltk.

MNIST

Dataset of handwritten digits used to train image processing systems.

Available on Bridges-2 at /ocean/datasets/community/mnist.

Genomics datasets

These datasets  are available to anyone with a Bridges-2 allocation. They are stored under /ocean/datasets/community/genomics.

BLAST

The BLAST databases can be accessed through the environment variable $BLAST_DATABASE after loading the BLAST module.
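
For example, a hedged sketch (the exact module name and version may differ; check module avail blast):

module load BLAST
ls $BLAST_DATABASE      # list the databases provided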

Prokka

The Prokka databases can be accessed through the environment variable $PROKKA_DATABASES after loading the Prokka module.

Pfam

The Pfam database is available at /ocean/datasets/community/genomics/pfam.

Gateways

Bridges-2 hosts a number of gateways – web-based, domain-specific user interfaces to applications, functionality and resources that allow users to focus on their research rather than programming and submitting jobs. Gateways  provide intuitive, easy-to-use interfaces to complex functionality and data-intensive workflows.

Gateways can manage large numbers of jobs and provide collaborative features, security constraints and provenance tracking, so that you can concentrate on your analyses instead of on the mechanics of accomplishing them.

Security Guidelines and Policies

PSC policies regarding privacy, security and the acceptable use of PSC resources are documented here. Questions about any of these policies should be directed to PSC User Services.

See also policies for:

Security Measures

Security is very important to PSC. These policies are intended to ensure that our machines are not misused and that your data is secure.

What You Can Do:

You play a significant role in security!  To keep your account and PSC resources secure, please:

  • Be aware of and comply with PSC’s policies on security, use and privacy found in this document
  • Choose strong passwords and don’t share them between accounts or with others. More information can be found in the PSC password policies.
  • Utilize your local security team for advice and assistance
  • Take the online XSEDE Cybersecurity Tutorial. Go to Online Training and click on “XSEDE Cybersecurity (CI-Tutor)”.
  • Keep your computer properly patched and protected
  • Report any security concerns to the PSC help desk ASAP by calling the PSC hotline at: 412-268-4960 or email help@psc.edu
What We Will Never Do:
  • PSC will never send you unsolicited emails requesting confidential information.
  • We will also never ask you for your password via an unsolicited email or phone call.

Remember that the PSC help desk is always a phone call away to confirm any correspondence at 412-268-4960.

If you have replied to an email appearing to be from PSC and supplied your password or other sensitive information, please contact the help desk immediately.

What You Can Expect:
  • We will send you email when we need to communicate with you about service outages, new HPC resources, and the like.
  • We will send you email when your password is about to expire and ask you to change it by using the web-based PSC password change utility.

Other Security Policies

Be mindful of your RM usage

Bridges RM users: note that the new Bridges-2 RM nodes have 128 cores per node, a significant...