Bridges-2

ACCESS has replaced XSEDE. View the FAQ page for information.

Bridges-2, PSC’s flagship supercomputer, began production operations in March 2021. It is funded by a $10-million grant from the National Science Foundation.

Bridges-2 provides transformative capability for rapidly evolving computation- and data-intensive research, and creates opportunities for collaboration and convergence research. It supports both traditional and non-traditional research communities and applications. Bridges-2 integrates new technologies for converged, scalable HPC, machine learning and data; prioritizes researcher productivity and ease of use; and provides an extensible architecture for interoperation with complementary data-intensive projects, campus resources, and clouds.

Bridges-2 is available at no cost for research and education, and at cost-recovery rates for other purposes.

Core Concepts
  • Converged HPC + AI + Data
  • Custom topology optimized for data-centric HPC, AI, and HPDA
  • Heterogeneous node types for different aspects of workflows
  • CPUs and AI-targeted GPUs
  • 3 tiers of per-node RAM: 256GB, 512GB, 4TB
  • Extremely flexible software environment
  • Community data collections & Big Data as a Service
Innovation
  • AMD EPYC 7742 CPUs: 64 cores, 2.25–3.4 GHz
  • AI scaling to 192 V100-32GB SXM2 GPUs
  • 100TB, 9M IOPs flash array accelerates deep learning training, genomics, and other applications
  • Mellanox HDR-200 InfiniBand doubles bandwidth & supports in-network MPI-Direct, RDMA, GPUDirect, SR-IOV, and data encryption
  • Cray ClusterStor E1000 Storage System
  • HPE DMF single namespace for data security and expandable archiving
Regular Memory

Regular Memory (RM) nodes will provide extremely powerful, general-purpose computing, machine learning and data analytics, AI inferencing, and pre- and post-processing.

488 RM nodes will have 256GB of RAM, and 16 will have 512GB of RAM.

All RM nodes will have:

  • NVMe SSD (3.84TB)
  • Mellanox ConnectX-6 HDR InfiniBand 200Gb/s Adapter
  • Two AMD EPYC 7742 CPUs, each with:
    • 64 cores
    • 2.25-3.40GHz
    • 256MB L3
    • 8 memory channels
Extreme Memory

Bridges-2 Extreme Memory (EM) nodes will provide 4TB of shared memory for genome sequence assembly, graph analytics, statistics, and other applications requiring a large amount of memory for which distributed-memory implementations are not available.

Each of Bridges-2’s 4 EM nodes will consist of:

  • 4TB of RAM: DDR4-2933
  • NVMe SSD (7.68TB)
  • Mellanox ConnectX-6 HDR InfiniBand 200Gb/s Adapter
  • Four Intel Xeon Platinum 8260M “Cascade Lake” CPUs, each with:
    • 24 cores
    • 2.40–3.90GHz
    • 35.75MB LLC
    • 6 memory channels
GPU

Bridges-2’s 24 GPU nodes provide exceptional performance and scalability for deep learning and accelerated computing; each node delivers a total of 40,960 CUDA cores and 5,120 tensor cores.

Each GPU node will contain:

  • 512GB of RAM: DDR4-2933
  • 7.68TB NVMe SSD
  • Two Mellanox ConnectX-6 HDR InfiniBand 200Gb/s Adapters
  • Eight NVIDIA Tesla V100-32GB SXM2 GPUs
  • 1 Pf/s of peak tensor performance
  • Two Intel Xeon Gold 6248 “Cascade Lake” CPUs, each with:
    • 20 cores, 2.50–3.90GHz, 27.5MB LLC, 6 memory channels
Bridges-2 FAQs

Why do I get an error when I try to start an interactive session on an EM node?

Because there are only 4 EM nodes, interactive access is not permitted.  Please submit a job through SLURM. For more information, see the Running Jobs section of the Bridges-2 User Guide.
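For example, an EM job can be submitted with a batch script along these lines. This is a sketch: the walltime, core count, and program name are placeholders, and your resource requests should match your actual needs.

```shell
#!/bin/bash
#SBATCH -p EM                 # EM partition (4TB shared-memory nodes)
#SBATCH -t 24:00:00           # walltime, HH:MM:SS
#SBATCH --ntasks-per-node=24  # cores requested

# run a large-memory application (placeholder name)
./my_assembly_program
```

Submit the script with sbatch, e.g. `sbatch myjob.sh`.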

 

What’s the maximum time a job can run?

It depends on which partition you are submitting to.  Each partition has a maximum time limit set. However, these limits can change at any time.

To see what the current limits are, type

sacctmgr show qos format=name%15,maxwall | grep partition

The output will show the maximum time limits for the partitions, where the time format is Days-Hours:Minutes:Seconds.

The limit shown for ‘rmpartition’ applies to both the RM and RM-shared partitions. Similarly, the limit for ‘gpupartition’ applies to both the GPU and GPU-shared partitions.

rmpartition 2-00:00:00
gpupartition 2-00:00:00
empartition 5-00:00:00
rm512partition 2-00:00:00

Here you can see that the maximum time allowed in the RM and RM-shared partitions is two days (48 hours).  The maximum time allowed in the EM partition is five days.

All scheduling policies, including the time limits, are always under review to  ensure the best turnaround for users, and are subject to change at any time.
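To request a specific walltime, set it in your batch script with the -t directive. The sketch below requests the two-day maximum shown above for the RM partition, using the same Days-Hours:Minutes:Seconds format; the program name is a placeholder.

```shell
#!/bin/bash
#SBATCH -p RM          # full-node RM partition
#SBATCH -N 1           # one full node (128 cores)
#SBATCH -t 2-00:00:00  # request the partition maximum: 2 days

./my_program           # placeholder application name
```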

 

 

Why was I charged for 128 RM cores when I used less than that?

Your job probably ran in the RM partition.  Jobs in the RM partition use one or more full RM nodes, which have 128 cores each.

If you need 64 cores or fewer, you can use the RM-shared partition. Jobs in RM-shared use half or less of one RM node.

See the Partitions section of the Bridges-2 User Guide for more information.
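As a sketch, a job needing only 16 cores could be submitted to RM-shared like this (the walltime and program name are placeholders):

```shell
#!/bin/bash
#SBATCH -p RM-shared          # shared partition: charged only for cores allocated
#SBATCH -N 1                  # RM-shared jobs run on a single node
#SBATCH --ntasks-per-node=16  # 16 of the node's 128 cores (max 64 in RM-shared)
#SBATCH -t 01:00:00

./my_program                  # placeholder application name
```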

 

 

What is the difference between the RM and RM-shared or GPU and GPU-shared partitions?

Jobs in the RM partition use one or more full RM nodes, and are allocated all 128 cores on those nodes.  Jobs in the RM-shared partition use only half of the cores (or less) on one RM node, and share the node with other jobs.

Similarly, jobs in the GPU partition use one or more entire GPU nodes, and are allocated all 8 GPUs on each node.  Jobs in GPU-shared use at most 4 GPUs and share the node with other jobs.

Jobs in RM and GPU partitions are charged for the entire node (128 cores or 8 GPUs, respectively).  Jobs in RM-shared  and GPU-shared are only charged for the cores or GPUs that they are allocated.

See the Partitions section of the Bridges-2 User Guide for more information.
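For example, a GPU-shared job requesting 2 of a node's 8 V100 GPUs might look like the following sketch. The program name is a placeholder, and the GPU-request syntax shown (--gpus=v100-32:n) is the form used on Bridges-2; check the User Guide to confirm the exact syntax.

```shell
#!/bin/bash
#SBATCH -p GPU-shared      # shared partition: charged per GPU
#SBATCH -N 1               # GPU-shared jobs use a single node
#SBATCH --gpus=v100-32:2   # 2 GPUs (GPU-shared maximum is 4)
#SBATCH -t 04:00:00

./my_training_script       # placeholder application name
```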

 

 

Can I reserve nodes on Bridges-2?

Yes, if you have a significant reason that requires setting aside nodes for your exclusive use.  Your account will be charged for the entire length of the reservation.

See the Reservation section of the Bridges-2 User Guide for more information.

 

 

SLURM error messages: What does this salloc or sbatch error mean?

Here are SLURM error messages for some common issues.  If you have questions about these or other SLURM errors you see, please contact help@psc.edu.

salloc: error: Job submit/allocate failed: Invalid qos specification

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources.  If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use  #SBATCH -A ChargeID.

In an interact command, use interact -A ChargeID.

sbatch: error: Allocation requesting N gpus, GPU-shared maximum is 4
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for more than 4 GPUs in the GPU-shared partition.  Jobs in GPU-shared can only request up to half of one GPU node, a total of 4 GPUs.

sbatch: error: Allocation requesting N nodes, use GPU partition for multiple nodes
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for multiple GPU nodes in the GPU-shared partition. To request multiple GPU nodes, you must use the GPU partition.
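A sketch of a two-node job in the GPU partition follows. The GPU count with --gpus is the job total under standard SLURM semantics, and the launcher and program name are placeholders.

```shell
#!/bin/bash
#SBATCH -p GPU              # full-node GPU partition (required for multi-node GPU jobs)
#SBATCH -N 2                # two GPU nodes
#SBATCH --gpus=v100-32:16   # all 8 GPUs on each of the 2 nodes
#SBATCH -t 08:00:00

mpirun ./my_distributed_training  # placeholder launcher and application
```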

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources. If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use #SBATCH -A ChargeID

In an interact command, use interact -A ChargeID

sbatch: error: QOSMaxCpuPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

This error generally indicates that you are asking for more cores than allowed in the partition.  For example, jobs in the RM-shared partition are limited to half of one node, which is 64 cores maximum.

sbatch: error: QOSMaxWallDurationPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

Most often this error indicates that you are requesting more time than is allowed in a partition.

Make sure to check the maximum time allowed for a partition in the Running Jobs section of the Bridges-2 User Guide.

 
