
Bridges-2 FAQ

Why do I get an error when I try to start an interactive session on an EM node?

Because there are only 4 EM nodes, interactive access is not permitted.  Please submit a job through SLURM. For more information, see the Running Jobs section of the Bridges-2 User Guide.

What's the maximum time a job can run?

It depends on which partition you are submitting to.  Each partition has a maximum time limit set. However, these limits can change at any time.

To see what the current limits are, type

sacctmgr show qos format=name%15,maxwall | grep partition

The output will show the maximum time limits for partitions, where the time format is Days-Hours:Minutes:Seconds.

The limit shown for ‘rmpartition’ applies to both the RM and RM-shared partitions. Similarly, the limit for ‘gpupartition’ applies to both the GPU and GPU-shared partitions.

rmpartition 2-00:00:00
gpupartition 2-00:00:00
empartition 5-00:00:00
rm512partition 2-00:00:00

Here you can see that the maximum time allowed in the RM and RM-shared partitions is two days (48 hours). The maximum time allowed in the EM partition is five days.

All scheduling policies, including the time limits, are always under review to ensure the best turnaround for users, and are subject to change at any time.

Why was I charged for 128 RM cores when I used less than that?

Your job probably ran in the RM partition.  Jobs in the RM partition use one or more full RM nodes, which have 128 cores each.

If you need 64 cores or fewer, you can use the RM-shared partition. Jobs in RM-shared use at most half of one RM node.

See the Partitions section of the Bridges-2 User Guide for more information.
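As a minimal sketch, an RM-shared batch script along these lines avoids being charged for a full node (the program name, core count, and walltime here are placeholders; adjust them for your own work):

```shell
#!/bin/bash
#SBATCH -p RM-shared           # shared partition: charged only for the cores you request
#SBATCH -N 1                   # RM-shared jobs run on (part of) a single node
#SBATCH --ntasks-per-node=64   # at most half a node: 64 of the 128 cores
#SBATCH -t 01:00:00            # walltime; must be within the partition's limit

# ./my_program is a placeholder for your own executable
./my_program
```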

What is the difference between the RM and RM-shared or GPU and GPU-shared partitions?

Jobs in the RM partition use one or more full RM nodes, and are allocated all 128 cores on those nodes.  Jobs in the RM-shared partition use only half of the cores (or less) on one RM node, and share the node with other jobs.

Similarly, jobs in the GPU partition use one or more entire GPU nodes, and are allocated all 8 GPUs on each node.  Jobs in GPU-shared use at most 4 GPUs and share the node with other jobs.

Jobs in RM and GPU partitions are charged for the entire node (128 cores or 8 GPUs, respectively).  Jobs in RM-shared  and GPU-shared are only charged for the cores or GPUs that they are allocated.

See the Partitions section of the Bridges-2 User Guide for more information.
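As a sketch, a GPU-shared request might look like the following. Note that the exact form of the GPU request (for example, whether a GPU type string is required in the --gres option) can vary; check the Bridges-2 User Guide for the precise syntax. The program name and walltime are placeholders:

```shell
#!/bin/bash
#SBATCH -p GPU-shared    # shared GPU partition: charged per GPU, not per node
#SBATCH -N 1             # GPU-shared jobs use part of a single node
#SBATCH --gres=gpu:4     # up to 4 of the node's 8 GPUs
#SBATCH -t 08:00:00      # walltime; must be within the partition's limit

# placeholder for your own executable
./my_gpu_program
```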

SLURM error messages: What does this salloc or sbatch error mean?

Here are SLURM error messages for some common issues.  If you have questions about these or other SLURM errors you see, please contact help@psc.edu.

salloc: error: Job submit/allocate failed: Invalid qos specification

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources.  If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use  #SBATCH -A ChargeID.

In an interact command, use interact -A ChargeID.
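For example, assuming a hypothetical ChargeID of abc123 (run the projects command to find your actual ChargeIDs):

```shell
# In a batch script (abc123 is a placeholder ChargeID):
#SBATCH -A abc123

# On the command line, to start an interactive session under that charge:
interact -A abc123
```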

sbatch: error: Allocation requesting N gpus, GPU-shared maximum is 4
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for more than 4 GPUs in the GPU-shared partition.  Jobs in GPU-shared can only request up to half of one GPU node, a total of 4 GPUs.

sbatch: error: Allocation requesting N nodes, use GPU partition for multiple nodes
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for multiple GPU nodes in the GPU-shared partition. To request multiple GPU nodes, you must use the GPU partition.
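A multi-node GPU request must therefore target the GPU partition, where jobs take whole nodes. A sketch of the relevant directives (the node count is a placeholder, and the exact GPU request syntax may vary; see the Bridges-2 User Guide):

```shell
#SBATCH -p GPU          # full-node GPU partition
#SBATCH -N 2            # multiple nodes are only allowed in GPU, not GPU-shared
#SBATCH --gres=gpu:8    # all 8 GPUs on each node
```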

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources. If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use #SBATCH -A ChargeID

In an interact command, use interact -A ChargeID

sbatch: error: QOSMaxCpuPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

This error generally indicates that you are asking for more cores than allowed in the partition.  For example, jobs in the RM-shared partition are limited to half of one node, which is 64 cores maximum.
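To stay within the limit, either reduce the request to at most half a node, or switch to the RM partition if you genuinely need more cores. A sketch of the two options (node count is a placeholder):

```shell
# Option 1: stay in RM-shared with at most half a node
#SBATCH -p RM-shared
#SBATCH --ntasks-per-node=64

# Option 2: use full nodes in RM (charged for all 128 cores on each node)
#SBATCH -p RM
#SBATCH -N 2
```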

sbatch: error: QOSMaxWallDurationPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

Most often this error indicates that you are requesting more time than is allowed in a partition.

Make sure to check the maximum time allowed for a partition in the Running Jobs section of the Bridges-2 User Guide.