
Bridges-2 FAQ

Why do I get an error when I try to start an interactive session on an EM node?

Because there are only 4 EM nodes, interactive access is not permitted.  Please submit a job through SLURM. For more information, see the Running Jobs section of the Bridges-2 User Guide.

What's the maximum time a job can run?

It depends on which partition you are submitting to.  Each partition has a maximum time limit set. However, these limits can change at any time.

To see what the current limits are, type

sacctmgr show qos format=name%15,maxwall | grep partition

The output will show the maximum time limits for partitions, where the time format is Days-Hours:Minutes:Seconds.

The limit shown for ‘rmpartition’ applies to both the RM and RM-shared partitions. Similarly, the limit for ‘gpupartition’ applies to both the GPU and GPU-shared partitions.

rmpartition 2-00:00:00
gpupartition 2-00:00:00
empartition 5-00:00:00
rm512partition 2-00:00:00

Here you can see that the maximum time allowed in the RM and RM-shared partitions is two days (48 hours). The maximum time allowed in the EM partition is five days.

All scheduling policies, including the time limits, are always under review to ensure the best turnaround for users, and are subject to change at any time.

Why was I charged for 128 RM cores when I used less than that?

Your job probably ran in the RM partition.  Jobs in the RM partition use one or more full RM nodes, which have 128 cores each.

If you need 64 cores or fewer, you can use the RM-shared partition. Jobs in RM-shared use at most half of one RM node.

See the Partitions section of the Bridges-2 User Guide for more information.
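As a minimal sketch, an RM-shared batch script along these lines avoids being charged for a full node (the program name, core count, and walltime here are placeholders; adjust them for your own work):

```shell
#!/bin/bash
#SBATCH -p RM-shared           # shared partition: charged only for the cores you request
#SBATCH -N 1                   # RM-shared jobs run on (part of) a single node
#SBATCH --ntasks-per-node=64   # at most half a node: 64 of the 128 cores
#SBATCH -t 01:00:00            # walltime; must be within the partition's limit

# ./my_program is a placeholder for your own executable
./my_program
```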

What is the difference between the RM and RM-shared or GPU and GPU-shared partitions?

Jobs in the RM partition use one or more full RM nodes, and are allocated all 128 cores on those nodes.  Jobs in the RM-shared partition use only half of the cores (or less) on one RM node, and share the node with other jobs.

Similarly, jobs in the GPU partition use one or more entire GPU nodes, and are allocated all 8 GPUs on each node.  Jobs in GPU-shared use at most 4 GPUs and share the node with other jobs.

Jobs in RM and GPU partitions are charged for the entire node (128 cores or 8 GPUs, respectively).  Jobs in RM-shared  and GPU-shared are only charged for the cores or GPUs that they are allocated.

See the Partitions section of the Bridges-2 User Guide for more information.
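As a sketch, a GPU-shared request might look like the following. Note that the exact form of the GPU request (for example, whether a GPU type string is required in the --gres option) can vary; check the Bridges-2 User Guide for the precise syntax. The program name and walltime are placeholders:

```shell
#!/bin/bash
#SBATCH -p GPU-shared    # shared GPU partition: charged per GPU, not per node
#SBATCH -N 1             # GPU-shared jobs use part of a single node
#SBATCH --gres=gpu:4     # up to 4 of the node's 8 GPUs
#SBATCH -t 08:00:00      # walltime; must be within the partition's limit

# placeholder for your own executable
./my_gpu_program
```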

SLURM error messages: What does this salloc or sbatch error mean?

Here are SLURM error messages for some common issues.  If you have questions about these or other SLURM errors you see, please contact help@psc.edu.

salloc: error: Job submit/allocate failed: Invalid qos specification

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources.  If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use  #SBATCH -A ChargeID.

In an interact command, use interact -A ChargeID.
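For example, assuming a hypothetical ChargeID of abc123 (run the projects command to find your actual ChargeIDs):

```shell
# In a batch script (abc123 is a placeholder ChargeID):
#SBATCH -A abc123

# On the command line, to start an interactive session under that charge:
interact -A abc123
```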

sbatch: error: Allocation requesting N gpus, GPU-shared maximum is 4
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for more than 4 GPUs in the GPU-shared partition.  Jobs in GPU-shared can only request up to half of one GPU node, a total of 4 GPUs.

sbatch: error: Allocation requesting N nodes, use GPU partition for multiple nodes
sbatch: error: Batch job submission failed: Access/permission denied

You are asking for multiple GPU nodes in the GPU-shared partition. To request multiple GPU nodes, you must use the GPU partition.
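A multi-node GPU request must therefore target the GPU partition, where jobs take whole nodes. A sketch of the relevant directives (the node count is a placeholder, and the exact GPU request syntax may vary; see the Bridges-2 User Guide):

```shell
#SBATCH -p GPU          # full-node GPU partition
#SBATCH -N 2            # multiple nodes are only allowed in GPU, not GPU-shared
#SBATCH --gres=gpu:8    # all 8 GPUs on each node
```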

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

This error most often occurs when you are trying to run a job on a resource that you do not have permissions to use.  To check, run the projects command to verify that you have access to that resource.

It is also possible that you have multiple projects, and those projects have access to different sets of resources. If that is the case, be sure to specify the correct ChargeID in your batch job or interact command.

In a batch job, use #SBATCH -A ChargeID

In an interact command, use interact -A ChargeID

sbatch: error: QOSMaxCpuPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

This error generally indicates that you are asking for more cores than allowed in the partition.  For example, jobs in the RM-shared partition are limited to half of one node, which is 64 cores maximum.
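To stay within the limit, either reduce the request to at most half a node, or switch to the RM partition if you genuinely need more cores. A sketch of the two options (node count is a placeholder):

```shell
# Option 1: stay in RM-shared with at most half a node
#SBATCH -p RM-shared
#SBATCH --ntasks-per-node=64

# Option 2: use full nodes in RM (charged for all 128 cores on each node)
#SBATCH -p RM
#SBATCH -N 2
```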

sbatch: error: QOSMaxWallDurationPerJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)

Most often this error indicates that you are requesting more time than is allowed in a partition.

Make sure to check the maximum time allowed for a partition in the Running Jobs section of the Bridges-2 User Guide.