Bridges

Advancing state of the art high-performance computing, communications and data analytics.

 

Bridges is a uniquely capable resource for empowering new research communities and bringing together HPC, AI and Big Data. Bridges is designed to support familiar, convenient software and environments for both traditional and non-traditional HPC users. Its richly-connected set of interacting systems offers exceptional flexibility for data analytics, simulation, workflows and gateways, leveraging interactivity, parallel computing, Spark and Hadoop. Bridges includes:

  • Compute nodes with hardware-supported shared memory ranging from 128GB to 12TB per node to support genomics, machine learning, graph analytics and other fields where partitioning data is impractical
  • GPU nodes to accelerate applications as diverse as machine learning, image processing and materials science
  • Database nodes to drive gateways and workflows and to support fusion, analytics, integration and data management
  • Webserver nodes to host gateways and provide access to community datasets
  • Data transfer nodes with 10 GigE connections to enable data movement between Bridges and XSEDE, campuses, instruments and other advanced cyberinfrastructure

Need more information? See the Bridges User Guide or the FAQ.  

Questions? PSC User Services is here to help you get your research started and keep it on track.  If you have questions at any time, you can send email to help@psc.edu.

FAQ

Which grant will my job be charged to, and which Unix group will own the files it creates?

There are two parameters to consider:

  • the SLURM account id, which determines which grant the SUs used by a job are deducted from, and
  • the Unix group, which determines which group owns any files created by the job.

See "Managing multiple grants" in the Account Administration section of the Bridges User Guide for information on determining your default SLURM account id and Unix group, and on changing them either permanently or temporarily, for just one job or login session.
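
As a minimal sketch, the grant and group used by a single job can be chosen with standard SLURM and Unix commands; the account id ab1234_ip and group groupname below are placeholders for the values shown by the projects and id commands:

  projects                            # list your grants and their SLURM account ids
  id -Gn                              # list the Unix groups you belong to
  sbatch -A ab1234_ip myjob.sh        # charge this job to a specific grant
  newgrp groupname                    # change your Unix group for the rest of this login session
  sg groupname -c 'sbatch myjob.sh'   # submit one job whose files will be group-owned by groupname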

 
What does the error "Invalid qos specification" mean?

"Invalid qos specification" means that you asked for a resource that you do not have access to. For example:

  • submitting a job to one of the GPU partitions when you don't have a Bridges_GPU allocation
  • submitting a job to one of the LM partitions when you don't have a Bridges_Large allocation

Please check your grants to see what you have been allocated.  The projects command will list your grants and the resources allocated to each one.

This error can also occur when you have more than one active grant.  It's important to run jobs under the correct SLURM account id. Your jobs will run under your default SLURM account id unless you specify a different account id to use for a job. If the default grant does not have access to the resources you are requesting, you will get this error.

Use the projects command to find which grant is your default if you have more than one.

See the Account Administration section of the Bridges User Guide for information on finding your SLURM account ids, setting a non-default account id for a specific job, and changing your default account id permanently.
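
As a hedged example, suppose the projects command shows that the grant with SLURM account id ab1234_gp (a placeholder) carries your Bridges_GPU allocation while your default grant does not; naming that account explicitly avoids the error. The partition name and GPU request below follow the typical form shown in the Bridges User Guide; check the Running Jobs section for the exact options:

  projects                                                  # list your grants, their account ids and allocated resources
  sbatch -p GPU --gres=gpu:p100:2 -A ab1234_gp gpu_job.sh   # charge the job to the grant that has GPU access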

When will my job run?

It is not possible to predict with any accuracy when your job will run.

The scheduler on Bridges is largely FIFO. The squeue command lists the running and queued jobs on Bridges in FIFO order. However, a job can move up in the queue if a slot opens on the machine that it fits into but the jobs ahead of it in the FIFO ordering do not (backfill). In addition, jobs can finish before their requested walltime for a variety of reasons.
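
To see where your jobs stand, you can list the queue with the standard SLURM squeue command, for example:

  squeue -l              # all running and pending jobs, in long format
  squeue -l -u $USER     # only your jobs, with their state and reason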

Why is my job waiting in the queue?

There can be many reasons that a job is waiting in the queue.

  • There are jobs in front of you in the queue. If the status field in the output from squeue -l says 'Priority', this is the case.
  • There is a maximum number of nodes your RM jobs can cumulatively request. This limit varies with the load on the system. If the status field in the squeue -l output says 'QOSMaxCPUPerUserLimit', this is the case.
  • There is a maximum number of cores your RM-shared jobs can cumulatively request. This limit varies with the load on the system. If the status field in the squeue -l output says 'QOSMaxCPUPerUserLimit', this is the case.
  • There is a maximum number of GPU nodes your GPU jobs can simultaneously use. This limit applies to the P100 and K80 nodes combined. If the status field in the squeue -l output says 'QOSMaxGRESPerUser', this is the case.
  • The partition you want to use is down. If the status field in the squeue -l output says 'PartitionDown', this is the case.
  • There are currently no nodes available to run your job. This can be because
    • They are already running jobs
    • There are a lot of reserved nodes. The sinfo command shows reserved nodes.
    • There are a lot of down nodes. The sinfo command shows down nodes.
    • They are being reserved for a job with a reservation that is about to start
    • They are being held because a system drain is in progress, for example ahead of scheduled maintenance. If the status field in the squeue -l output says 'ReqNodeNotAvailable', this is the case. The output is somewhat misleading because it lists every node on the machine that is unavailable to run jobs, even nodes on which your job could not run because they are in a different partition.
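
The codes above appear in the last column of squeue -l output (typically labeled NODELIST(REASON)) for pending jobs, and sinfo summarizes node states per partition. A quick check might look like:

  squeue -l -j 123456    # replace 123456 with your job id; the reason, e.g. Priority, is in the last column
  sinfo                  # node counts by state (alloc, idle, resv, drain, down) for each partition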

How can I improve my job's turnaround?

There are several techniques you can use to try to improve your turnaround, although, since the scheduler is FIFO with backfill, you may just have to wait your turn.

  1. Estimate your walltime as closely as possible without underestimating.
  2. Use flexible walltime by using the --time-min option to the sbatch command. The use of this option is described in the Running Jobs section of the Bridges User Guide. A sketch follows this list.
  3. Use job bundling (also called job packing) to combine your jobs. This combines several smaller jobs into one, and there is a better chance for one job to start than for multiple smaller jobs. Job bundling is described in the Sample Scripts section of the Bridges User Guide. A sketch follows this list.
  4. Space out your job submissions. Bridges runs a fairshare scheduler. If you have run a lot of jobs recently, the priority of your queued jobs will be reduced slightly until they have accumulated some waiting time in the queue.
  5. Consider whether this task needs to be run on Bridges. If you can conveniently complete it on your local system, you do not have to wait.
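
The sketches below illustrate techniques 2 and 3; job.sh and task1 through task4 are placeholder names, and the partition, task count and walltimes are assumptions to be adjusted to your work (see the Running Jobs and Sample Scripts sections of the Bridges User Guide for the forms supported on Bridges):

  # Technique 2: flexible walltime -- accept any slot between 4 and 8 hours
  sbatch --time=8:00:00 --time-min=4:00:00 job.sh

  # Technique 3: job bundling -- a single RM-shared job that runs four small tasks at once
  #!/bin/bash
  #SBATCH -p RM-shared
  #SBATCH -N 1
  #SBATCH --ntasks-per-node=4
  #SBATCH -t 2:00:00
  ./task1 & ./task2 & ./task3 & ./task4 &
  wait                   # return only after all backgrounded tasks have finished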

Introducing Bridges-2

Bridges-2, PSC's newest supercomputer, will debut in early 2021.  It will be funded by a $10-million grant from the National Science Foundation.