Utilizing Bridges-2 for Deep Learning Distributed Training

September 26, 2023, 1-2pm ET


During this webinar, we will show examples of how to deploy multi-GPU training for deep learning applications on Bridges-2, using frameworks such as PyTorch and TensorFlow. Planned topics include how to modify code to run training on a single node or on multiple nodes, how to set up the environment, how to run jobs in interactive mode, as batch jobs, or from a Jupyter notebook, and which factors affect performance.

About the presenter

Mei-Yu Wang earned her Ph.D. in astrophysics from the University of Pittsburgh. Her doctoral research focused on developing novel probes for studying dark matter. She did postdoctoral research on dark matter and the Milky Way at Texas A&M University and Carnegie Mellon University before she joined the HPC AI and Big Data Group at PSC in 2022. Her primary roles now include addressing support requests and developing tests and benchmarks for the Neocortex system and the Open Compass project.


  • Overview of the Bridges-2 GPU partition
  • Overview of distributed deep learning frameworks
    • PyTorch
    • TensorFlow
    • Horovod
  • Examples of how to run on Bridges-2
    • Batch job/interactive mode/Open OnDemand
    • How to set up the environment
    • Explanation of the script/code structure
    • Optional: demo 
  • Examples of distributed training performance on Bridges-2
    • Single node
    • Multiple nodes
  • Conclusion