Bridges User Guide: Data Collections

A community dataset space allows Bridges users from different grants to share data in a common space. Bridges hosts both public and private datasets, providing rapid access for individuals, collaborations, and communities with appropriate protections.

Community datasets are appropriate when data will be shared among Bridges groups. Any data that should be accessed by only one group should be stored in that group’s pylon5 space.

Public datasets

Some data collections are available to anyone with a Bridges account. They include:

2019nCoVR: 2019 Novel Coronavirus Resource

Last updated: August 27, 2020

The 2019 Novel Coronavirus Resource covers the outbreak of the novel coronavirus first reported in Wuhan, China, in December 2019. For more details about the statistics, metadata, publications, and visualizations of the data, please visit the 2019nCoVR website.

Available on Bridges at /pylon5/datasets/community/genomics/2019nCoVR.


COCO

COCO (Common Objects in Context) is a large-scale image dataset designed for object detection, segmentation, person keypoint detection, stuff segmentation, and caption generation. Please visit the COCO website for more information, including details about the data, paper, and tutorials.

Available on Bridges at /pylon5/datasets/community/COCO


PREVENT-AD

The PREVENT-AD (Pre-symptomatic Evaluation of Experimental or Novel Treatments for Alzheimer Disease) cohort is composed of cognitively healthy participants over 55 years old who are at risk of developing Alzheimer Disease (AD) because their parents and/or siblings are or were affected by the disease. These ‘at-risk’ participants have been followed since 2011 in a naturalistic study of the presymptomatic phase of AD, using multimodal measurements of various disease indicators. Two clinical trials intended to test pharmaco-preventive agents have also been conducted. The PREVENT-AD research group is now releasing data openly to contribute to the community’s growing understanding of AD pathogenesis.

Available on Bridges at /pylon5/datasets/community/prevent_ad


ImageNet

ImageNet is an image dataset organized according to the WordNet hierarchy. See the ImageNet website for complete information.

Available on Bridges at /pylon5/datasets/community/imagenet

Natural Language Toolkit (NLTK) Data

NLTK comes with many corpora, toy grammars, trained models, and other resources. A complete list of the available data is posted on the NLTK website.

Available on Bridges at /pylon5/datasets/community/nltk
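
One way to point NLTK at the Bridges copy instead of downloading the data again is sketched below. NLTK reads the NLTK_DATA environment variable when it is imported, so the variable is set first. This sketch assumes the Bridges directory follows the standard nltk_data layout, which this guide does not state; nltk_search_path is an illustrative helper, not part of NLTK.

```python
import os

# Path taken from this guide.
BRIDGES_NLTK_DATA = "/pylon5/datasets/community/nltk"


def nltk_search_path(existing=None):
    """Build an NLTK_DATA value that lists the Bridges copy first.

    `existing` is the current NLTK_DATA value (or None); its entries are
    kept after the Bridges directory, without duplicating it.
    """
    parts = [BRIDGES_NLTK_DATA]
    if existing:
        for entry in existing.split(os.pathsep):
            if entry and entry != BRIDGES_NLTK_DATA:
                parts.append(entry)
    return os.pathsep.join(parts)


# Must run before `import nltk`, because NLTK reads NLTK_DATA at import time.
os.environ["NLTK_DATA"] = nltk_search_path(os.environ.get("NLTK_DATA"))
```

After this runs, import nltk as usual; resource lookups should then search the Bridges directory before any other location.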


MNIST

A dataset of handwritten digits used to train image processing systems.

Available on Bridges at /pylon5/datasets/community/mnist

Genomics Data

Several genomics datasets are publicly available, in addition to the 2019nCoVR: 2019 Novel Coronavirus Resource listed above.

  • The BLAST databases can be accessed through the environment variable $BLASTDB after loading the BLAST module.
  • CAMI (Critical Assessment of Metagenome Interpretation) is a community-led initiative designed to help tackle challenges in metagenome assembly and analysis by aiming for an independent, comprehensive, and bias-free evaluation of methods. Data from the first CAMI challenge is available at /pylon5/datasets/community/genomics/cami.
  • Repbase is the most commonly used database of repetitive DNA elements. To use the Repbase database, you must register with Repbase and send proof of registration.
  • The University of California at Santa Cruz (UCSC) reference genomes are available at /pylon5/datasets/community/genomics/UCSC. The collection includes human, mouse, and Drosophila genomes.
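
As a rough illustration of the $BLASTDB mechanism, the sketch below reads the variable from a Python session started after the BLAST module has been loaded. The helper name blast_db_dirs is hypothetical, and splitting the value on the path separator is a precaution in case several directories are listed; this guide does not say whether that occurs on Bridges.

```python
import os
from pathlib import Path


def blast_db_dirs(env=None):
    """Return the directories named by $BLASTDB, or an empty list if unset."""
    env = os.environ if env is None else env
    value = env.get("BLASTDB", "")
    return [Path(entry) for entry in value.split(os.pathsep) if entry]


# Report each directory and whether it is visible from this session.
for db_dir in blast_db_dirs():
    status = "found" if db_dir.is_dir() else "missing"
    print(f"{db_dir}: {status}")
```

If nothing is printed, $BLASTDB is not set in the current environment, which usually means the BLAST module has not been loaded in that session.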

Other genomics datasets

Other available datasets are typically used with a particular genomics package. These include:

  • Barrnap: /pylon5/datasets/community/genomics/barrnap
  • CheckM: /pylon5/datasets/community/genomics/checkm
  • Dammit uniref90
  • Homer: /pylon5/datasets/community/genomics/homer
  • Kraken: /pylon5/datasets/community/genomics/kraken
  • Long Ranger: /pylon5/datasets/community/genomics/longranger
  • MetaPhlAn2: /pylon5/datasets/community/genomics/metaphlan2
  • Phylosift: /pylon5/datasets/community/genomics/phylosift
  • Prokka: /pylon5/datasets/community/genomics/prokka
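
To see which of the community datasets are visible to your account, a short script along these lines may help. It is only a sketch: the root /pylon5/datasets/community is inferred from the paths listed in this guide, and readable_datasets is an illustrative helper name.

```python
import os
from pathlib import Path

# Common prefix of the dataset paths listed in this guide.
COMMUNITY_ROOT = Path("/pylon5/datasets/community")


def readable_datasets(root=COMMUNITY_ROOT):
    """Return the names of subdirectories of `root` that the current user
    can read and enter. Returns an empty list if `root` does not exist."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and os.access(entry, os.R_OK | os.X_OK)
    )


print(readable_datasets())
```

Passing a different root, for example the genomics subdirectory, lists the package-specific datasets in the same way.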

Other useful datasets

The following datasets may also be useful. They are not currently installed on Bridges, but you can copy them to your pylon5 space.

Deep Learning

Keras Datasets for Import

These datasets are available from the Keras datasets module (keras.datasets):

  • CIFAR10 small image classification
  • CIFAR100 small image classification
  • IMDB Movie reviews sentiment classification
  • Reuters newswire topics classification
  • MNIST database of handwritten digits
  • Fashion-MNIST database of clothing
  • Boston housing price regression dataset (from CMU)

Image Databases

Natural Language Processing

Audio and Audio-Visual

Machine Learning

Scikit-Learn Datasets for Import

Multi-class classification and clustering


Binary Classification

Univariate Time Series

Multivariate Time Series