
The Coma Cluster contains more than 1,000 galaxies. Scientists have long been frustrated by large uncertainties in its mass.

Predicted mass of huge Coma Cluster agrees with earlier, human-intensive attempts; offers fast, accurate measurement needed to understand early Universe

For cosmologists trying to study the formation of the Universe, knowing the mass of everything is critical. But the need to estimate the mass of dark matter, which can’t be observed directly, limits their accuracy. A team of scientists led from Carnegie Mellon University (CMU) has trained artificial intelligence (AI) on data from simulated clusters of galaxies, in which the composition of all the components is known. This AI went on to predict a mass for the real-world Coma Cluster of galaxies that agrees with those from earlier, more human-labor-intensive attempts. The result offers the possibility of faster, more accurate assessment of the masses of galaxy clusters.


One of the biggest questions that science is trying to answer is how Everything came about. We’ve learned a lot about the origins of the Universe in a Big Bang 13.7 billion years ago, in broad terms. But the details sometimes don’t add up.

Today, cosmologists are wrestling with how galaxy clusters form and persist. The hundreds to thousands of galaxies these vast structures contain appear to be moving too fast for their collective gravity to keep them together. Even when scientists take into account mysterious dark matter — which is impossible to detect directly despite making up 85 percent of the matter in the Universe — the uncertainties are much larger than scientists are comfortable with.

Because galaxies in a cluster revolve around its center of mass, scientists can tell how much mass is in that cluster by how fast the galaxies are moving. Light from galaxies moving away from us is slightly redshifted: much like the lower tone of a train whistle moving away, their light is a bit redder. Light from galaxies moving toward us is, in the same way, shifted a bit toward blue. Measuring the difference between the two shows how fast the galaxies are wheeling around. Higher speeds mean there has to be more mass holding the cluster together. But the need to estimate the mass of the (invisible) dark matter, hot ionized gases and visible galaxies means large uncertainties. Also, scientists haven’t yet worked out the three-dimensional structures of the clusters, which further limits their confidence that they understand what’s going on.
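To make the speed-to-mass reasoning concrete, here is a minimal Python sketch. It is a textbook back-of-the-envelope estimate, not the CMU team's method: it converts a wavelength shift into a line-of-sight velocity using the low-speed Doppler approximation, then applies the rough virial relation M ≈ 3σ²R/G, where σ is the spread of galaxy velocities about the cluster average and R is a cluster radius. The function names, the factor of 3, and the example numbers (1,000 km/s and 2 megaparsecs) are illustrative assumptions, not measurements of the Coma Cluster.

```python
import statistics

C_KM_S = 299_792.458   # speed of light, km/s
G_SI = 6.674e-11       # gravitational constant, m^3 kg^-1 s^-2
M_SUN_KG = 1.989e30    # mass of the Sun, kg
MPC_M = 3.0857e22      # one megaparsec, metres


def los_velocity_km_s(observed_nm, rest_nm):
    """Line-of-sight velocity from a wavelength shift (low-speed Doppler approximation).

    Positive means the source is receding (redshift); negative means approaching.
    """
    return C_KM_S * (observed_nm - rest_nm) / rest_nm


def velocity_dispersion_km_s(velocities_km_s):
    """Spread (sample standard deviation) of galaxy velocities about the cluster mean."""
    return statistics.stdev(velocities_km_s)


def virial_mass_msun(sigma_km_s, radius_mpc):
    """Rough virial mass in solar masses: M ~ 3 * sigma^2 * R / G (assumed prefactor)."""
    sigma_m_s = sigma_km_s * 1.0e3
    radius_m = radius_mpc * MPC_M
    return 3.0 * sigma_m_s**2 * radius_m / G_SI / M_SUN_KG


# Illustrative inputs only: a 1,000 km/s velocity spread inside a 2 Mpc radius.
example_mass = virial_mass_msun(1000.0, 2.0)
print(f"Rough virial mass: {example_mass:.2e} solar masses")
```

Because the mass scales with the square of the velocity spread, doubling the measured speeds quadruples the inferred mass, which is why modest errors in the velocities turn into large uncertainties in the mass.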

“Galaxy clusters are exactly what they sound like … groups of hundreds to thousands of galaxies that all seem to be in an equilibrium orbit around each other. But realistically, the amount of matter in the individual galaxies isn’t enough to … keep them all in orbit … Understanding their distribution in space and time is very important for us to constrain models of cosmology.”—Matthew Ho, CMU

Matthew Ho, a graduate student working in Hy Trac’s group at the McWilliams Center for Cosmology at CMU, wanted to know whether AI could be used to determine the mass of the Coma Cluster, a huge array of galaxies 321 million light years from Earth. An AI approach to the problem, he reasoned, would allow the mass of galaxy clusters to be estimated much more quickly than the painstaking surveys of the past. Just as importantly, it offered a way around the uncertainties, and potentially around other biases that humans inevitably introduce with their initial assumptions.

Working with colleagues at CMU, Johns Hopkins University and the University of California, Santa Barbara, Ho turned to PSC’s Bridges-2 advanced research computer, as well as Vera, the CMU Department of Physics supercomputing system run by PSC.


To tackle the Coma Cluster problem, Ho would use a powerful AI tool called deep learning. This type of AI works by first feeding the computer data in which the right answer is labeled by humans. Because the computer is so much faster than humans, it can learn how to connect the data with the correct answer by trial and error. Initially, it creates a series of interconnected “layers” that represent different aspects of the data. It then adjusts these connections until its answers match the human-supplied labels. Once it does that, scientists test the AI against data it hasn’t seen, with the correct answers withheld, and compare its predictions with those answers. Once it gives correct answers in this testing phase, it’s ready to work on data for which humans don’t already have the answers.
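The train-then-test cycle described above can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not the team's deep learning model: it fits a single straight line rather than a multi-layer network, and its "simulated clusters" come from a made-up relation between velocity spread and mass (the slope of 3, offset of 6 and noise level are all assumptions chosen for illustration). What it does show is the workflow: adjust parameters by trial and error on labeled data, then check the result on held-out examples the model never saw during training.

```python
import random

PIVOT = 2.85  # centre the input so gradient descent converges quickly


def make_simulated_clusters(n, rng):
    """Toy 'simulation' with known answers: a made-up log-mass vs. log-speed relation."""
    clusters = []
    for _ in range(n):
        log_sigma = rng.uniform(2.5, 3.2)                          # log10 of speed spread, km/s
        log_mass = 3.0 * log_sigma + 6.0 + rng.gauss(0.0, 0.05)    # the known "label"
        clusters.append((log_sigma, log_mass))
    return clusters


def predict(params, log_sigma):
    """Model prediction: a straight line in the centred input."""
    w, b = params
    return w * (log_sigma - PIVOT) + b


def train(clusters, epochs=2000, lr=0.3):
    """Adjust w and b by gradient descent until predictions match the labels."""
    w, b = 0.0, 0.0
    n = len(clusters)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for log_sigma, log_mass in clusters:
            err = predict((w, b), log_sigma) - log_mass
            grad_w += 2.0 * err * (log_sigma - PIVOT) / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def rmse(params, clusters):
    """Root-mean-square error of predictions against the known labels."""
    squared = [(predict(params, s) - m) ** 2 for s, m in clusters]
    return (sum(squared) / len(squared)) ** 0.5


rng = random.Random(0)                       # fixed seed so the run is repeatable
simulated = make_simulated_clusters(500, rng)
train_set, test_set = simulated[:400], simulated[400:]   # hold out 100 for testing
params = train(train_set)
print(f"held-out RMSE in log10(mass): {rmse(params, test_set):.3f}")
```

The held-out error is the number that matters: a model that only memorized its training examples would score well on them and poorly here.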

Constructing an accurate training data set, then, is key to getting good results. This is particularly the case when we know that the real data have issues, such as the ones that limit the cluster mass measurements. So Ho used his AI to analyze earlier simulations of galaxy clusters on the National Science Foundation-funded Bridges-2 as well as Vera. By using artificial galaxy clusters whose composition was completely known, he could be sure that the computer was working with accurate data.

Creating accurate artificial galaxy clusters, though, was a tall order given that the simulation had to include so many “particles.” In all, the simulation would begin with hundreds of gigabytes of data, enough to fill dozens if not hundreds of laptops. Then it would have to carry out computations on that data, which would balloon the electronic bits being juggled.

“Bridges-2’s core concept of combining high performance computing, artificial intelligence/machine learning and Big Data analysis is nicely aligned with the computing requirements of our project. In particular, the large-memory nodes and GPU nodes provide versatility, performance and scalability.”—Hy Trac, CMU

Bridges-2’s Big Data capabilities made it an ideal fit for the problem. With large-memory nodes of 512 GB and 4,000 GB, it could fit all the data in one node, greatly speeding the largest simulation processing tasks by cutting down the time needed for communication between nodes. Along with processing on Vera, this allowed Ho to create a clean training data set that his AI program, also running on Vera, used to learn how to judge galaxy cluster mass. In previous work the team had also used Bridges-2’s advanced GPU nodes, which are well suited to the many parallel computations that AI requires.

When let loose on real-world data from the Coma Cluster, the AI produced results that agreed with previous, human-guided estimates of the galaxy cluster’s mass. This result lent credence to the earlier attempts to remove observational biases, as the computer had started with none of the assumptions that the humans had. It also gave Ho confidence that the computer was giving a correct answer, not just one that agreed with the earlier studies. More importantly, it suggests that the AI is capable, when given data for other real galaxy clusters, of producing similarly reliable results. The scientists published their results in the journal Nature Astronomy in June 2022.