Bridges Connects Evolutionary Biologists with Genomes of Wild Species
July 13, 2016
Why the Sumatran Rhinoceros Is Important
Depressing but true: things don’t look good for the Sumatran rhinoceros. This unique tropical species is all but extinct in the wild. To make matters worse, the animals aren’t doing well in zoos either. Recently the world zoo community started shipping its rhinos to Indonesia, so that the surviving captive animals can be maintained in a central location, in a climate more suitable to their survival in captivity.
That is tragic, because the Sumatran rhino almost certainly has a lot to tell us about evolutionary survival and other species’ responses to climate change. Its closest relative is the Ice Age woolly rhinoceros of Eurasia. The ancestors of the Sumatran rhinoceros most likely started out as a cool-weather grassland species that adapted to a warmer and wetter climate. Herman Mays, an evolutionary biologist at Marshall University in West Virginia, teamed up with Jim Denvir, co-director of the genomics core facility at Marshall’s School of Medicine, to use PSC’s Bridges system to piece together the DNA sequence of this animal before it vanishes, and while it may yet offer clues to the species’ survival.
“The whole-genome approach is a whole new world for evolutionary biologists studying wild animals. We can look at functional differences in the entire genome at once. This allows us to look at how species specialized and how they got to be the way they are today.”
—Herman Mays, Marshall University
Why the Narcissus Flycatcher Is Important
The Narcissus flycatcher is a bird with multiple personalities. Most populations spend summers at high latitudes, as far north as the island of Sakhalin in Russia, then migrate south to South China, Indochina and Borneo for the winter. But there’s another population in the southern Ryukyu Islands of Japan that doesn’t follow these rules. Enjoying the more consistent, warm subtropical climate of these lower latitudes, the Ryukyu population doesn’t migrate.
This unique case—a single species, with both migratory and non-migratory populations—offers biologists insights into the genetic traits that have evolved to make long-distance migration possible. A team led by Herman Mays and Jim Denvir of Marshall University decided to apply advanced sequencing techniques and PSC’s Bridges system to create the first genetic sequence for the Narcissus flycatcher. Their aim is to understand how climate changes and genes interact to create this amazing phenomenon of migration.
“Working with Bridges I got a lot of help from [PSC’s] Phil Blood. He was very helpful in writing the code, as I didn’t have experience working on the [Bridges] environment. He also helped me determine how to do the job, estimate how long it would take and how to monitor its status.”
—Swanthana Rekulapally, Marshall University
How PSC and XSEDE Helped
Swanthana Rekulapally and undergraduate Megan Justice, working in Denvir’s team at the Marshall genomics facility, began by trying to assemble the flycatcher sequence, the smaller of the two. The bird genome has 1 billion DNA bases, while the rhino genome has 3.3 billion.
But the team soon ran into some serious problems with the computing resources they had available. These limitations had to do with the nature of the “brute force” sequence assembly they needed to perform. Most “high-throughput” sequencing technology can only read 200 to 250 base pairs of nucleic acid at a time. When assembling a billion-base genetic sequence, you get a huge jumble of tens of millions of overlapping DNA “reads” that you need to piece together by computer, much as a person would assemble the pieces of a jigsaw puzzle.
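To give a feel for the size of that jumble, here is a rough back-of-the-envelope sketch in Python. The genome size and read lengths come from the text above; the sequencing coverage (how many times, on average, each base is read) is a hypothetical value chosen only for illustration and is not a figure from the Marshall project.

```python
def estimate_read_count(genome_bases, read_length, coverage):
    """Rough number of reads needed to cover a genome `coverage` times over."""
    total_bases_sequenced = genome_bases * coverage
    # Ceiling division: a partial read still counts as one read.
    return -(-total_bases_sequenced // read_length)


FLYCATCHER_GENOME_BASES = 1_000_000_000  # from the article
ASSUMED_COVERAGE = 10  # hypothetical value, not reported in the article

for read_length in (200, 250):  # read lengths quoted in the article
    reads = estimate_read_count(
        FLYCATCHER_GENOME_BASES, read_length, ASSUMED_COVERAGE
    )
    print(f"{read_length}-base reads at {ASSUMED_COVERAGE}x coverage: {reads:,}")
```

Under that assumed coverage, the estimate lands in the tens of millions of reads, which is the scale the assembly software has to sort through.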
For well-studied “model” species, such as humans, lab mice, fruit flies and the like, scientists already know the genetic sequence. A researcher assembling the genome of a new individual or a related species can use the known sequence to guide the assembly, much as we’d use the picture on the cover of a jigsaw puzzle box as a guide. But there aren’t existing assembled genomes for these wild species. There’s no box cover. Instead of a jigsaw puzzle, the task becomes more like the game Concentration. To find the overlaps, the computer needs to remember all the sequences it’s already looked at when it looks at a new DNA fragment. The more it can hold in its memory at once, the faster it can assemble the genome.
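The "remember everything you've seen" idea can be sketched with a toy example. This is not the assembly software the Marshall team ran, which the article does not name; it is a minimal illustration, with made-up reads and a hypothetical `find_overlaps` helper, of why holding an index of all reads in memory makes overlap-finding fast.

```python
from collections import defaultdict


def find_overlaps(reads, min_overlap=3):
    """Find pairs where the end of one read matches the start of another.

    Returns a list of (a, b, length) tuples: the last `length` bases of
    reads[a] equal the first `length` bases of reads[b].
    """
    # The "memory": every read, indexed by its first few bases.
    prefix_index = defaultdict(list)
    for i, read in enumerate(reads):
        prefix_index[read[:min_overlap]].append(i)

    overlaps = []
    for a, read in enumerate(reads):
        # Try every suffix of this read that is at least min_overlap long.
        for start in range(1, len(read) - min_overlap + 1):
            suffix = read[start:]
            # Look up only the reads that begin with the same bases.
            for b in prefix_index.get(suffix[:min_overlap], []):
                if b != a and reads[b].startswith(suffix):
                    overlaps.append((a, b, len(suffix)))
    return overlaps


# Three made-up reads that chain together: ATGCGT -> GCGTAC -> TACGGA
toy_reads = ["ATGCGT", "GCGTAC", "TACGGA"]
print(find_overlaps(toy_reads))  # [(0, 1, 4), (1, 2, 3)]
```

The index grows with the number of reads, so a real data set of tens of millions of reads needs a correspondingly large amount of memory, which is the constraint the team ran into.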
“Almost everything we do at our core facility has medical applications. We tend to work with human samples or mouse samples in which the genome is much better annotated and known. Studying rhinos and songbirds was a nice departure.”
—Jim Denvir, Marshall University
The supercomputers available to the Marshall scientists had 500 gigabytes of shared memory. That is a lot compared with the 16 gigabytes of RAM you might find on a high-end laptop computer. But it wasn’t enough: the flycatcher assembly kept crashing the system by running out of memory. That’s when Jack Smith, the National Science Foundation XSEDE network’s campus champion at Marshall, suggested they check out the supercomputers available through XSEDE. With help from XSEDE Extended Collaborative Support Service member and genomics expert Phil Blood at PSC, the team reviewed XSEDE’s resources. They decided the 3-terabyte (3,072-gigabyte) large memory nodes of PSC’s new Bridges system were what they needed.
With Blood’s help adapting their software to the Bridges environment, the large memory nodes did the trick: Assembling the flycatcher genome, which had crashed other systems, finished in only 6.6 hours. That’s at least five times faster than the failed assembly had been going, and much faster than they expected. Another surprise was that the rhino genome assembly went faster as well, with more than three times as much sequence assembled in 11 hours.
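One way to read "went faster as well" is in terms of bases assembled per hour. The sketch below does that arithmetic using only the genome sizes and run times quoted in this article. It assumes, as a simplification, that the amount of sequence assembled is roughly the genome size; actual run time depends on the number of reads and other factors the article does not report.

```python
# Figures quoted in the article.
FLYCATCHER_BASES = 1_000_000_000
RHINO_BASES = 3_300_000_000
FLYCATCHER_HOURS = 6.6
RHINO_HOURS = 11.0


def bases_per_hour(genome_bases, hours):
    """Crude throughput: genome size divided by wall-clock assembly time."""
    return genome_bases / hours


flycatcher_rate = bases_per_hour(FLYCATCHER_BASES, FLYCATCHER_HOURS)
rhino_rate = bases_per_hour(RHINO_BASES, RHINO_HOURS)

print(f"Flycatcher: {flycatcher_rate / 1e6:.0f} million bases per hour")
print(f"Rhino:      {rhino_rate / 1e6:.0f} million bases per hour")
print(f"Rhino rate is {rhino_rate / flycatcher_rate:.2f}x the flycatcher rate")
```

By this crude measure the rhino assembly ran at roughly twice the flycatcher's rate, even though it took more hours in total.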
Now that the scientists have their data, they can begin analyzing it to answer a number of critical evolutionary questions: What genes contribute to a species adapting from one environment to another, and how does long-distance migration evolve? What traits contribute to a species being more or less resilient to climate change? Does looking at the entire genome at once give a fuller picture of why some species survive while others don’t? And can we use that knowledge to save more wild species?