XSEDE Resources, Trinity Enable Non-Human Primate Reference Transcriptome Resource to Support Study of Genes in Our Closest Relatives
Dec. 6, 2013
In the card game “Concentration,” you place the 52 cards in a deck face down on a table. You turn one over; then you turn over another, with the intent of matching the numbers. If they don’t match, you turn them face down again. Then you repeat, trying to find the matches.
Memory is paramount. When you see a “6” card, you have to remember where you last saw that number.
Now imagine a game of Concentration in which the task involves not only 3 billion cards, but also the multiple ways they can be strung together to make winning poker hands. And imagine what kind of memory you’d need to find the matches.
Thanks in part to XSEDE resource Blacklight’s best-in-world shared memory and the Pittsburgh Supercomputing Center (PSC) Data SuperCell’s ability to store and move huge amounts of data in a fluid and accessible way, the laboratory of Christopher Mason, assistant professor in the departments of Physiology and Biophysics and the Institute for Computational Biomedicine, Weill Cornell Medical College, and colleagues have spearheaded the first repository of the active genes in 13 nonhuman primates. The effort has been led by Lenore Pipes, an NSF graduate research fellow and student of Mason and Adam Siepel, associate professor of computational biology, Cornell University.
Reported in a January 2013 Nucleic Acids Research paper, the Non-Human Primate Reference Transcriptome Resource (NHPRTR) provides an electronic infrastructure to support researchers who are sequencing, comparing, and trying to understand genes in mankind’s closest relatives.
Getting the geography down
Understanding the genomes of the nonhuman primates — great apes such as chimpanzees, old world monkeys, new world monkeys and more primitive prosimians, such as lemurs — is important for understanding ourselves in health and disease, Mason explains.
“Nonhuman primates are widely used models in pharmaceutical research,” he says. “Also, understanding their genomes allows us to answer evolutionary questions about how the human genome came to be.”
But the state of the field in primate genome research is uneven, depending on which species you look at.
“Not all primates’ genomes have been sequenced,” Mason says. Chimps and rhesus monkeys have been sequenced — but not enough times to ensure good proofreading. “Even for those that have been sequenced, there is an incomplete or poor annotation of what genes are present in these species.”
Annotation is critical. If the genome were a map, for example, the DNA sequence would represent lines for the roads and circles for the cities. Annotation is the process of putting labels on that map.
“If you took a normal map but removed all the names of the cities, you wouldn’t know where Philadelphia is, where Pittsburgh is,” Mason says. “You would have no sense of geography. Annotation lets you navigate the human and primate genomes in the same way that an annotated map lets you navigate the U.S.”
Uncovering the cards
The initial work for the NHPRTR consisted of obtaining the sequences for the active genes in 13 primate genomes.
In the cell, DNA is the master copy of the genetic material, encoding the blueprints for making the cell’s components in a series of bases: A, T, G and C. In order to express a gene, the cell copies its DNA sequence to RNA, a molecule closely related to DNA. The cell translates some of these RNA copies into proteins, the main actors in the cell. Other RNAs carry out specific functions on their own.
The process of copying a gene’s DNA code into RNA is called transcription. The transcriptome is the collection of RNAs that are being expressed in an organism’s living tissues.
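As a toy illustration of transcription, the short Python sketch below turns a DNA sequence into its RNA copy. It assumes the input is the gene's coding strand, so the RNA has the same bases with U in place of T; the function name and the example sequence are made up for illustration and are not part of the NHPRTR pipeline.

```python
def transcribe(coding_strand):
    """Return the RNA copy of a DNA coding strand (assumption: T becomes U)."""
    dna = coding_strand.upper()
    # Reject anything that is not one of the four DNA bases.
    if set(dna) - set("ATGC"):
        raise ValueError("expected only the bases A, T, G and C")
    return dna.replace("T", "U")


# Hypothetical 10-base example sequence.
print(transcribe("ATGGCGTGCA"))  # AUGGCGUGCA
```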
The process of reading the RNA code requires fairly short segments: about 100 bases, for optimum efficiency. Because a transcriptome is so much larger than that, researchers must first cut the RNA into small, overlapping fragments. Once they have the sequences of these fragments (about 600 million of them for a typical species, though some of the primate group’s analyses looked at as many as 3 billion), the researchers can then reconstruct the entire transcriptome sequence by matching where the segments overlap. It’s much like playing a game of Concentration that follows all the ways that 3 billion cards could make a winning hand.
But it gets harder. Many of the sequences are nearly, but not completely, identical. In addition, the sequences inevitably contain some errors, which have to be corrected by covering each bit of the sequence multiple times. It’s hard to piece imperfect, redundant and nearly identical bits together in the proper order. In order to get it right, the scientists must construct a chart of possible ways of stringing them together, testing each possibility in turn. These charts are called De Bruijn graphs.
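To make the idea of a De Bruijn graph concrete, here is a minimal Python sketch, not Trinity's actual implementation. It chops a few made-up reads into overlapping words of length k, links each word's first k-1 letters to its last k-1 letters, and then follows the graph for as long as there is only one way forward. The reads, the value of k and the function names are all assumptions chosen for illustration; real assemblers must also handle sequencing errors, branches and repeats.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a sequence from `start` while exactly one distinct next node exists."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or a branch: stop rather than guess
        node = successors.pop()
        if node in visited:
            break  # avoid looping forever around a cycle
        visited.add(node)
        contig += node[-1]
    return contig


# Three hypothetical overlapping reads from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous_path(graph, "ATG"))  # ATGGCGTGCA
```

Holding the whole graph in one dictionary is what makes each lookup fast in this sketch, which is the same reason a large pool of memory matters when the graph is built from billions of reads.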
Blacklight and Data SuperCell: All about the memory
With help from XSEDE Extended Collaborative Support Service (ECSS) staff at PSC, that center’s Blacklight system sped up the calculation behind the matching considerably, Pipes says. A traditional, “distributed memory” supercomputer would have solved the problem by raw speed only, essentially uncovering each “card” (a possibility in the De Bruijn graph) one at a time, then turning it face down to check another. Blacklight’s massive shared memory, though, made that unnecessary: the machine was able to keep many possibilities in its memory at once, allowing for far more rapid matching.
PSC’s Data SuperCell also allowed the researchers to use their massive amounts of data efficiently, making it available in large chunks with minimal retrieval delay.
Running the problem on Blacklight required ECSS staff to work with the developers of the De Bruijn graph software, Trinity, to optimize its performance on the new machine.
“The primate NHPRTR study was the biggest thing anyone had tried to do with Trinity,” says Philip Blood, a PSC senior scientific specialist who helped the researchers get Trinity running on Blacklight. Getting the program to work on the new platform required painstaking trial and error.
Annotation: Table of contents for expressed genes
Some primate genomes have been sequenced already. More are likely to come as the technology becomes quicker, easier, and cheaper.
“Sequencing DNA is really only the first step in understanding a genome,” Mason says. “You need to know what’s expressed, what’s active” from a DNA sequence, in the RNA, to make sense of how an animal’s genome works, he adds. “Our transcriptome maps create the first catalog of functional, active elements in a given genome. That is the first essential step of delineating the molecular recipe that defines the synthesis of the entire organism, from one cell to the trillions of cells in an adult.”
The transcriptome contains the sequences of all the active genes in a given cell or tissue. By comparing the sequences in the transcriptome to those in the genome, researchers can tell which genes are active. It’s a particularly important question in understanding why humans and chimpanzees are different. The two species’ DNA is very similar: 96 percent of the sequences are identical. Because of that, many researchers suspect that the two species may express different genes at different times in specific tissues, painting a very different picture out of an almost identical palette of colors. This is the sort of question that the NHPRTR will help primate researchers to tackle, Pipes says.
At the moment, the NHPRTR represents 13 key primate species identified by researchers in the field. In the next stage of the research, they will focus on deep sequencing of matching tissues from various individuals, including the molecular characterization of the various brain regions across these animals. They’ll also look at the evolution of gene structure, alternative splicing and the catalog of species-specific genes. Taken together, these data not only inform the functional map of genes for these primates, but ultimately help answer the age-old question at a genetic level, “What makes us human?”