Structure of Proteins and DNA
Watching a Protein Fold
With a tidal wave of sequence data starting to come in, what's crucial is computational resources.

In 1953, Max Perutz and John Kendrew of Cambridge University did something no one had ever done. They figured out the structure of a protein, two related proteins actually: hemoglobin, which carries oxygen in the blood, and myoglobin, which stores oxygen in muscles. When their work, which took nearly 20 years, won the 1962 Nobel Prize for Chemistry, the presenter called it "the opening of the door" to a new world of understanding living organisms.

In 1998 that opened door looks like a floodgate. Since hemoglobin and myoglobin, research has steadily added to the knowledge base of protein structures, and after 45 years scientists have solved about 7,500 proteins, each representing months or years of work. The human genome has about 100,000 protein-coding regions (DNA sequences that are blueprints for the cell to manufacture proteins), which means there are more than 90,000 proteins left to solve, and that's only the human genome. In the next few years, this sequence data will become available. How can structural biologists keep pace? Solving the protein structures is a major part of realizing the potential that genome data offers for advances against disease, but simple math tells you that, unless something changes, this work could take halfway to the next millennium.
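
The "simple math" can be spelled out in a few lines. This is only a back-of-the-envelope sketch: it assumes the historical average rate (about 7,500 structures in 45 years) stays constant, which is precisely the assumption the article expects to break. The function name is illustrative.

```python
def years_to_finish(solved, years_elapsed, total):
    """Years needed to solve the remaining structures at the historical average rate."""
    rate_per_year = solved / years_elapsed
    return (total - solved) / rate_per_year


if __name__ == "__main__":
    # Figures quoted in the article: about 7,500 structures solved in 45 years,
    # about 100,000 protein-coding regions in the human genome.
    print(round(years_to_finish(7_500, 45, 100_000)))
```

At that rate the remaining structures would take roughly 555 years, which is where "halfway to the next millennium" comes from.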

"The challenge as a result of the genome mapping projects of the last and present decade," says Charles Brooks of the Scripps Research Institute, "is to deduce protein function starting from genomic sequence." This emerging field of activity is called "structural genomics," and Brooks has a keen appreciation for what's involved. Over a 20-year period — at Harvard, Carnegie Mellon and Scripps — he's played a lead role in using computational simulations to better understand the complex relations between sequence and structure. His work on the protein-folding problem, as this relation is known, has helped to build theory that explains protein folding in terms of the energy changes of the atom-to-atom interactions.

Two partially folded states and the native state (right) of a segment of streptococcal protein G. This segment's 3D structure consists of an alpha helix (purple) between two beta sheets (yellow).

While Brooks and others have begun to create bridges between genome data and protein structure, at this juncture, with a tidal wave of sequence data on the way, what's crucial is computational resources: "This work presents challenges on many levels," says Brooks, "and the complexity of the simulations involved outstrips current computational infrastructure."

One of the most important contributions simulations can make is to explore the "energy landscape" of a protein, a map of how the interaction energy of the atoms changes as the protein changes shape, from unfolded to partially folded and, finally, to the "native" state. In a series of large computational studies, Brooks has mapped several of these landscapes. The precise information they provide extends what's learned from laboratory work and brings new insight to the relations between sequence and structure, the kind of insight that's critical for structural genomics.

Structural Genomics and Forcefield Models

Computational simulations that translate between a protein's sequence and its structure are based on a simple principle: a molecule wants to be at rest. In other words, a protein adopts as its native state the shape, out of all possibilities, that involves the least energy expenditure to hold the atoms in place. With this principle as guide, researchers can verify computational models by computing the lowest-energy state for proteins of known structure.
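
As a rough illustration of this verification step, the sketch below uses a textbook Lennard-Jones pair term as a stand-in energy function and checks that a "known" structure scores lower than a few alternatives. Everything here is an assumption made for illustration: the parameter values, the three-atom "protein," and the helper names. A production model such as CHARMM uses a far richer energy expression with many more terms.

```python
import math

EPSILON = 1.0  # well depth, arbitrary energy units (hypothetical value)
SIGMA = 1.0    # separation at which the pair energy crosses zero (hypothetical)


def pair_energy(r):
    """Lennard-Jones energy of two atoms separated by distance r."""
    s6 = (SIGMA / r) ** 6
    return 4.0 * EPSILON * (s6 * s6 - s6)


def pair_force(r):
    """Force along the separation, -dE/dr (positive means repulsive)."""
    s6 = (SIGMA / r) ** 6
    return 24.0 * EPSILON * (2.0 * s6 * s6 - s6) / r


def total_energy(coords):
    """Sum the pair energy over every distinct pair of atoms."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            energy += pair_energy(math.dist(coords[i], coords[j]))
    return energy


def native_is_lowest(native, decoys):
    """True if the known structure has lower energy than every alternative."""
    e_native = total_energy(native)
    return all(total_energy(d) > e_native for d in decoys)


def triangle(side):
    """Three atoms at the corners of an equilateral triangle."""
    return [(0.0, 0.0, 0.0), (side, 0.0, 0.0),
            (side / 2.0, side * math.sqrt(3.0) / 2.0, 0.0)]


if __name__ == "__main__":
    r_min = 2.0 ** (1.0 / 6.0) * SIGMA   # separation that minimizes pair_energy
    native = triangle(r_min)             # stands in for a structure known from experiment
    decoys = [triangle(0.95 * r_min), triangle(1.2 * r_min), triangle(2.0 * r_min)]
    print(total_energy(native), native_is_lowest(native, decoys))
```

The same energy expression supplies the forces (its negative gradient) that move a simulated structure from one shape to another, which is why a check like this tests the whole model and not just a scoring rule.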

"Key to the success of any computational method that aims to provide sequence to structure predictions," says Brooks, "is that the energy function used to represent the biological system yields the native structure of known proteins as its lowest free-energy state." These computations verify the model's "forcefield" — the mathematical expression of energy relations between the atoms, which the model uses to compute the forces acting on the protein's structure, moving it from one folded shape to another. Researchers have carried out such verifications for a handful of proteins, but to deal with the information explosion of structural genomics, says Brooks, much more must be done.

In structural genomics, these forcefield models — such as CHARMM (Chemistry at HARvard Macromolecular Mechanics), widely used protein simulation software Brooks helped develop — could fill the gaps left by other computational methods. In an approach called "inverse folding," relatively low-resolution models, which run much more quickly and cheaply than detailed forcefield models, can in effect search the database of known structures, comparing sequences with recurrent folding patterns, to produce structures that are more or less in the ballpark (within three to five angstroms deviation from the native state).

Starting with these low-resolution structures, forcefield models can then refine the data, with the aim of producing structures as accurate as those now produced with X-ray crystallography and NMR methods (accurate to better than one angstrom). "Only once we've achieved such resolution," notes Brooks, "can we be confident in using these structures for drug discovery and to assess the protein's biological function."
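
Deviations like these are usually reported as a root-mean-square deviation (RMSD) between matching atoms of two structures. The sketch below is a simplified version: it removes only the overall translation by centering both structures, and it assumes the two are already rotated into the same frame. A full comparison would also find the best rotation before measuring. The function names are illustrative.

```python
import math


def centroid(coords):
    """Average position of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(p[k] for p in coords) / n for k in range(3))


def rmsd(model, reference):
    """RMSD in the input units (angstroms here) after removing translation only."""
    if len(model) != len(reference):
        raise ValueError("structures must have the same number of atoms")
    cm, cr = centroid(model), centroid(reference)
    total = 0.0
    for p, q in zip(model, reference):
        # Compare each atom's position relative to its own structure's centroid.
        total += sum(((p[k] - cm[k]) - (q[k] - cr[k])) ** 2 for k in range(3))
    return math.sqrt(total / len(model))


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    model = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    print(rmsd(model, reference))
```

On this scale, an inverse-folding result would score somewhere between three and five, and a refined structure would need to come in under one.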

The critical test for a forcefield model like CHARMM is to map how the energy of a protein changes over the course of folding. Brooks' research group has carried out several such studies. Earlier work investigated an all-helical protein. In recent computations — using Pittsburgh Supercomputing Center's CRAY C90, CRAY T3D and CRAY T3E — Brooks and Felix Sheinerman mapped a segment from a small protein, streptococcal protein G, with different structural features, a combination of a helix and a folded sheet.

Free Energy Landscape of Protein G
This graphical representation depicts the "free energy landscape" of protein G, from computations by Brooks and Sheinerman. The protein's radius of gyration (vertical), a measure of how far it spreads out from its center, is plotted against the fraction of its interactions that correspond to the native state (horizontal). Color contours show free energy, from the lowest (deep blue), at the native state, through less favorable values (blue to pink to red to orange to yellow to green). The L-shape of this landscape indicates a folding process of early collapse followed by a protracted "search" through possibilities to arrive at the native structure.
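
For readers who want to see how the two axes and the color scale of such a plot could be computed, here is a minimal sketch. It is not the authors' procedure: it assumes equal atomic masses, a simple distance cutoff to decide which interactions count as "native," and the standard relation F = -kT ln P between how often a region is visited and its free energy. The cutoff, bin widths, and function names are all hypothetical.

```python
import math
from collections import Counter


def radius_of_gyration(coords):
    """Root-mean-square distance of the atoms from their centroid (equal masses assumed)."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in coords) / n)


def contacts(coords, cutoff, min_separation=3):
    """Pairs (i, j) at least min_separation apart in sequence and closer than cutoff."""
    return {(i, j)
            for i in range(len(coords))
            for j in range(i + min_separation, len(coords))
            if math.dist(coords[i], coords[j]) < cutoff}


def fraction_native(coords, native_contacts, cutoff):
    """Fraction of the native contacts that are also formed in this conformation."""
    formed = sum(1 for i, j in native_contacts
                 if math.dist(coords[i], coords[j]) < cutoff)
    return formed / len(native_contacts)  # assumes at least one native contact


def free_energy_surface(samples, q_bin=0.1, rg_bin=1.0, kT=1.0):
    """Bin (Q, Rg) samples; F = -kT ln(P / P_max), so the most visited bin sits at 0."""
    counts = Counter((math.floor(q / q_bin), math.floor(rg / rg_bin))
                     for q, rg in samples)
    top = max(counts.values())
    return {cell: -kT * math.log(n / top) for cell, n in counts.items()}


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
    extended = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    native_pairs = contacts(native, cutoff=1.5)
    for shape in (native, extended):
        print(fraction_native(shape, native_pairs, 1.5), radius_of_gyration(shape))
```

Each sampled conformation contributes one (fraction native, radius of gyration) point. Regions visited often come out low in free energy, and regions visited rarely come out high.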

The Energy Landscape of Protein G

Although explorations of a protein's energy landscape are very large computational problems, they represent an alternative to even more costly simulations that track the protein over the complete time span of its folding, tens to hundreds of milliseconds.
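
One way to picture how a landscape can be mapped without following a complete folding event: split the folding coordinate into small overlapping windows, sample each window thoroughly, and join the pieces where they overlap. The toy sketch below does this for a one-dimensional double-well energy using Metropolis sampling with hard walls at the window edges. It is a deliberately simplified stand-in, not the method used in the protein G study; real calculations work with all-atom models, use restraining potentials in place of hard walls, and combine windows with more careful statistics. The energy function, window sizes, and names are assumptions.

```python
import math
import random

KT = 1.0  # all energies are expressed in units of kT


def potential(x):
    """Toy double-well energy along a one-dimensional folding coordinate."""
    return 3.0 * (x * x - 1.0) ** 2


def sample_window(energies, lo, hi, n_steps, rng):
    """Metropolis sampling confined to bins lo..hi-1; returns visit counts per bin."""
    counts = [0] * (hi - lo)
    pos = lo
    for _ in range(n_steps):
        trial = pos + (1 if rng.random() < 0.5 else -1)
        if lo <= trial < hi:  # moves that leave the window are rejected
            delta = energies[trial] - energies[pos]
            if delta <= 0.0 or rng.random() < math.exp(-delta / KT):
                pos = trial
        counts[pos - lo] += 1
    return counts


def stitched_profile(energies, width, overlap, n_steps, seed=0):
    """Sample overlapping windows and join them into one free-energy profile."""
    rng = random.Random(seed)
    n_bins = len(energies)
    profile = [None] * n_bins
    start = 0
    while True:
        lo, hi = start, min(start + width, n_bins)
        counts = sample_window(energies, lo, hi, n_steps, rng)
        # Within a window, free energy is known only up to an additive constant.
        local = [-KT * math.log(max(c, 1)) for c in counts]
        shared = [b for b in range(lo, hi) if profile[b] is not None]
        shift = 0.0
        if shared:  # fix the constant by matching bins already covered
            shift = sum(profile[b] - local[b - lo] for b in shared) / len(shared)
        for b in range(lo, hi):
            if profile[b] is None:
                profile[b] = local[b - lo] + shift
        if hi == n_bins:
            break
        start += width - overlap
    lowest = min(profile)
    return [f - lowest for f in profile]


if __name__ == "__main__":
    xs = [-1.3 + 0.1 * i for i in range(27)]
    profile = stitched_profile([potential(x) for x in xs], width=6, overlap=3, n_steps=50_000)
    for x, f in zip(xs, profile):
        print(f"{x:5.1f}  {f:5.2f}")
```

Because each window spans only a small energy range, the sampler visits every bin in it many times, including those near the barrier that an unconstrained run would rarely reach. That is the sense in which confined sampling converges faster than waiting for the system to cross on its own.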

"There's two ways to think about solving the folding problem," says Brooks, "one is the brute force approach that lets the protein fold on its own time scale. The other way is to divide the space from unfolded to folded states into regions, then to sample within this much smaller region and connect the pieces to give a picture of the energy cost for folding. Because you confine the space into smaller chunks and then sample to convergence within each of those regions, you can — more rapidly than direct folding — get statistically significant information about the mechanism and thermodynamics of folding."

Brooks and Sheinerman's protein G computation involved about 100 gigabytes of data and the equivalent of a month of computing on 512 processors of the CRAY T3E. The results add to experimental knowledge about this protein and give a deepened interpretation of the early stage of folding. Experiment showed that part of the protein collapses quickly, and the simulations confirmed this. The experiment, however, suggested that these early folding interactions formed a single core that nucleated the rest of the folding reaction; the simulation suggests otherwise.

"We're able to look in detail," says Brooks, "at all the family of structures that, if you will, live in that region of the energy landscape. What we find is that those points of interaction identified in the experiment as a single nucleus form as several different nucleating points." By providing deeper insight, the simulation points the way to further experimental studies — underway now — that can more precisely characterize the folding interactions.

Such energy landscape computations, says Brooks, are beginning to show connections between short-range interactions, such as those in helical regions, and longer-range interactions, such as those that dictate a sheetlike fold. Unlike the helical structures, which fold through stages spread out fairly evenly over time, the sheet regions first collapse and then try many possible collapsed states to arrive at the final "correct" structure, a process represented by the L-shape of protein G's energy landscape.

Insights into the mechanisms of folding like this are key to structural genomics, but the enormous cost in computational resources of these simulations has held back progress. "Our understanding in this area is nascent," says Brooks. "We need to significantly expand the number of proteins we've studied computationally, and we need resources to simulate larger proteins. Infrastructure for the investigation of protein-folding free-energy landscapes must be increased or developed immediately if we're to make a timely impact in the quest to go from sequence to biochemically relevant structure."

Researchers: Charles L. Brooks III, Scripps Research Institute.
Hardware: CRAY C90, CRAY T3E
Software: CHARMM.
Related Material on the Web:
The Scripps Research Institute
The Brooks and Hirst Group
Projects in Scientific Computing: How Proteins Get In Shape.
Projects in Scientific Computing: New Twists In Globs and Zippers.
Projects in Scientific Computing, PSC's annual research report.


© Pittsburgh Supercomputing Center (PSC)
Revised: July 15, 1998