
PSC DIRECTORS TALK ABOUT THE NEW TERASCALE COMPUTING SYSTEM

By Christopher Rogala, Managing Editor, HPCwire

Originally published in HPCwire (http://www.hpcwire.com), April 13, 2001.

The Terascale Computing System at Pittsburgh Supercomputing Center has officially begun serving scientists and engineers nationwide (see article 19887 "Initial Terascale System Enters Production Mode At PSC" in this week's HPCwire). HPCwire interviewed PSC Scientific Directors Michael Levine and Ralph Roskies about the characteristics and capabilities of the TCS. The following is the text of the interview.

HPCwire: What distinguishes the TCS from other supercomputer systems? What unique capabilities does it have?

LEVINE: The TCS lies between what you could call "purpose-built" systems, which are designed from the ground up to be supercomputers, and, at the other extreme, relatively simple, glued-together clusters. Distinguished from both of these, the TCS is a combination of hardware and system software designed for a large market that includes high-end workstations, yet it still provides most of the reduced-cost benefits of commodity technology. The development is partly underwritten by others. We get the benefit of very powerful nodes that were not built specifically for scientific computing but are the best computational nodes around. Then we use them with an interconnect and overall system software that are designed for large systems and that already have a substantial amount of engineering and development work behind them.

Because of the way this system is constituted, there isn't any single capability that you cannot find someplace else. But we think this is the best blend; this is where it has all been brought together. Someone might say, for example, "What about custom-built processors?" There are more powerful processor options, but they carry a huge cost. While no single capability of this system is unique, it brings together full software support, extremely powerful processors in extremely strong nodes, and a very low-latency, high-bandwidth interconnect. At an overall system level, the software binds all this together so that users can deal with it as a single entity.

ROSKIES: Beyond this, the TCS is distinguished from most other systems by its scale. The scale confers unique capabilities on the system. It will be the most powerful system in the world dedicated to open scientific research. It will enable computations to be done much faster than on most other systems, and this will transform the research paradigm in several fields. Of course, the scale presents unique challenges in fault tolerance, which we have conscientiously and creatively addressed.

HPCwire: Why did PSC choose to work with Compaq to develop this system?

ROSKIES: In our discussions with users, they ranked single-processor speed as the most important criterion, and in the overwhelming majority of benchmarks we considered, the Compaq Alpha technology outperformed the other systems we were considering. The other important thing about Compaq is that they clearly demonstrated their interest in working with us and with the research and computer-science communities to add value to what they already had, to make this a system that meets the needs of the high-performance community.

LEVINE: There are other vendors with quality technology that seemed not to evince any pronounced interest in this project beyond providing technology. But nobody has commodity technology that has already been "productized" and tested at the scale we need. Compaq clearly said "We want to work on this." This is one of the reasons we went to Houston and visited Michael Capellas, Compaq's president and CEO. He participated by phone in the NSF site visit, at which he told the review panel that Compaq is committed to the TCS initiative and will make it work. They're with us in recognizing that there's pent-up demand for this system — the nation's researchers need it, and we need to get it out on schedule.

HPCwire: What are the benefits of using the Tru64 operating system (Compaq's version of Unix)? Are there any potential drawbacks or limitations?

LEVINE: The benefit of using the Tru64 version of Unix is that there's a company behind it. You used to hear that to bring out a new machine you had to have at least one innovation, otherwise why bring out a new machine? But if you have too many innovations, you severely reduce the possibility that you will succeed in the right time frame. So there is a Goldilocks factor of getting it just right. Tru64 is a production-tested operating system that has already been integrated with the rest of the system software. There is a major corporation with major development capabilities standing behind it, not only as it is but as it needs to be changed to accomplish this task, and that is happening.

ROSKIES: Whatever operating-system base we chose to go with, we would be likely to run into drawbacks and limitations. We think there are relatively few drawbacks and limitations here, and, what is more important, we have people with the resources to fix them should that be needed. We want this system to be at, or even a little beyond, the bleeding edge, but on the other hand, we want it to work on schedule, because researchers need it. They're standing in the wings ready to use it, and the initial system is already loaded with work.

HPCwire: The August 3, 2000 news release concerning the award states that this system's architecture "pushes beyond simple evolution of existing technology." What does this mean?

LEVINE: As we've said, the TCS is a system that in scale alone pushes beyond where this hardware has been before or would have gone without this award -- and other Compaq awards (for systems at Los Alamos and CEA in France). Beyond that, we are working with Compaq to make serious changes in overall system architecture, capability and even functionality. There is a large customer base for Compaq servers, which gives us the benefit of economies in development cost, but much of this development is targeted at the commercial sector, which has different requirements. So we're working closely with Compaq to refocus this technology on high-performance technical computing. This involves, for instance, refinements to the I/O structure, integration of hierarchical storage to effectively meet the requirements of scientific visualization, and other improvements.

HPCwire: Are the major advances brought about by this system's development mainly in the precision and accuracy of measurements? Or have entirely new doors been opened for research?

ROSKIES: There is sometimes a perception that advances in computational capability merely make it possible to refine the same computations that we do over and over, as if computational science were like polishing brass and we don't really get anywhere. This is not at all the case. We are going through a revolution, or a series of revolutions. As we increase the strength of computational platforms, not only in raw computational power but also in communications and in memory, the amount of data a system can hold and manipulate comfortably, we keep opening up new fields that were not feasible areas of work before.

If you look back to when the NSF supercomputing program began, a little more than 15 years ago, the technological progress is remarkable. Our top-performing machine before the TCS, the CRAY T3E, is 500 times faster with 1,000 times the memory of our first system, the CRAY X-MP. You can look at our web site (http://www.psc.edu/science/) to get a sense of some of the research opened up as a result over that span of time. In the biomedical area alone, the picture has changed dramatically. In structural biology, computational work was feeling its way then, and it's now making important contributions. Functional MRI work on mapping the brain was not underway 15 years ago, and now, with supercomputing and high-performance networks, we can do multi-modal, real-time imaging that involves complicated transformations of raw MRI data. This is still confined to research, but it will lead to clinical applications. In Earth science, because of computation, we're understanding the geodynamo, why the Earth's magnetic field reverses, for the first time. And we're on the verge of reliable storm-scale weather forecasts, which you couldn't conceive of as being possible at all without these computational advances.

From the T3E to the TCS is another big step — 12 times more computational power, more than 40 times the memory. We're already seeing major new work, large-scale computations that push the envelope. We expect to see important new work with very practical consequences in the area of power generation, where the simulation technology is ready to be a design tool that will improve the efficiency of power turbines, and that requires this new level of computational capability. We're going to see new work too in dynamic visualization technologies, event re-creation and simulated reality.
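Taken together, the two steps quoted above compound. A minimal arithmetic sketch in Python of the cumulative growth from the CRAY X-MP to the TCS, using only the round factors cited in this interview (so the products are rough orders of magnitude, not measured figures):

    # Cumulative scaling from the CRAY X-MP to the TCS, using only the
    # round factors quoted in the interview; illustrative arithmetic only.
    xmp_to_t3e_compute = 500      # T3E vs. X-MP, computational power
    xmp_to_t3e_memory  = 1_000    # T3E vs. X-MP, memory
    t3e_to_tcs_compute = 12       # TCS vs. T3E, computational power
    t3e_to_tcs_memory  = 40       # TCS vs. T3E, memory ("more than 40 times")

    total_compute = xmp_to_t3e_compute * t3e_to_tcs_compute   # ~6,000x
    total_memory  = xmp_to_t3e_memory * t3e_to_tcs_memory     # ~40,000x
    print(f"X-MP -> TCS: ~{total_compute:,}x compute, ~{total_memory:,}x memory")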

HPCwire: What do you foresee as the practical benefits of the research to be done using this system?

LEVINE: To some extent, I think we've already addressed this. In power turbines, for example, a very small improvement in efficiency translates to significant reductions in the cost of power, easily billions of dollars over years. In general, the immediate, direct beneficiaries of this system will be academic scientists, but in the long run the primary benefits will flow to the country as a whole, in practical ways we can't really forecast. We know that the span from basic research to practical impact is in the range of ten years. We know that there's a major impact on the economy. And we know that U.S. leadership in basic research is a key factor in our economic strength. The President's Information Technology Advisory Committee addressed these questions, and a major finding of their report, the PITAC report (http://www.ccic.gov/ac/report/), released in 1999, was that we need a system like the TCS to increase long-term, fundamental research.

HPCwire: This system is part of the Partnership for Advanced Computational Infrastructure (PACI) program. What are the goals of PACI? How is PSC furthering those goals?

ROSKIES: The goals of the PACI program are identified at the web site (http://www.interact.nsf.gov/cise/descriptions.nsf/pd/paci/). In a nutshell, "PACI provides the foundation for meeting the expanding need for high-end computation and information technologies required by the U.S. academic community." The TCS is a high-end system that is part of the portfolio of resources which PACI provides. It is designed to be a balanced system in terms of processor speed, memory, communication and storage systems, and it supplements the capabilities that continue to be available through the PACI partnerships centered at NCSA and SDSC. PSC is also working with NCSA and SDSC to provide a common interface, wherever possible, to resources at all the PACI sites, and to coordinate things like user support and training. PSC will also deploy tools that are being developed by the PACI development teams.

HPCwire: Will PSC be collaborating with any other supercomputer centers?

LEVINE: High-performance computing has always been a partnership effort, and in addition to the collaborative spirit of the PACI program, our work on the TCS has benefited from good working relationships we have with many of the national laboratories, including Sandia, the National Energy Technology Laboratory, Lawrence Livermore, Oak Ridge and Los Alamos as well as other university-centered sites here and abroad.

ROSKIES: We've also worked closely with the national computer science community, who have provided support and guidance at many stages. We've had wonderful support from Jim Morris, who heads the School of Computer Science at Carnegie Mellon, and who has served as a liaison for us to others who have provided invaluable input. We have a TCS computer science advisory group that has been a sounding board and source of ideas.

HPCwire: What projects have been planned/initiated so far?

ROSKIES: As we speak, our "friendly user" period is winding down, and we're pushing the usage level on our initial TCS system of 256 processors to the 90 percent level. Our schedule with NSF called for "production" to begin by February 1, but this machine performed well enough during testing, with virtually zero downtime, that people were doing production research almost immediately after we officially accepted it from Compaq on December 22. NSF has been pleased and acknowledged as much in its news release (Jan. 29).

Our friendly users, who are now exploiting the initial TCS, are researchers from across the country who are taking advantage of this availability to port their codes and push their work forward. Among these projects, we've worked with Mike Norman's cosmology group at the University of California, San Diego to enable a whole-machine, multi-day run that is expected to be the most realistic simulation yet undertaken of star formation in galaxies. We've also been working with several members of Jaron Lanier's Tele-Immersion collaboration, who plan to prototype their software on the initial TCS. Klaus Schulten at the University of Illinois has done large molecular-dynamics simulations to show how membrane tension allows organisms to sense touch and sound. Doyle Knight from Rutgers used the TCS to show that the Large-Eddy Simulation approach can accurately predict the flowfield of a supersonic expansion-compression corner.

Other major topics in the friendly-user period have been coupled structures and fluids, general relativity, quantum chemistry, computational fluid dynamics, molecular dynamics, QCD, and materials science.

HPCwire: In your opinion, is the development of supercomputing likely to continue along current lines? Or will development in other areas (networking, wireless communication) lead to a major change in the way that high-performance computing is carried out?

LEVINE: In recent years high-performance computing has been able to realize unusual gains in capability by clever and constructive use of technologies -- in many cases, commodity technologies -- designed for other uses. By extension, we can expect benefits from the other things people are working on, such as wireless. Wireless telephones and the wireless web are growing by leaps and bounds now. Both the technology that goes into those things, very low-power processors, for instance, and the services around them will have analogs in computational science. These benefits accrue not only in the base technological support for high-end computational engines but also, just as importantly, in what that support provides to the intellectual enterprise of science and everything that flows from it.

With the TCS, we're currently benefiting from technology that comes from enterprise computing -- for business and e-commerce -- and we're living off that, but that doesn't mean that there are not additional benefits to be gained from new technologies and advanced architectures targeted specifically at high-performance computing. The underlying technologies for these Compaq boxes include a great deal that's not normally visible -- the fabrication facility, the design facility, the solid-state physics -- which enables them to build Alpha chips. These technologies may provide some fruitful avenues for advancing high-performance technical computing, but maybe not simply on the basis of packaged products. There may be different ways of building machines, not necessarily radically different -- such as, say, pieces of the circuits of the Alpha that can be used independently. The next generation of processors is rarely a total change from what has gone before.

If you look at networking, the current growth area is research in what is called WDM, wavelength division multiplexing. People are finding it practical now to increase the amount of information on an individual fiber by 10 to 100 times, and maybe in the end 1,000 times, using, in effect, signals of different colors of light. It's like radio. There are many stations. Each one has a certain bandwidth and transfers information, and you can tune in to different ones. Similarly, you can send light in different colors on these fibers and multiply the amount of data you can send on a given number of fibers.
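As a rough illustration of that multiplication, here is a small Python sketch of aggregate fiber capacity under WDM. The 10 Gb/s per-wavelength rate is an assumed, illustrative number, not a figure from the interview; the 10x, 100x and 1,000x channel counts echo the factors Levine quotes above.

    # Illustrative WDM arithmetic: one fiber's aggregate capacity is the
    # per-wavelength data rate times the number of wavelength channels.
    # The 10 Gb/s per-channel rate is an assumption for illustration only.
    per_channel_gbps = 10

    for channels in (1, 10, 100, 1_000):
        aggregate_gbps = per_channel_gbps * channels
        print(f"{channels:>5} wavelengths -> {aggregate_gbps:,} Gb/s on one fiber")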

Not only can this technology vastly improve our ability to ship huge amounts of data across thousands of miles, but also it can ease the problems of building huge machines. The TCS-1 will be knit together with over 3,000 long cables, each about the diameter of your little finger. Increasing the bandwidth of those connections while reducing their size by a factor of 20 would certainly help. So the use of that technology, first of all, for long-distance fiber and then for other things will provide the economic base for R&D, and then high-performance computing will be opportunistic, as it has been in the past.

NOTE: For information on applying for time on the TCS, see http://www.psc.edu/grants/paci.html .

Copyright 1993-2001 HPCwire. Redistribution of this article is forbidden by law without the expressed written consent of the publisher. For a free trial subscription to HPCwire, send e-mail to trial@hpcwire.tgc.com.
