Applications Needs for High-Performance Scalable Systems

Submitted to:
America in the Age of Information:
A Forum on Federal Information and Communication R&D
sponsored by the
Committee on Information and Communications R&D (CIC)
of the National Science and Technology Council (NSTC)
By:
Michael J. Levine and Ralph Z. Roskies
Scientific Directors
Pittsburgh Supercomputing Center
4400 Fifth Avenue
Pittsburgh PA 15213
(e-mail levine@psc.edu, roskies@psc.edu)

We argue that a focus on applications' needs for high-performance, scalable computing systems and software, as developed through multidisciplinary work in open research environments at university-based centers for high-performance computing, constitutes an additional "tasking paradigm", complementary to the calling out of mission-oriented and other specific goals. Work in such centers looks beyond the 3-year, short-term horizon of most corporate research and so provides a longer-range view. Its results not only drive advanced technology in a balanced fashion but also provide advances in most of the Grand and National Challenges.

The past several years have demonstrated the use of advanced computation to cross the "threshold of realism" in numerous fields. In some, it is the "simple" increase in resolution (supported by very large memories and very high capability machines) to levels appropriate to the problem which finally produces qualitatively and quantitatively correct results, though the underlying theory was "correct" all along. (For example, see the recent work by Bleck et al. (University of Florida), where increased resolution has clarified previously confusing structure in the Gulf Stream flow.) In others, it is the ability finally to include sufficient basic theory to "get it right". (Examples include the Grand Challenge work on cosmology, headed by Jerry Ostriker (Princeton University).) In still others, it is the long-awaited transition from 2-dimensional to 3-dimensional simulations (e.g., the work by Peskin and McQueen (Courant Institute), which has finally given us a realistic three-dimensional model of blood flow in the heart). In all cases, the advance requires the support of very large memories and machines of very high overall computational capability.

This class of applications not only advances scientific knowledge and engineering practice but also requires advanced computing systems whose capabilities go well beyond those of systems designed to provide cost-effective capacity for the vast bulk of computing needs. In hardware, this drive is for more capable single processors, for their local aggregation in numbers as large as is affordable, and for their broader, perhaps episodic, aggregation in numbers as large as our collaborative capability will support. By focusing on the "whole systems" required by "full applications", we are tasked to produce balanced systems.

Although centers for computational science are unlikely to produce the advances in basic technology which are the proper ground for materials scientists and for electrical and computer engineers at universities and vendors, those centers can and do drive the developers of basic technology and play a key role in full system creation and targeting. While individual vendors are often quite limited in the scope of their offerings and have limited motivation to create interoperable systems, centers have the power to motivate vendors and to integrate heterogeneous systems.

Given limited funding, the very expensive, advanced systems, running up to as many as 1000 high-performance processors in a single system, are a scarce resource. However, they alone provide the kind of capability required to do work beyond the 3-year horizon. Their expense must not obscure the important role they play. The benefits of scaling to very large machines are as important to advances in computational science as are the benefits of scalability to small numbers of processors, although it is the latter limit that broadens the base of available software. In addition to their expense, these large systems present substantial challenges to OS, tool and applications programmers, particularly as their memory and multi-processor hierarchies become more complex.

Aggregation of large numbers of machines, for problems able to use such massive resources productively, can provide a powerful "zoom" capability layered atop large individual machines. The advanced problems in wide-area networking and heterogeneous systems that attend such "colony machines", together with the software problems they share with individual machines, tax the most advanced computer science research projects.

Although the work at such centers benefits not only the advancement of scientific knowledge but also vendor-supplied computing technology and the production computing tools used by industry, it properly belongs within the research paradigm exemplified by our leading research universities. The production of software, at all levels, is not only difficult but, at the leading edge, "speculative": many approaches do not succeed! However, the computer scientists and computational scientists found in many of our universities work together to delimit and to solve problems in both fields. Examples abound, especially in the Grand Challenge collaborations. As particular cases, we can cite the work of Dennis Gannon (Indiana University), whose development of pC++ has been strongly influenced by the applications demands of cosmologists in the Grand Challenge on Computational Cosmology and has greatly enhanced the ability of these applications codes to run cooperatively on multiple platforms. Or we can cite the work of Bernd Bruegge (Carnegie Mellon University), whose software engineering class has designed sophisticated user interfaces for the geographical information systems which are built upon the Grand Challenge in Air Quality modeling. The compute engines behind these interfaces run on powerful platforms such as the Cray T3D, the Cray C90 and workstation clusters.

In addition to the work which advances the capability of computing hardware and software resources, there is a history, at such centers, of major efforts in the creation and porting of advanced codes. Here too, the research paradigm provides benefits. This adapting of applications codes to progressively more capable (and more complicated) systems is, at its core, research work. It is not the "sweet spot" software targeted at millions or, in most cases, even thousands of potential users. However, it is the leading edge in advancing technology, where dedicated scientists, engineers, computer scientists and craftsmen discover the new algorithms and methodologies which at first yield new scientific results and later provide the advanced capabilities for commercial exploitation.

To summarize these thoughts in the framework of the identified cross-cutting issues:

Realistic Technology Expectations for 2000 and beyond and the Federal Role:

With proper support for large systems, by the year 2000 we will see many diverse scientific users productively exploiting single systems composed of thousands of workstation-class multi-processors for the solution of single scientific problems well beyond those which are embarrassingly parallel. Efficiency will be attained through the experience gained with load-balancing and communications strategies developed in the solution of important scientific research problems. Effective use of these systems will require work in both computer science and computational science.

These large systems will not be at the market sweet spot which, for example, SMP vendors are likely to develop on their own in the next 5 years. Rather, they will be the window onto the systems to be developed in the next decade. As such, they need federal support in the intervening period.

Implementation Strategies:

The research community needs platforms on which to develop the experience for productive use of such large systems. This entails both establishing a few centers with large computational power and establishing high-speed links to tie such centers to one another and to the high-end users. While some of these centers will be in support of mission-oriented goals, others must be in open research environments.