Parallel Software and the PSC Community: Breaking the Barrier
Sergiu Sanielevici, Manager of Parallel Applications, PSC
sergiu@psc.edu
- The Challenge of Realism
From cosmology to biochemistry, from regional severe weather prediction to global climatology, from automotive and aerospace engineering to the creation of new materials and processes, there are thresholds of realism and timeliness below which simulation results are insufficient, irrelevant, or misleading. A numerical model must describe enough detail to capture the essential dynamics that drive the physical reality.
A large, powerful, balanced, and stable supercomputing system (hardware and system software) is only the beginning. It is up to each scientific or engineering application to use the raw potential of the supercomputer efficiently enough to meet the requirements of the practitioner, such as: "Evolve a 50,000 atom system over 5 nanoseconds of simulation time within 8 hours of wallclock time" or "Complete the 7-hour storm-scale weather forecast run in 20 minutes wallclock time".
- PSC's T3D: First to Realism
In many areas of science and engineering, the first-ever crossing of this
"threshold of realism" was achieved in the period 1994-1996, on the PSC's
massively parallel Cray T3D. Breaking the realism barrier is heralded by
simulations that reproduce previously unexplained experimental results and
that, better still, predict results yet to be confirmed by improved
experiments. For example, Brooks et al. discovered a small amount of helical
structure, falling below experimental detection limits, which appears in the
dynamics of the unfolded state of a protein and then serves as the nucleus for
protein folding.
Kollman et al. overcame limitations of previous simulations of DNA and RNA,
which had erroneously predicted these molecules to be unstable and failed to
reproduce the observed convergence of variant forms of DNA.
Similarly, the MICOM group's T3D simulations were the first to achieve sufficient resolution in modeling of circulation in the Atlantic Ocean to correctly predict the course of the Gulf Stream.
Woodward and Porter's simulations of compressible convection beneath the
surface of the sun revealed new, unexpected features while showing other
features predicted by theory, which all previous simulations had lacked
sufficient resolution to test. Only at the resolution afforded by the full
512-processor T3D were Taylor et al. able to confirm that interactions among
booster plumes were responsible for the malfunction of a Delta II rocket.
During their Spring 1996 operational testing campaign, Droegemeier et al.
were able to run their ARPS model at 3 km resolution over the prediction
region, resulting in successful forecasts on eight of the 10 days on which
storms occurred, an unprecedented success rate. However, a seven-hour
forecast took about 100 minutes to complete, too long for true operational
use. Operational use will require the forecast run to complete in 15-30
minutes at a resolution of 1 km, and the forecast region will also need to
be enlarged.
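To make the size of that gap concrete, here is a minimal back-of-the-envelope sketch. The 100-minute, 15-30 minute, 3 km and 1 km figures come from the text above; the cost model for grid refinement (work growing with the square of the horizontal refinement, times one more factor if the time step must shrink in proportion) is an assumption for illustration only, not a statement about how ARPS actually scales. The function names are hypothetical.

```python
def required_speedup(current_minutes: float, target_minutes: float) -> float:
    """Factor by which wallclock time must shrink to reach the target."""
    return current_minutes / target_minutes


def refinement_cost_factor(old_km: float, new_km: float,
                           time_step_scales: bool = True) -> float:
    """Rough extra work from refining the horizontal grid spacing.

    ASSUMPTION: the number of grid points grows with the square of the
    horizontal refinement ratio, and (optionally) the number of time steps
    grows linearly with it. Vertical resolution is assumed unchanged.
    """
    ratio = old_km / new_km
    factor = ratio ** 2          # more points in each horizontal direction
    if time_step_scales:
        factor *= ratio          # proportionally more, shorter time steps
    return factor


if __name__ == "__main__":
    current = 100.0  # minutes for a seven-hour forecast at 3 km (from the text)
    for target in (30.0, 15.0):
        print(f"target {target:.0f} min: "
              f"{required_speedup(current, target):.1f}x faster at fixed resolution")
    print(f"assumed extra work for 3 km -> 1 km: "
          f"{refinement_cost_factor(3.0, 1.0):.0f}x")
```

Under these assumptions the time target alone calls for roughly a 3x to 7x speedup, before accounting for the finer grid or the larger forecast region.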
- Developing Efficient Parallel Applications: Challenges and Results
While it is now clear that large-scale parallel systems are needed to achieve realistic scientific and engineering simulations, developing efficient applications software for such machines remains a complex task. The problem must be mapped onto the architecture of the machine in such a way that each processor stays as busy as possible (and its design features are used to the maximum); when results need to be stored and new inputs fetched, each processor should be able to use the closest possible memory store, minimizing the time it must wait before it can resume computation. Communication between processors, like I/O, is overhead, and should be managed thriftily. Complicating matters, maximum performance on a given brand of parallel supercomputer can still be achieved only at the expense of some portability to other machines.
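One common way to reason about the communication overhead described above is a surface-to-volume estimate. The sketch below is illustrative only: it assumes a cubic 3D grid divided evenly into cubic blocks, one block per processor, with a one-cell-deep boundary layer exchanged across each of the six faces. None of these assumptions is taken from a specific PSC code, and the function names are hypothetical.

```python
def block_edge(global_edge: int, procs_per_side: int) -> int:
    """Edge length (in cells) of each processor's cubic block.

    ASSUMPTION: the global grid is global_edge^3 cells and is split evenly
    over procs_per_side^3 processors.
    """
    if global_edge % procs_per_side != 0:
        raise ValueError("grid does not divide evenly among processors")
    return global_edge // procs_per_side


def halo_to_interior_ratio(edge: int) -> float:
    """Cells exchanged with neighbors per cell computed locally.

    ASSUMPTION: a one-cell-deep layer is exchanged on each of six faces.
    """
    interior = edge ** 3      # work done locally
    halo = 6 * edge ** 2      # data that must be communicated
    return halo / interior


if __name__ == "__main__":
    # 512 processors arranged as 8 x 8 x 8 (512 = 8^3).
    for global_edge in (64, 512):
        edge = block_edge(global_edge, 8)
        print(f"global {global_edge}^3 -> block {edge}^3, "
              f"halo/interior = {halo_to_interior_ratio(edge):.3f}")
```

In this simple model the ratio is 6 divided by the block edge, so larger blocks per processor spend proportionally less effort on communication, which is one reason why the same code can behave very differently on small and large problem sizes.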
Given enough time, an applications developer could revisit the simulation algorithm and modify or replace it to meet these requirements. Even though many PSC user groups develop and use their own codes, few can afford to scrap programs developed over many years for workstations or vector supercomputers. This is even more true, of course, for software vendors and community code developers and for those PSC users whose research relies on such codes. Yet the above examples demonstrate that many PSC users have succeeded in achieving scientific breakthroughs on the T3D... as do the following numbers:
Our T3D has served 1035 users working on 175 projects since 1993. Weekly system utilization (processor-hours used vs. total possible) consistently exceeded 80% over the past year, approximately 25% of which went into runs using the full set of 512 processors. The average performance over the 20 most utilized codes, each code's performance being weighted by its share of the yearly usage, is 23 Mflops/PE (the theoretical peak being 150 Mflops/PE). The PSC record for a complete scientific code is held by Alan Benesi's NMR analysis code, whose 76.5 Mflops per processor scales linearly to 39.2 Gflops on 512 processors (out of 76.8 Gflops theoretical peak).
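The arithmetic behind these figures can be checked with a short sketch. The per-processor rates, the processor count, and the 150 Mflops/PE peak are taken from the paragraph above; the helper names are hypothetical, and the usage shares passed to the weighted mean are made-up illustrative values, since the text reports only the resulting 23 Mflops/PE.

```python
T3D_PEAK_MFLOPS_PER_PE = 150.0   # theoretical peak per processor (from the text)
T3D_PES = 512                    # processors in the full machine (from the text)


def aggregate_gflops(mflops_per_pe: float, n_pes: int) -> float:
    """Total rate in Gflops, assuming perfectly linear scaling."""
    return mflops_per_pe * n_pes / 1000.0


def fraction_of_peak(mflops_per_pe: float,
                     peak: float = T3D_PEAK_MFLOPS_PER_PE) -> float:
    """Sustained rate as a fraction of the theoretical peak."""
    return mflops_per_pe / peak


def usage_weighted_mean(perf_and_usage):
    """Mean performance, each code weighted by its share of usage.

    perf_and_usage: iterable of (mflops_per_pe, usage) pairs, where usage
    can be in any consistent unit (for example processor-hours).
    """
    pairs = list(perf_and_usage)
    total_usage = sum(usage for _, usage in pairs)
    return sum(perf * usage for perf, usage in pairs) / total_usage


if __name__ == "__main__":
    print(f"record code: {aggregate_gflops(76.5, T3D_PES):.1f} Gflops "
          f"({fraction_of_peak(76.5):.0%} of peak)")
    print(f"machine peak: {aggregate_gflops(T3D_PEAK_MFLOPS_PER_PE, T3D_PES):.1f} Gflops")
    print(f"23 Mflops/PE average: {fraction_of_peak(23.0):.0%} of peak")
    # HYPOTHETICAL usage data, only to show how the weighting works.
    print(usage_weighted_mean([(10.0, 3.0), (40.0, 1.0)]))
```

The record code thus sustains about half of the theoretical peak, while the usage-weighted average over the 20 most utilized codes is about 15% of peak.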
The successor to the T3D, our 256-processor Cray T3E, has operated in friendly user mode since October 1996 and is scheduled to go into production on April 1. So far, it has served 132 users working on 41 projects. Its weekly utilization has consistently exceeded 80%, of which almost 25% involved the full set of 256 processors. The speedup over the T3D depends on the code as well as on the specific problem dataset, being affected to varying degrees by the doubling of the processor speed, dual instruction issue, the addition of a secondary cache, other differences in memory access, and the still evolving maturity of library optimizations. We have seen speedups ranging from a factor of two to a factor of ten.
- The Key: An Open Development Community
How were so many PSC users able to produce science despite the complexities of applications software development for large-scale parallel computers? We have managed to pull together an open parallel applications development community involving people and groups both inside and outside PSC, who have a stake in the effective use of this machine. The joint PSC-Cray Parallel Applications Technology Program (PATP) established an extraordinary feedback and cross-fertilization loop by which facts, fixes, tips and tricks flow rapidly from Cray's T3D and T3E teams to PSC's internal application developers and user consultants, and on to the PSC's users and development partners.
In return, PSC provides the vendor with large amounts of accelerated information on the behavior of their products under a heavy, heterogeneous workload of real, large-scale, demanding applications. This leads to the rapid and continuous performance improvement of both the systems and tools provided by the vendor, and of the applications codes. Both for the T3D and the T3E, PSC's codes were the first fully fledged applications to run on simulators and prototype hardware in Eagan and Chippewa Falls. PSC's close ties with the world-leading School of Computer Science at CMU play an important role in this positive-feedback process. For example, CMU SCS researchers Gross and Stricker have made important contributions to understanding and improving the routing mechanisms in the T3D's interconnection network, and the performance of various memory access patterns on the T3D and T3E.
Internally at PSC, parallel applications development and research is being done within the Parallel Applications group, the Biomedical Applications group, and the User Consultants group. A group of Cray Research parallel applications specialists has resided at PSC since 1993. Common meetings and activities across these groups ensure that successes and problems, tricks and caveats spread rapidly to all who need to know. People from all these groups lecture and conduct hands-on lab sessions at PSC's MPP Workshops, offered since 1993, with online lecture notes also available on the Web. These workshops have played a crucial role in disseminating parallel programming know-how. Since 1993, PSC has taught 26 workshops with a parallel programming focus, attended by more than 500 scientists, engineers, students, educators, and military personnel.
Packages that have been ported or optimized at PSC under PATP include AMBER, CHARMM, GAMESS, MNDO94, X-PLOR, Shake'n'Bake, and Spectrum-3D. In many cases, this work has focused on scientific advances leading to functional improvements, such as the implementation of the PME method in AMBER (which enabled Kollman et al. to break through the realism barrier, as mentioned above). Other packages and utilities were developed at PSC, such as the Biomedical Applications group's parallel genome search, parallel functional MRI, and parallel neural simulator packages, and parallel libraries for simulated annealing, FFTs, eigenvalue problems, finite element problems, and file I/O. Numerous public domain and research codes have received parallelization and optimization assistance, including Droegemeier et al.'s ARPS program and a number of materials science codes.
- You ain't seen nothing yet!
As parallel computing begins to enter the mainstream, and vendors begin to implement the "Lego approach" to building parallel systems that need to scale from the deskside quad to the thousand-processor monster, the challenge of parallel software development is becoming omnipresent. Unprecedented opportunities open up for computer simulations to punch through the barriers raised by more and more realistic problems. But as we have seen, the price of admission is to develop and optimize your application to take effective advantage of these new architectures.
PSC's parallel applications development community is tackling this challenge, and is eager to welcome new members. We are focusing on T3E optimization, while experimenting aggressively with the emerging hierarchical memory architectures. We keep assessing the state of programming styles and tools, such as HPF and parallel-C++ versus message passing, and the circumstances in which one or the other may be the best choice for a given code. We are setting up exciting experiments with the computer science community, in areas such as metacomputing, problem solution environments, and integrated simulation execution/databank decision support systems. We are interested in developing joint projects with disciplinary computational scientists, aimed at achieving scientific breakthroughs and building parallel software infrastructure.
Please contact Sergiu at (412) 421-2606 or send email to sergiu@psc.edu, to discuss parallel applications development projects.