Petascale Applications Symposium

Multilevel Parallelism and Locality-Aware Algorithms

Pittsburgh Supercomputing Center, June 22-23, 2007

Symposium chair: Nick Nystrom

The Pittsburgh Supercomputing Center hosted a Petascale Applications Symposium on Multilevel Parallelism and Locality-Aware Algorithms at PSC on June 22-23, 2007. The Symposium featured invited presentations and panels by leading developers of highly scalable software infrastructure.

The Symposium targeted researchers planning to respond to NSF's recent solicitation, Accelerating Discovery in Science and Engineering Through Petascale Simulations and Analysis (PetaApps), available at www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf07559. It provided a timely forum for exchange of ideas and for additional team building. Several breaks and a reception on Friday evening provided valuable time to meet the speakers and other participants.

The omnipresence of multicore processors in computers ranging from notebooks to leadership-class systems is only one step in our journey to petascale systems, which will comprise vast numbers of tightly coupled, manycore processors. Petascale systems will enable unprecedented discoveries throughout science and engineering. Their deep hierarchy of cores, processors, cache, memory, and interconnect will create exciting opportunities for new algorithms that exploit the architectural hierarchy to increase scalability, efficiency, and accessible simulation sizes. Scaling applications to O(10^6) cores, coupling multiphysics and multiscale applications, numerical convergence and stability, analysis and visualization, and hybrid programming models are only a few examples of the challenges and opportunities that will require the combined attention of multidisciplinary development teams.

Proceedings (PDF)

Software for Multicore Processors
  Geoffrey Lowney, Intel Fellow, Digital Enterprise Group, and Director, Compiler and Architecture Advanced Development, Intel Corporation

Future processors developed by Intel will have more than one core on a die. Multi-core processors will bring tremendous computing power to the desktop PC, enabling new classes of applications. These applications will be written using parallel programming techniques. In this talk I will present the tools Intel has developed to support parallel programming and discuss some of the ideas we are exploring to enable widespread adoption of parallel programming.
Slides (PDF)
Partitioned Global Address Space Languages for Multilevel Parallelism
  Katherine Yelick, U.C. Berkeley and Lawrence Berkeley National Laboratory

Languages like UPC, Titanium, and Co-Array Fortran map well onto both shared and distributed memory multiprocessors. They use direct memory operations (load/store) in shared memory and RDMA support on clusters, which takes advantage of the best hardware available on many machines. I will describe some of the experience using these languages on shared and distributed memory machines. Titanium has a small amount of language support for multilevel parallelism, and the Berkeley UPC compiler supports a similar language extension, which allows programmers to distinguish between pointers that refer to shared (on-node) memory and remote (off-node) memory. Titanium also uses a novel hierarchical pointer analysis that can distinguish between an arbitrary number of levels, which can be used to optimize programs for multilevel machines. In this talk I will give an overview of these results and some opportunities for further extending these languages to support deep machine hierarchies.
Slides (PDF)
Targeting Multicore Systems in Linear Algebra Applications
  Alfredo Buttari, Innovative Computing Laboratory, University of Tennessee, Knoxville

It is difficult to estimate the magnitude of the discontinuity that the high performance computing (HPC) community is about to experience because of the emergence of the next generation of multi-core and heterogeneous processor designs. The work that we currently pursue is the initial phase of a larger project in Parallel Linear Algebra for Scalable Multi-Core Architectures (PLASMA) that aims to address this critical and highly disruptive situation. While PLASMA's ultimate goal is to create software frameworks that enable programmers to simplify the process of developing applications that achieve both high performance and portability across a range of new architectures, the current high levels of disorder and uncertainty in the field of processor design make it premature to attack this goal directly. More experimentation is needed with these new designs in order to see how prior techniques can be made useful through recombination or creative application, and to discover what novel approaches can make our programming models sufficiently flexible and adaptive for the new regime. Preliminary results show that, whenever an operation can be expressed as a DAG, asynchronous and dynamic scheduling of subtasks represents a powerful yet very flexible approach.
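The DAG-driven execution the abstract describes can be sketched in a few lines: tasks become ready as soon as all of their inputs are complete, rather than waiting for a global synchronization point. This is a generic illustration (the task names are made up for the example, not PLASMA's actual kernels):

```python
from collections import deque

def schedule(tasks, deps):
    """Run each task once all of its dependencies have completed,
    in the spirit of DAG-driven dynamic scheduling."""
    indeg = {t: 0 for t in tasks}
    out = {t: [] for t in tasks}
    for before, after in deps:
        indeg[after] += 1
        out[before].append(after)
    ready = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()        # in a real runtime, a free core grabs this
        order.append(t)
        for nxt in out[t]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:    # all inputs available: task becomes ready
                ready.append(nxt)
    return order

# Toy dependency chain with illustrative tile-kernel names:
tasks = ["potrf", "trsm", "syrk", "gemm"]
deps = [("potrf", "trsm"), ("trsm", "syrk"), ("trsm", "gemm")]
print(schedule(tasks, deps))  # ['potrf', 'trsm', 'syrk', 'gemm']
```

The key property is that independent subtasks (here, syrk and gemm) become ready simultaneously and could execute on different cores with no global barrier.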
Slides (PDF)
Combinatorial Scientific Computing on Petascale Machines
  Ümit V. Çatalyürek, Ohio State University

The multicore processors and memory hierarchies of petascale machines will require algorithms that can exploit both parallelism at multiple levels and locality in memory accesses. We discuss how load balancing tools need to be designed for such machines. We also consider graph and hypergraph models for scheduling data accesses and iterations to enhance performance.
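The objective such graph models optimize can be made concrete with the classic edge-cut metric: edges crossing a partition boundary stand for communication between processors. A minimal sketch (not the partitioning tools the talk describes, which also cover hypergraph models that measure communication volume more accurately):

```python
def edge_cut(edges, part):
    """Count edges whose endpoints land in different parts -- a proxy
    for the communication a graph-partitioning load balancer minimizes."""
    return sum(1 for u, v in edges if part[u] != part[v])

# Toy mesh graph: four vertices connected in a square.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = {0: 0, 1: 0, 2: 1, 3: 1}   # neighboring vertices kept together
bad  = {0: 0, 1: 1, 2: 0, 3: 1}   # neighboring vertices split apart
print(edge_cut(edges, good))  # 2
print(edge_cut(edges, bad))   # 4
```

A locality-aware partitioner on a multilevel machine would apply this idea hierarchically: first across nodes, then across cores within a node.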
Slides (PDF)
A WAN-Capable I/O Channel for Petascale Resources
  Raghurama Reddy, Pittsburgh Supercomputing Center
Leopold Grinberg, Brown University

We will present compelling user experiences with an innovative I/O solution demonstrating that petascale applications developers may expect not only scalable I/O on the next generation of machines but also unprecedented interactive access to running simulations. Portals Direct I/O ("PDIO") middleware in use on PSC's Cray XT3 has facilitated this through an end-to-end parallel infrastructure that maximizes throughput even to remote nodes on a Wide Area Network while hiding network latency. We will show performance data and discuss portability and applicability to petascale machines.
Slides (PDF)
Petascale and Multicore Programming Models: What is Needed
  L. V. Kalé, Parallel Computing Laboratory, University of Illinois at Urbana-Champaign

The almost simultaneous emergence of multicore chips and petascale computers presents multidimensional challenges and opportunities for parallel programming. What kind of programming models will prevail? What are some of the required and desired characteristics of such models? I will attempt to answer these questions. My answers are based in part on my experience with several applications, including quantum chemistry, biomolecular simulation, simulation of solid propellant rockets, and computational astronomy.

First, the models need to be independent of the number of processors, allowing programmers to over-decompose the computation into logical pieces. Such models, including the 15-year-old Charm++, enable intelligent runtime optimizations. More importantly, they promote compositionality. Second, building on this compositionality, one needs a collection of parallel programming languages/models, each incomplete by itself but capable of interoperating with the others. Third, many parallel applications can be "covered" by simple, deterministic mini-languages, which lead to programs that are easy to reason about and debug. These should be used in conjunction with more complex but complete languages. Also, domain-specific frameworks and libraries, which encapsulate expertise and facilitate reuse of commonly needed functionality, should be developed and used whenever feasible. I will illustrate these answers with examples drawn from our CSE applications and from some relatively new programming notations we have been developing.
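The first recommendation, over-decomposition, can be sketched generically: the program creates many more logical work units than workers, and a runtime assigns them dynamically, so the decomposition never depends on the worker count. This is only an illustrative sketch of the idea, not the Charm++ API:

```python
import queue
import threading

def run_overdecomposed(work_items, n_workers):
    """Over-decompose: far more chunks than workers. Each worker pulls
    the next available chunk, so load balances dynamically."""
    q = queue.Queue()
    for item in work_items:
        q.put(item)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = q.get_nowait()
            except queue.Empty:
                return
            partial = sum(item)    # stand-in for a chunk of real computation
            with lock:
                results.append(partial)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

# 16 logical chunks scheduled onto 4 workers: the answer is independent
# of the worker count, which is the point of processor-independent models.
chunks = [range(i * 10, (i + 1) * 10) for i in range(16)]
print(run_overdecomposed(chunks, 4))  # 12720, same as sum(range(160))
```

A real runtime like Charm++ goes much further: the logical pieces are migratable objects, enabling measurement-based load balancing and fault tolerance.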
Slides (PDF)
Multilevel Parallel Geometric Proximity Detection for Unstructured Meshes
  H. Carter Edwards, Sandia National Laboratory

Geometric proximity detection is a performance critical capability within Lagrangian applications such as Presto, an explicit dynamics analysis code at Sandia National Laboratories. Geometric proximity detection is also a significant capability for coupled physics applications to project field data between independent unstructured meshes. A new geometric proximity detection algorithm is being investigated to support multilevel parallelism, which is likely to be required for future petascale systems with networked manycore nodes. A prototype linear octree algorithm with hybrid message-passing and threaded parallelism will be presented.
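The "linear octree" ordering underlying such algorithms is commonly built from Morton (Z-order) keys: interleaving coordinate bits places spatially nearby points close together in a sorted one-dimensional list. A generic sketch of the idea, not Sandia's implementation:

```python
def morton3(x, y, z, bits=10):
    """Interleave the bits of integer coordinates into a Morton key.
    Sorting points by this key yields a linear-octree ordering in which
    points in the same octant occupy a contiguous range."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

points = [(1, 2, 3), (1, 2, 4), (7, 0, 1), (1, 3, 3)]
linear = sorted(points, key=lambda p: morton3(*p))
# Proximity detection then scans only a narrow window of the sorted
# list instead of testing all point pairs.
print(linear)
```

In a hybrid parallel setting, contiguous key ranges partition naturally across nodes (message passing), while threads within a node share the local range.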
Slides (PDF)
Scalable Software for Many Core Chips: Programming Intel's 80-core Research Chip
  Tim Mattson, Principal Engineer, Microcomputer Technology Laboratory, Intel Corporation

Moore's law is alive and well. The semiconductor industry will continue to double the number of devices that can be integrated into a commercial CPU with each new process generation. How these devices will be used, however, is changing. A growing emphasis on power density coupled with limitations in fundamental device physics has ended the era of steadily increasing frequencies. The future expression of Moore's law will be in terms of the number of cores. We are at 4 cores today, and as we move forward, core counts will increase to 8, 16, 32, and beyond.

This move to many core chips will shake the industry to its foundations. Programming, chip layout, testing, memory architectures ... everything must be reconsidered as we transition to a new many core world.

Intel is aggressively working to understand this brave new world of many core processors. We have been researching ways to build a tiled CPU where every facet of the chip is scalable. Our first chip in this family of research chips was announced early this year. It is an 80-core microprocessor that delivered over 1.37 TFlop/s of single-precision peak performance at 4.27 GHz. While this project was not designed to support software development, we still took the time to write four application kernels for this chip. Running the chip at 4.27 GHz we obtained the following results:

Stencil       1.00 TFlop/s
SGEMM         0.51 TFlop/s
Spreadsheet   0.45 TFlop/s (asymptotic)
2D FFT        20 GFlop/s
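The quoted peak rate is consistent with a simple per-tile arithmetic, assuming each of the 80 tiles carries two floating-point units that each retire a multiply-add (2 flops) per cycle; that per-tile configuration is an assumption for the sake of the arithmetic, not a figure stated in the abstract:

```python
# Back-of-envelope peak-rate check for the 80-core research chip.
cores = 80
fp_units_per_core = 2          # assumed: two FP units per tile
flops_per_unit_per_cycle = 2   # assumed: one fused multiply-add per cycle
ghz = 4.27

peak_tflops = cores * fp_units_per_core * flops_per_unit_per_cycle * ghz / 1000
print(round(peak_tflops, 2))  # 1.37
```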

In this talk, we will discuss the design of this chip, the software developed to test the chip, and the implications this program suggests for future many core chips. We will close by considering how many core chips such as this would fit into a large scale, petaFLOP computer system.

Slides (PDF)