Creating Cyberinfrastructure, 2007
The TeraGrid is the world’s most comprehensive distributed cyberinfrastructure for open scientific research. As a major partner in this National Science Foundation program, PSC helps to shape the vision and progress of the TeraGrid.
BigBen, PSC’s Cray XT3, the first Cray XT3 anywhere, is the TeraGrid’s lead performer among “tightly coupled” architectures. In late 2006, PSC more than doubled BigBen’s capability, replacing the existing processors of the 2,090-processor system with top-end, dual-core (2.6 GHz) Opteron chips. The upgrade doubled the processor count to 4,180 and boosted peak performance to 21.5 teraflops, while also doubling memory (from two to four terabytes).
BigBen’s primary advance is its superior inter-processor bandwidth, the speed at which processors share information. This is a large advantage for projects that demand hundreds or thousands of processors working together. On many applications, BigBen has demonstrated performance 10 or more times better than prior tightly coupled systems, and it is a champion at “scaling”: the ability to use a large number of processors without seriously reducing per-processor performance (e.g., “Bursts of Stellar Turbulence,” pp. 34-37).
“The Cray XT3 has proven itself as a scientific platform of exceptional capability,” said PSC scientific directors Michael Levine and Ralph Roskies. “Since becoming a production resource on the TeraGrid, this new system has made possible a number of remarkable achievements.”
In September, David Moses, co-founder and former chief operating officer of Gaussian, Inc., joined PSC as executive director, managing day-to-day internal operations and overseeing a scientific and technological staff of about 75 people. His hiring concluded an extensive national search.
“We are very pleased that David will help us to carry forward our leadership in high-performance scientific computing,” said PSC scientific directors Michael Levine and Ralph Roskies. “His energy, enthusiasm, solid judgment, and extensive experience in the organizational as well as the technical aspects of large-scale, collaborative computational science will catalyze our efforts and benefit the national community of scientists with whom we work.”
PSC staff whose work contributes to TeraGrid include (l to r): Laura McGinnis, Michael Schneider (seated), Nathan Stone, Josephine Palencia, Sergiu Sanielevici, Shandra Williams, Kathy Benninger, Rob Light, David O’Neal (on floor), R. Reddy, Derek Simmel, Rich Raymond, Jim Marsteller.
PSC is actively involved in TeraGrid leadership. Scientific director Ralph Roskies serves on the executive steering committee of the Grid Infrastructure Group that guides TeraGrid, and co-scientific director Michael Levine is the principal contact for PSC, one of 11 TeraGrid resource providers.
Other PSC staff also have taken leadership roles in TeraGrid. As Area Director for User Support, PSC’s Sergiu Sanielevici manages TeraGrid’s user-support services and coordinates the ASTA program — Advanced Support for TeraGrid Applications. Jim Marsteller, who heads PSC’s security, chairs the TeraGrid Security Working Group, responsible for risk assessments and incident response. PSC’s Laura McGinnis co-leads the TeraGrid team that is developing an education, outreach and training program called “HPC University,” and will chair poster sessions at the TeraGrid ’08 conference. PSC director of systems and operations, J. Ray Scott, leads the TeraGrid effort in Data Movement.
Other recent PSC work in support of TeraGrid includes:
- Securing Community Accounts: In a major effort, PSC systems engineer Aaron Shelmire implemented a security model, adapted from NCSA-developed software, that reconciles the community-wide reach of TeraGrid Science Gateways with the secure environment of a large-scale system. Other TeraGrid sites are consulting with PSC on details of the implementation that they can emulate.
- CTSS Compliance: PSC grid specialist Derek Simmel identified a problem in the Coordinated TeraGrid Software and Services (CTSS) that obstructed user authentication. He devised a fix, solicited comment from the working groups involved, and created a change plan that was implemented by modifying each resource provider’s CTSS configuration.
- Networking: The TeraGrid Data Working Group recommended that all TeraGrid sites deploy HPN-SSH, a security protocol for network communications with performance enhancements, developed by PSC’s Chris Rapier. Rapier coordinated with the sites to ensure that it was installed appropriately. Ben Bennett and others on PSC’s network staff added “multi-threading” to OpenSSH’s encryption operations, resulting in a 40-percent speedup in transfer rate. PSC staff also deployed, on TeraGrid’s network-monitoring computers, a TeraGrid version of its NPAD diagnostic service, co-developed with NCAR, which analyzes and diagnoses network path failures.
- Speedpage: PSC staff updated “speedpage,” a PSC-created TeraGrid resource that measures file-transfer performance among TeraGrid sites. The update includes a naming convention that PSC staff designed and shepherded into place. PSC is migrating speedpage to a new PSC information server, where it will perform better and share its database with other performance-measurement tools.
- File Systems: Drawing on experience with file systems across a range of architectures, PSC staff worked with Indiana University to implement the flexible, open-source Lustre-WAN file system, which is deployed on PSC’s TeraGrid resources, on test machines at Indiana. Performance in writing data between PSC and Indiana is only 20 percent lower than with similarly configured local file systems.