Pittsburgh Supercomputing Center Scientists Patent Software for Protecting Supercomputing Results Against System Failures

PITTSBURGH, April 8, 2013 — Scientists at Pittsburgh Supercomputing Center (PSC) have patented ZEST, a piece of software that takes a rapid “snapshot” of a supercomputer’s calculations as it works. ZEST greatly speeds the ability to store complex calculations as a hedge against a system failure, saving precious supercomputing time and slowing calculations down far less than current methods.

The PSC co-inventors of ZEST are Paul Nowoczynski, Jason Sommerfield, Nathan Stone, and Jared Yanovich.

Just as we all hit “save” as we work, scientists carrying out vast computations such as those required for detailed weather predictions or earthquake science need to periodically store — “checkpoint” — the machine’s computational state. In the case of a system malfunction, this allows them to avoid having to start from the beginning after hours or days of work.

The problem, according to J. Ray Scott, Director of Systems and Operations at PSC, is that retrieving and storing these data takes time away from calculation, and computing time on highly in-demand supercomputers is carefully rationed among researchers. In fact, he adds, over the last seven years the memory available in the largest machines has increased about 25-fold, while the capacity for moving that memory to and from storage has increased only about four-fold.
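
To make the gap concrete, here is a back-of-the-envelope sketch in Python. It assumes, as a simplification not stated by PSC, that the time to checkpoint a machine's full memory scales as memory size divided by storage throughput. The function name and that scaling model are illustrative only.

```python
def relative_checkpoint_time(memory_growth, storage_growth):
    """Factor by which a full-memory checkpoint slows down.

    Assumes (hypothetically) that checkpoint time is proportional to
    memory size divided by storage throughput.
    """
    return memory_growth / storage_growth


# Figures quoted above: memory up about 25-fold, storage capacity about four-fold.
slowdown = relative_checkpoint_time(25, 4)
print(f"A full checkpoint takes roughly {slowdown:.2f}x as long as before")
```

Under that assumption, saving the whole machine's state would take roughly six times as long as it did seven years earlier, which is the pressure ZEST is meant to relieve.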

“If you have a large job, checkpointing the run often means writing out tens of terabytes of data” — enough to fill about a thousand new iPads, Scott says. “This takes a nontrivial amount of time. The whole time you’re doing the checkpoint, you’re not using the computer.”

The ZEST software works by tightly managing the supercomputer’s disk drives, continuously routing checkpoint storage to disks that aren’t being used for computation.

“Every disk drive holds up its hand and says, ‘I can take these data now,’” Scott explains. This “pull-based model” ensures the checkpointing conflicts as little as possible with the computer’s own use of the drives. “You’re always writing to whomever’s the most available.”
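
The difference between the two models can be illustrated with a toy simulation in Python. This is a minimal sketch of the general idea, not ZEST's actual algorithm: the function names, the chunk sizes, and the drive speeds are all made up for illustration, and a "slow" drive stands in for one that is busy with other work.

```python
import heapq


def pull_schedule(chunks, drive_speeds):
    """Pull model: whichever drive becomes free first takes the next chunk."""
    # Heap of (time the drive becomes free, drive index).
    free_at = [(0.0, i) for i in range(len(drive_speeds))]
    heapq.heapify(free_at)
    for size in chunks:
        t, i = heapq.heappop(free_at)  # the drive "holding up its hand"
        heapq.heappush(free_at, (t + size / drive_speeds[i], i))
    return max(t for t, _ in free_at)  # time the last drive finishes


def push_schedule(chunks, drive_speeds):
    """Push model: chunks go round-robin to drives, ready or not."""
    busy = [0.0] * len(drive_speeds)
    for n, size in enumerate(chunks):
        i = n % len(drive_speeds)
        busy[i] += size / drive_speeds[i]
    return max(busy)


# Hypothetical workload: 12 equal chunks, three drives, one of them
# running at a quarter speed because it is busy elsewhere.
chunks = [1.0] * 12
speeds = [1.0, 1.0, 0.25]
print("pull:", pull_schedule(chunks, speeds))
print("push:", push_schedule(chunks, speeds))
```

In this toy case the push model waits on the slow drive to finish its full share, while the pull model lets the two faster drives absorb most of the work.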

ZEST is far more efficient than current methods, which “push” data to disks whether or not they’re ready to receive it. ZEST has demonstrated 90 percent of the theoretical maximum efficiency of writing data to drives; currently available commercial systems have efficiencies of 25 percent or less.
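
As a rough illustration of what those efficiency figures imply, the Python sketch below compares write times at 90 percent and 25 percent of a drive system's peak rate. The checkpoint size and peak bandwidth are hypothetical placeholders, not PSC measurements; only the two efficiency figures come from the text above.

```python
def write_time_seconds(data_tb, peak_tb_per_s, efficiency):
    """Time to write data_tb terabytes at a given fraction of peak bandwidth."""
    return data_tb / (peak_tb_per_s * efficiency)


# Hypothetical example values, chosen only to show the ratio.
data_tb = 20.0        # assumed checkpoint size, in terabytes
peak_tb_per_s = 0.01  # assumed peak aggregate drive bandwidth

t_zest = write_time_seconds(data_tb, peak_tb_per_s, 0.90)
t_other = write_time_seconds(data_tb, peak_tb_per_s, 0.25)
print(f"speedup: {t_other / t_zest:.1f}x")
```

The ratio does not depend on the assumed size or bandwidth: at 90 percent versus 25 percent efficiency, the same checkpoint finishes about 3.6 times faster, and more than that where the competing system falls below 25 percent.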

About PSC: Pittsburgh Supercomputing Center is a joint effort of Carnegie Mellon University and the University of Pittsburgh together with Westinghouse Electric Company. Established in 1986, PSC is supported by several federal agencies, the Commonwealth of Pennsylvania and private industry, and is a partner in the National Science Foundation XSEDE program.

# # #