Social Security Number Vulnerability Findings Relied on Supercomputing

Access to a large-scale parallel system at Pittsburgh Supercomputing Center made it possible to overcome difficulties and greatly accelerated time to solution.

PITTSBURGH, Pa., July 8, 2009 – Information available on the Internet can in certain cases be used to predict individual Social Security numbers, posing a risk of identity theft that policy-makers and individuals should address.

Alessandro Acquisti, Carnegie Mellon University

This finding, an unexpected consequence of public information in modern economies, was published Monday, July 6, in Proceedings of the National Academy of Sciences (PNAS) and highlighted in the New York Times (July 7) and other national media. The work relied on computational resources of the TeraGrid, a National Science Foundation cyberinfrastructure program. It would have been difficult, if not impossible, to obtain these findings without these publicly funded high-performance computing (HPC) resources, says one of the lead researchers, Alessandro Acquisti, a professor at Carnegie Mellon University.

About a year ago, at an important phase in the project, Acquisti, his colleague Ralph Gross, a post-doctoral researcher, and several graduate students who worked with them began using a large-scale parallel computing system at the Pittsburgh Supercomputing Center (PSC). “At that stage,” says Acquisti, “we had a rough idea of the results, but to go forward we had to try many different variations of the algorithms. It would have been incredibly difficult to do this, or taken much, much longer without access to this system.”

After first working with desktop computers, the researchers turned last year to a PSC system called Pople (named for Nobel laureate chemist John Pople of Carnegie Mellon). A Silicon Graphics Altix 4700, installed in March 2008, Pople has 768 cores (processors) and 1.5 terabytes of shared memory (all of the memory is accessible from each core). The SSN runs used up to 400 of Pople’s cores and 800 gigabytes of memory, a requirement large enough that Pople’s shared memory was very helpful to the project.

TeraGrid staff at PSC installed Octave, an open-source programming language largely compatible with MATLAB, and wrote a script to submit a large number of parallel Octave jobs simultaneously on Pople. This facilitated the Acquisti team’s interactive process, which involved doing many runs representing different states and computational strategies, checking and analyzing results, and rethinking before running more variations. PSC’s consulting, said Acquisti, was “extremely helpful.”
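
For readers curious what submitting many Octave jobs at once can look like, here is a minimal Python sketch. It is not the script PSC staff wrote, which this release does not describe. The Octave script name run_regressions.m, its two arguments, and the state and strategy labels are all hypothetical. The sketch also launches local processes directly, whereas a shared system such as Pople would normally go through its batch scheduler.

```python
import itertools
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_commands(states, strategies, script="run_regressions.m"):
    """Return one Octave command line per (state, strategy) pair.

    The script name and its two positional arguments are hypothetical;
    inside Octave they would be read with argv().
    """
    return [
        ["octave", "--quiet", script, state, strategy]
        for state, strategy in itertools.product(states, strategies)
    ]


def run_all(commands, max_parallel=4):
    """Run the commands concurrently; return their exit codes in input order."""

    def run_one(cmd):
        # Each job is an independent process, so threads are only used
        # here to wait on several child processes at the same time.
        return subprocess.run(cmd, capture_output=True, text=True).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    jobs = build_commands(["PA", "NY"], ["strategy_a", "strategy_b"])
    if "--run" in sys.argv:
        # Requires Octave and the hypothetical script to be present.
        print(run_all(jobs, max_parallel=2))
    else:
        # Dry run: show what would be launched.
        for cmd in jobs:
            print(" ".join(cmd))
```

Because every run is independent of the others, the work divides naturally into separate jobs, which is what lets a team try many variations, inspect the results, and launch the next batch.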

One fairly unassuming graphical figure in the PNAS paper, notes Acquisti, represents results of “more than 700,000 regressions over very large sets of data,” which to computational scientists gives a sense of the immense computational scope of the problem.

“This project,” said Sergiu Sanielevici, PSC director of scientific applications and user support, who also leads user support and services for the TeraGrid, “exemplifies how powerful systems like Pople can open doors to data-mining and data-centric research in fields not traditionally associated with HPC, such as the social sciences, and make it possible to get answers that would otherwise be impractical or impossible.”

PSC supported this project through the NSF TeraGrid program, which allocates large-scale computing resources free to researchers at U.S. universities on a peer-review proposal basis.

Carnegie Mellon graduate students Jimin Lee, Ihn Aee Choi, Dhruv Deepan Mohindra, and Ioanis Alexander Biternas Wischnienski collaborated in this research with Acquisti and Gross and did much of the hands-on computational work.

Further information about the research: http://www.ssnstudy.org