Pittsburgh Supercomputing Center 

Advancing the state-of-the-art in high-performance computing,
communications and data analytics.

Data-Intensive Analysis, Analytics and Informatics

Proceedings

Introduction

Nick Nystrom, Pittsburgh Supercomputing Center
View [PDF]    Download [PDF]

Opening Talk: Data Intensive Supercomputing

Randy Bryant, Carnegie Mellon University: Data Intensive Scalable Computing: Finding the Right Programming Models
View [PDF]    Download [PDF]

Data Analysis Requirements from Instruments and Sensors

Duncan Brown, Syracuse University: Computational Challenges in Gravitational Wave Astronomy
View [PDF]    Download [PDF]
Thomas Hacker, Purdue University: NEEShub: A Data Cyberinfrastructure for Earthquake Engineering
The National Science Foundation George E. Brown Jr. Network for Earthquake Engineering Simulation (NEES) is a large-scale distributed science and engineering infrastructure of 14 earthquake engineering laboratories located across the United States. Data produced within the NEES network consist of sensor-generated data, images, videos of the experiments, documentation, and papers. The most pressing problems encountered by producers and consumers of data cyberinfrastructure today include managing the gathering and upload of thousands of data files; ensuring the coherency, authenticity, and integrity of those files; and providing mechanisms to collect sufficient metadata to ensure long-term data accessibility and preservation. These problems confront researchers now and will also be faced by future consumers of the data. This talk describes the approach we are developing for NEES to address them. Our approach, the NEEShub, provides a data cyberinfrastructure that integrates computation and data so that researchers and practitioners can easily upload new data, explore experimental data, and download project data for earthquake engineering.
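As a hedged illustration of the file-integrity and metadata-capture problem described above (a sketch only, not the actual NEEShub implementation; the directory layout, experiment identifier, and metadata fields are assumptions), the following Python snippet checksums a batch of uploaded files and records a minimal manifest that could be re-verified later.

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def sha256sum(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest so file integrity can be re-verified after upload."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(upload_dir, experiment_id):
    """Walk an upload directory and collect per-file integrity metadata."""
    entries = []
    for root, _, files in os.walk(upload_dir):
        for name in files:
            path = os.path.join(root, name)
            entries.append({
                "file": os.path.relpath(path, upload_dir),
                "bytes": os.path.getsize(path),
                "sha256": sha256sum(path),
            })
    return {
        "experiment": experiment_id,  # hypothetical identifier, not a NEES convention
        "recorded": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }

if __name__ == "__main__":
    # Paths below are assumed for illustration.
    manifest = build_manifest("uploads/shake_table_run_01", "NEES-demo")
    print(json.dumps(manifest, indent=2))
```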

View [PDF]    Download [PDF]
John Orcutt, University of California at San Diego: Stream Processing of Multi-scale, Near-Real-Time Environmental Data
The US NSF's Ocean Observatories Initiative (OOI) is being supported to collect ocean data for a period of 25-30 years. The project began in October 2009 with a five-year construction phase to design, test, and build the initial infrastructure for a comprehensive observational system needed to support the collection, dissemination, and preservation of data, including data directly related to in situ ocean climate observations. The OOI Cyberinfrastructure is a departure from previous data efforts in the oceans. In particular, the data must be open and available in near real time, with latencies as low as seconds. Users of the data will be able to construct their own versions of the OOI observatory concentrating on the data streams of greatest meaning and, in turn, will be able to publish data on the OOI CI relevant to the individual's or team's interests. The OOI has constructed a Data Distribution Network (DDN) in collaboration with NOAA, which will provide data in the formats most typically used; the OOI is not creating yet another data standard, but adapting the available data to the oceanographic community's current and growing practices. The DDN is presently running on Amazon's Elastic Compute Cloud (EC2); it is possible that many future oceanographic computations, and particularly numerical models, will run on this and similar cloud systems to provide the elasticity needed to scale computing for dealing with events. Metadata associated with all data will allow the discovery of data and instruments of interest. In addition to the data resulting from the deployment of new platforms and sensors, the CI requirements include the development of interfaces with other programs and repositories, including NOAA and the WMO. The OOI Cyberinfrastructure has designed an architecture that minimizes a metric we refer to as Time to Science (TTS) to remove as many impediments as possible to ready data access. Release 1 of the software/middleware will be available in Summer 2011. The OOI CI's Integrated Ocean Network allows not only data return from sensors but also provides the connectivity to control individual platforms (e.g., a cabled network) and sensors (e.g., surface pH). The connections, while made over the network (cables and satellites), employ message passing for data and commands mediated by the Advanced Message Queuing Protocol (AMQP). The ocean network includes a 10 Gbps continental facility for network and data management. Connectivity to Europe through StarLight in Chicago and two ports in the western US to Asia can be activated in the future to support potential global collaboration in data, sensor, and modeling activities. The stream processing of the near-real-time data to detect events and assimilate data into climate models to create ensembles for statistical analysis presents special challenges to large-scale computing and networking.
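The AMQP-mediated message passing mentioned above can be sketched as follows. This is a minimal, hypothetical example using the open-source pika client against a generic broker; the broker host, exchange name, routing keys, and message schema are all assumptions, not the OOI CI's actual conventions.

```python
import json
import time

import pika  # AMQP 0-9-1 client library

# Broker address is a placeholder; a real deployment would use its own broker.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="broker.example.org"))
channel = connection.channel()

# A topic exchange lets subscribers select only the instrument streams they care about.
channel.exchange_declare(exchange="ooi.data", exchange_type="topic", durable=True)

reading = {
    "platform": "surface-mooring-01",  # hypothetical platform id
    "sensor": "ph",
    "value": 8.05,
    "timestamp": time.time(),
}

channel.basic_publish(
    exchange="ooi.data",
    routing_key="surface-mooring-01.ph",  # subscribers bind with patterns such as "*.ph"
    body=json.dumps(reading),
)
connection.close()
```

A consumer would bind its own queue to the same exchange with a routing-key pattern, which is one way the "construct your own version of the observatory" idea can be realized at the messaging layer.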

View [PDF]    Download [PDF]
Kirk Borne, George Mason University: Petascale Data Challenges and Design Decisions for the Large Synoptic Survey Telescope Science Data System
I will review the plans for the largest data-producing astronomy facility of the coming decade, the Large Synoptic Survey Telescope (LSST), with a primary focus on the LSST data challenges, the data management system (requirements, specifications, and architecture design), science data quality assessment challenges, science use cases enabled by the petascale data repository, science data analysis requirements, mining the 10-year data flow, and key technologies to enable all of this. LSST will be a breakthrough project for data-intensive astronomy, in which the data management effort (hardware, development time, and people) is comparable to that of the observatory's other major components, the telescope and the camera. Because data-intensive analysis will play a major role in the success of the project's science goals, the LSST program has established a dedicated research group focused on the analytics, informatics, and statistical challenges associated with petascale data. The activities of this research group will also be described.

View [PDF]     Download [PDF]
Art Wetzel, Pittsburgh Supercomputing Center: Connectomics: Challenges in Reconstructing Neural Circuitry from PBytes of Electron Microscopy Data
The newly developing field of connectomics is the study of the complete neural circuit connectivity of an organism. Accurate determination of detailed neural pathways to the synaptic level requires the nanoscale resolution provided by electron microscopy. This talk will describe challenges inherent in the capture and analysis of the huge EM datasets needed to cover the tissue volumes of localized circuits today and, in the near future, the entire brains of small organisms such as insects or larval fish.

View [PDF]    Download [PDF]

Data-Intensive Science, Approaches and Algorithms I

Homa Karimabadi, UCSD: Physics Mining: New Approach to Science Data Discovery in Petascale Systems
The amount of data collected and stored electronically is doubling every three years. Simulations running on petascale computers (e.g., NSF's Kraken, with roughly 100,000 cores) generate extremely large amounts of data and contribute to this data tsunami. As an example, one of our single runs can produce over 200 TB of data. Similarly, 90% of the data collected from various spacecraft missions in heliosciences remain unexplored. The recent Solar Dynamics Observatory is producing an unprecedented quantity of image and spectral data, over a terabyte a day.

In the face of this unprecedented growth in the amount of data, our capability to manipulate, explore, and understand large datasets is growing only slowly. In this presentation, we discuss our approach to effective analysis of large data sets from both simulations and spacecraft data. In particular, we are developing a first-of-its-kind toolkit called SciViz. SciViz is an open-source tool that brings together key innovations from three separate fields (scientific visualization, data mining, and computer vision) to offer an integrated solution for physics mining of the most complex data sets: multi-dimensional, multi-variate data. The power of this technique is illustrated through several examples.


View [PDF]    Download [PDF]    Download [PPT]
William Cohen, Carnegie Mellon University: Learning to Extract a Broad-Coverage Knowledge Base from the Web
Over the last few years, dramatic progress has been made in techniques to automatically discover semantic content from text. In my talk, I will survey recent work on web-based "open information extraction" methods: machine learning techniques that learn relationships and concepts given very small amounts of user input and very large amounts of unlabeled textual data. I will discuss the connections between these learning methods and the technical problem of clustering very large graphs, and describe our experience in using these methods in NELL, a "never-ending language learner" that incorporates several open information extraction techniques into a single system and has extracted a knowledge base containing hundreds of thousands of confident "beliefs".
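To illustrate the connection to large-graph computation mentioned above, here is a deliberately tiny, hypothetical label-propagation sketch over a noun-phrase/context graph. It is not NELL's actual algorithm or data; the edges and seed labels are made up, and the point is only the pattern of spreading a few seed labels through a sparse graph.

```python
from collections import defaultdict

# Tiny co-occurrence graph: noun phrases linked to the text contexts they appear in.
edges = [
    ("Pittsburgh", "cities such as X"),
    ("Boston", "cities such as X"),
    ("Pittsburgh", "mayor of X"),
    ("Steelers", "X won the game"),
    ("Penguins", "X won the game"),
]
seeds = {"Pittsburgh": "city", "Steelers": "sports_team"}  # small amount of user input

neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

labels = dict(seeds)
for _ in range(3):  # a few synchronous propagation sweeps
    updates = {}
    for node in neighbors:
        if node in seeds:
            continue
        votes = defaultdict(int)
        for nb in neighbors[node]:
            if nb in labels:
                votes[labels[nb]] += 1
        if votes:
            updates[node] = max(votes, key=votes.get)
    labels.update(updates)

print(labels)  # contexts and "Boston"/"Penguins" inherit labels from their neighbors
```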

View [PDF]    Download [PDF]
Michael Schatz, Cold Spring Harbor Laboratory: Cloud Computing and the DNA Data Race
In the race between DNA sequencing throughput and computer speed, sequencing is winning by a mile. Sequencing throughput is currently around 200 to 300 billion bases per run on a single sequencing machine and is improving at a rate of about fivefold per year. In comparison, computer performance generally follows "Moore's Law", doubling only every 18 or 24 months. As the gap in performance widens, the question of how to design higher-throughput analysis pipelines becomes crucial. One option is to enhance and refine the algorithms to make better use of a fixed amount of computing power. Unfortunately, algorithmic breakthroughs of this kind, like scientific breakthroughs, are difficult to plan or foresee. The most practical option is to develop methods that make better use of multiple computers and processors in parallel. This presentation will describe some of my recent work using the distributed programming environment Hadoop/MapReduce in conjunction with cloud computing to dramatically accelerate several important computations in genomics, including short read mapping and genotyping, sequencing error correction, and de novo assembly of large genomes.
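As a hedged illustration of the MapReduce pattern the abstract describes (a sketch only, not the speaker's actual pipelines), the mapper/reducer pair below counts k-mers in sequencing reads in the style of Hadoop Streaming, where each stage reads from standard input and writes tab-separated key/value pairs. The script name, k-mer length, and input format are assumptions.

```python
import sys

K = 21  # k-mer length, chosen here only for illustration

def mapper():
    """Emit (k-mer, 1) for every k-mer in every read on stdin (one read per line)."""
    for line in sys.stdin:
        read = line.strip().upper()
        for i in range(len(read) - K + 1):
            print(f"{read[i:i + K]}\t1")

def reducer():
    """Sum counts per k-mer; Hadoop delivers keys to the reducer already sorted."""
    current, total = None, 0
    for line in sys.stdin:
        kmer, count = line.rstrip("\n").split("\t")
        if kmer != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = kmer, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Locally, the pair can be tested with a pipeline such as `cat reads.txt | python kmer_count.py map | sort | python kmer_count.py reduce` (filenames are hypothetical); on a cluster, Hadoop Streaming handles the shuffle, sort, and distribution across machines.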

View [PDF]    Download [PDF]
Chris Hill, Massachusetts Institute of Technology: Infrastructure for Extreme Data Intensive Computing
In the Petascale Arctic, Atlantic and Antarctic Virtual Experiment (PAVE) we are developing large-scale models of ocean circulation at resolutions of close to 1 km. At these resolutions, qualitative differences in the behavior of an ocean model are striking, with models reproducing known and observed fluid dynamics at significantly improved fidelity. These computations are tools that can be used to test hypotheses and inform analysis of term balances among processes involved in key physical, chemical, and biological cycles of the Earth. One target platform for the computations is the NSF Blue Waters system. Making fully productive use of these simulations will involve handling petabytes of digital information. This has led us to begin collaborations with other researchers with an interest in extreme data-intensive computation.

My talk will examine three motivating applications (including the PAVE project) representing diverse disciplines. These applications are proxies for many other petascale problems with similar demands. The talk will highlight activities we are beginning, aimed at developing a common core petascale data solution. The approach revisits a role for database-inspired technology in HPC. The database concepts we are examining draw on lessons from past efforts and from recent developments around "shared nothing" architectures such as GoogleFS and Hadoop. A database system called SciDB, being developed by MIT and Brown researchers and their colleagues, will be used to illustrate a potentially promising path forward for next-generation extreme data management needs.
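The "shared nothing" idea referenced above can be sketched as a partition-then-combine computation. The example below is a generic Python illustration (not SciDB, GoogleFS, or Hadoop code): each worker computes aggregates over its own chunk of an array, and only small partial results cross process boundaries.

```python
from multiprocessing import Pool

import numpy as np

def local_stats(chunk):
    """Each worker computes aggregates over its own partition; no shared state."""
    return chunk.size, chunk.sum(), np.square(chunk).sum()

if __name__ == "__main__":
    # Stand-in for a large scientific array; in a true shared-nothing system each
    # node would read its partition from local storage rather than receive it here.
    data = np.random.default_rng(0).normal(size=10_000_000)
    chunks = np.array_split(data, 8)  # one partition per worker

    with Pool(processes=8) as pool:
        partials = pool.map(local_stats, chunks)

    # Combine the small per-partition results on the coordinator.
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean ** 2
    print(f"mean={mean:.4f} variance={variance:.4f}")
```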

View [PDF]     Download [PDF]

Yucheng Low, Carnegie Mellon University: GraphLab: A Distributed Framework for Machine Learning
Machine Learning (ML) techniques are indispensable in a wide range of fields. Unfortunately, the exponential increase in dataset sizes is rapidly extending the runtime of sequential algorithms and threatening to slow future progress in ML. However, designing and implementing efficient, provably correct distributed ML algorithms is often prohibitively challenging. By targeting common patterns in ML, we developed GraphLab, which improves upon abstractions like MapReduce by compactly expressing asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. We conducted a comprehensive evaluation of a distributed implementation of GraphLab on three state-of-the-art ML algorithms using real large-scale data and a 64-node EC2 cluster of 512 processors, demonstrating that GraphLab achieves orders-of-magnitude performance gains over Hadoop while performing comparably to or better than hand-tuned MPI implementations.
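As an illustration of the vertex-update abstraction the abstract describes (a sketch only; GraphLab's real API is a C++ framework and differs from this), the snippet below runs PageRank by repeatedly applying an update function to individual vertices and rescheduling only the neighbors whose inputs changed, which is the sparse-dependency pattern in miniature.

```python
from collections import defaultdict, deque

# Small directed graph, adjacency by vertex id (illustrative data only).
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
in_edges = defaultdict(list)
for src, dsts in out_edges.items():
    for dst in dsts:
        in_edges[dst].append(src)

DAMPING, TOL = 0.85, 1e-6
rank = {v: 1.0 for v in out_edges}

def update(v):
    """Vertex program: recompute v's rank from its in-neighbors; return the change."""
    new_rank = (1 - DAMPING) + DAMPING * sum(
        rank[u] / len(out_edges[u]) for u in in_edges[v]
    )
    delta = abs(new_rank - rank[v])
    rank[v] = new_rank
    return delta

# Dynamic scheduler: process vertices one at a time; when a vertex changes by more
# than the tolerance, put its out-neighbors back on the queue.
queue, queued = deque(out_edges), set(out_edges)
while queue:
    v = queue.popleft()
    queued.discard(v)
    if update(v) > TOL:
        for nb in out_edges[v]:
            if nb not in queued:
                queue.append(nb)
                queued.add(nb)

print(rank)
```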

View [PDF]    Download [PDF]
Jeff Gardner, University of Washington: Data Intensive Scalable Computing for Astronomy
Astrophysics is addressing many fundamental questions about the nature of the universe through a series of ambitious wide-field optical and infrared imaging surveys (e.g., studying the properties of dark matter and the nature of dark energy) as well as complementary petaflop-scale cosmological simulations. Our research focuses on exploring relational databases and emerging data-intensive frameworks like Hadoop for astrophysical datasets. In this talk we will explore the space of these different tools and discuss our attempts to apply them to various data challenges in astronomy.
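As a hedged illustration of the relational-database approach mentioned above (the table layout and values are assumptions for illustration, not an actual survey schema), the snippet below loads a toy object catalog into SQLite and runs a simple magnitude-limited box query of the kind used to select sources from an imaging survey.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        id INTEGER PRIMARY KEY,
        ra REAL,      -- right ascension, degrees
        dec REAL,     -- declination, degrees
        r_mag REAL    -- r-band magnitude
    )
""")
conn.executemany(
    "INSERT INTO objects (ra, dec, r_mag) VALUES (?, ?, ?)",
    [(150.10, 2.21, 19.4), (150.12, 2.25, 22.7),
     (150.30, 2.40, 18.1), (151.00, 3.00, 21.0)],
)

# Magnitude-limited selection inside a small region of sky.
rows = conn.execute(
    """
    SELECT id, ra, dec, r_mag
    FROM objects
    WHERE ra BETWEEN 150.0 AND 150.5
      AND dec BETWEEN 2.0 AND 2.5
      AND r_mag < 21.0
    ORDER BY r_mag
    """
).fetchall()
print(rows)
```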

View [PDF]    Download [PDF]
John Johnson, Pacific Northwest National Laboratory: Data-Intensive Cyber Data Analytics
View [PDF]    Download [PDF]

Data-Intensive Science, Approaches and Algorithms II

Joohyun Kim, Louisiana State University: DARE-NGS: Towards an Extensible and Scalable NGS Analytics on the TeraGrid/XD
Next-Generation Sequencing (NGS) machines produce unprecedented amounts of genomic data. In addition to the data-management challenges that arise from these volumes, there is the equally important requirement of analyzing the data effectively, and NGS data call for many qualitatively distinct analytical approaches.

Interestingly, the cyberinfrastructure considerations required to support a broad range of analytical approaches, at the scales required, have received less attention than the data-management problem and algorithmic advances. Not surprisingly, then, traditional production cyberinfrastructure such as the TeraGrid has not been used for such data-intensive analytics. There are multiple reasons, but two contributing factors are: (i) insufficient runtime environments (and abstractions) to support concurrent computational capabilities with large data sets for data analytics (beyond visualization) in an easy, scalable, and extensible fashion, and (ii) insufficient support for user-customizable data-intensive "workflows" that effectively hide the challenges of data movement and efficient data management whilst managing concurrent distributed (computational) resources.

Additionally, it is worth mentioning that the computational complexity of the analysis (e.g., mapping) depends, among other things, on the size and complexity of the reference genome and the data size of the short reads. Given that these can vary significantly, the computational requirements of NGS analytics also vary (even between data sets of similar size). Thus, any framework supporting NGS analytics must support an efficient, scalable, and extensible analytical approach.

To address these concerns, we have created the DARE-NGS Gateway (http://cyder.cct.lsu.edu/dare-ngs), which supports genome-wide analysis on the TeraGrid and other distributed cyberinfrastructure. DARE-NGS builds upon the Distributed Adaptive Runtime-Environment (DARE) framework, which supports a range of tasks with varying computing and data requirements over a wide range of high-performance and distributed infrastructure.

We will present results of using DARE-NGS on the TeraGrid and FutureGrid, with BFAST applications as a representative example of the typical analysis required for NGS data. We also share our experiences and the understanding gained by analyzing the full human genome (requiring data sets of upwards of 250 GB) on the NSF FutureGrid and on TeraGrid machines such as Ranger.

View [PDF]    Download [PDF]
Wayne Pfeiffer, San Diego Supercomputer Center: Compute- and Data-Intensive Analyses in Bioinformatics
The advent of high-throughput DNA sequencers has produced a flood of genomic data. How big is this flood, and what are the computational requirements for analyzing the data? These questions will be addressed for three common types of bioinformatics analyses: read mapping (including pairwise alignment), de novo assembly, and phylogenetic tree inference. In addition, use of the TeraGrid for phylogenetic analyses via the CIPRES gateway will be summarized.

View [PDF]    Download [PDF]
James Taylor, Emory University: Accessible, Transparent, and Reproducible Data Analysis with Galaxy
View [PDF]    Download [PDF]
Rupert Croft, Carnegie Mellon University: Visualization of Petascale Cosmological Simulations
View [PDF]    Download [PDF]

Panel: Systems for Data-Intensive Analysis

Sean Ahern, National Institute for Computational Sciences: The University of Tennessee Center for Remote Data Analysis and Visualization (RDAV)
View [PDF]    Download [PDF]
Jay Alameda, National Center for Supercomputing Applications: NCSA Resources for Data-Intensive Analysis
View [PDF]    Download [PDF]
Nick Nystrom, Pittsburgh Supercomputing Center: Blacklight: A Very Large Hardware-Coherent Shared Memory for Data Analytics and Data-Intensive Simulation
View [PDF]    Download [PDF]
Allan Snavely, San Diego Supercomputer Center: Gordon Applications
View [PDF]    Download [PDF]
Dan Stanzione, Texas Advanced Computing Center: The Longhorn System for Visualization and Data-Intensive Computing
View [PDF]    Download [PDF]