Complex Analytics for Big Data

February 1, 2013, 1 pm

300 S. Craig St., Pittsburgh, PA

PSC's new YarcData uRiKA system, nicknamed Sherlock, has been built with PSC-specific enhancements and is specially designed to detect relationships hidden in extremely large and complex data sets.

This symposium will explore the state of the art and the open challenges in Big Data computing, explain how uRiKA's advanced technologies address those challenges, and discuss applications that can benefit from this unique computational resource.

"Big Data" has the potential to address critical problems in a wide range of fields, like improving  health care, strengthening cybersecurity, and protecting the environment, but until uRiKA, the complex analyses needed to exploit Big Data's full potential have been impossible.

Agenda

The agenda is subject to change.

Complex Analytics for Big Data
February 1, 2013
1:00 Opening Remarks
Nick Nystrom, Director, Strategic Applications, Pittsburgh Supercomputing Center
1:05 uRiKA! Unlocking the Power of Big Data at PSC
Nick Nystrom, Director, Strategic Applications, Pittsburgh Supercomputing Center

Download Presentation: Nystrom_Sherlock 

Effectively analyzing “big data” is increasingly essential to problems of scientific and societal importance such as discovering how genes work, understanding the dynamics of social networks, and detecting breaches in computer security. What these problems and others like them have in common is that they involve complex networks of relationships, or “graphs.” For many cases of interest, the graphs are highly connected and cannot be partitioned, making their analysis on general-purpose computers extremely challenging. The Pittsburgh Supercomputing Center (PSC) continues its role of offering novel, enabling architectures for big data and HPC by introducing Sherlock, a YarcData uRiKA (“Universal RDF Integration Knowledge Appliance”) system with PSC-specific enhancements, for the U.S. research community. Sherlock, funded by NSF Strategic Technologies for Cyberinfrastructure (STCI), delivers unique capability for discovering new patterns and relationships in big data. Sherlock breaks the barrier to graph analytics through massive multithreading, a shared address space, and sophisticated memory optimizations. PSC customized Sherlock with additional nodes having commodity x86 processors to add valuable support for heterogeneous applications, for example, in genomics and astrophysics. Sherlock runs an enhanced suite of familiar semantic web software for easy access to powerful analytic functionality, together with common programming languages, and PSC’s Data Supercell provides complementary, high-performance access to large datasets for ongoing, collaborative analysis.
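Sherlock's semantic-web stack centers on RDF data queried with SPARQL. As a rough illustration of that style of analysis, the sketch below uses the open-source rdflib package as a stand-in (it is not uRiKA's own software), with invented "gene regulates gene" triples as example data:

    # Illustrative only: rdflib stands in for uRiKA's SPARQL stack, and the
    # "gene regulates gene" triples are invented example data.
    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.geneA, EX.regulates, EX.geneB))   # three toy RDF triples
    g.add((EX.geneB, EX.regulates, EX.geneC))
    g.add((EX.geneC, EX.regulates, EX.geneA))

    # SPARQL query for two-hop regulatory chains in the graph.
    query = """
        PREFIX ex: <http://example.org/>
        SELECT ?x ?z WHERE {
            ?x ex:regulates ?y .
            ?y ex:regulates ?z .
        }
    """
    for row in g.query(query):
        print(f"{row.x} indirectly regulates {row.z}")

Pattern queries like this one, run over graphs far too large and too densely connected to partition, are the workload Sherlock's shared-memory, massively multithreaded design targets.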
1:30 Data-Intensive Scalable Computing: Finding the Right System and Programming Models
Randal E. Bryant, University Professor of Computer Science and Dean, School of Computer Science, Carnegie Mellon University
With the massive amounts of data arising from such diverse sources as telescope imagery, medical records, online transaction records, and web pages, data-intensive computing has the potential to achieve major advances in science, health care, business, and information access. Data-intensive computing has very different properties from traditional high-performance computing applications, calling for new system designs and new programming models.

To meet their data-intensive computing needs, Internet companies have pioneered a class of systems sometimes referred to as "warehouse-scale computers," where the main focus is on low cost and extreme scalability. Programming models such as the Map/Reduce framework pioneered by Google enable programmers to express programs at a high level while letting the runtime software handle the complexities of coordination, resource management, and fault tolerance. Newer work, such as the GraphLab system developed at CMU, has shown great promise in solving large-scale machine-learning problems on this class of machines.
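To make the division of labor concrete, here is a minimal single-process sketch of the Map/Reduce idea: the programmer supplies only the map and reduce functions, while the framework (simulated here in a few lines of plain Python) handles grouping. Real systems such as Hadoop distribute these phases across a cluster and add fault tolerance.

    # Minimal single-process sketch of the Map/Reduce programming model.
    from collections import defaultdict

    def map_phase(doc_id, text):
        # Emit (word, 1) for every word in the document.
        for word in text.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        # Sum the partial counts for one word.
        yield (word, sum(counts))

    def run_mapreduce(docs, mapper, reducer):
        groups = defaultdict(list)
        for doc_id, text in docs.items():          # map + shuffle
            for key, value in mapper(doc_id, text):
                groups[key].append(value)
        results = {}
        for key, values in groups.items():         # reduce
            for out_key, out_value in reducer(key, values):
                results[out_key] = out_value
        return results

    docs = {1: "big data big graphs", 2: "graphs at scale"}
    print(run_mapreduce(docs, map_phase, reduce_phase))
    # {'big': 2, 'data': 1, 'graphs': 2, 'at': 1, 'scale': 1}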

The Sherlock system at the Pittsburgh Supercomputing Center follows a very different strategy from the prevailing warehouse-scale trend. It relies on more specialized hardware that provides direct support for sharing data between computing elements. The software environment supports a programming model that represents data as a set of graph-structured objects stored in a database, with a query language that can extract information from these graphs and create new ones. This talk will review these different approaches to data-intensive computing from both the systems' and the programmers' perspectives.

2:00 Graph Models and Efficient Exact Algorithms in Studying Cancer Signaling Pathways
Songjian Lu and Xinghua Lu, Department of Biomedical Informatics, University of Pittsburgh
Because graphs are well suited to representing data and knowledge in biology, graph models are widely used in bioinformatics and computational biology, including cancer research. Biological systems are complex, and the majority of problems, when cast as graph problems, belong to a family of computational problems referred to as "NP-hard" problems. For example, the problem of finding cancer pathways can be formulated as the k-path problem, which is NP-hard. Hence, designing efficient algorithms to solve this type of computational problem plays a key role in cancer research. When dealing with NP-hard problems, researchers usually resort to heuristic or greedy algorithms, which cannot guarantee the quality of their solutions; this can thwart efforts to understand the disease mechanisms of different cancers. In this presentation, we will discuss how to take advantage of the fact that many NP-hard problems involve parameters that take only small values in practical applications, and how to design parameterized algorithms that find exact optimal solutions for NP-hard problems in cancer research. We will discuss how to formulate the biological problem of reconstructing cancer pathways as a k-path or mutually exclusive set cover problem. We will then discuss the basic idea of parameterized algorithms and give an example of how parameterized techniques yield efficient algorithms for the k-path problem (one classic such technique is sketched below). Finally, we will discuss how to parallelize our algorithms and use supercomputers to speed up the computation.
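For readers unfamiliar with parameterized k-path algorithms, the following is a hedged sketch of color coding (Alon, Yuster, Zwick), the classic randomized technique for this problem; the speakers' own algorithms may differ. Vertices are randomly assigned one of k colors, a "colorful" path (all colors distinct) is found by dynamic programming over color subsets, and repeating the trial roughly e^k times finds an existing k-path with high probability. The graph and parameters below are toy assumptions.

    # Color coding for the k-path problem: small-graph illustration only.
    import random

    def has_k_path(adj, k, trials=200):
        vertices = list(adj)
        for _ in range(trials):
            color = {v: random.randrange(k) for v in vertices}
            # dp[v] = color subsets (bitmasks) of colorful paths ending at v
            dp = {v: {1 << color[v]} for v in vertices}
            for _ in range(k - 1):                 # extend paths one vertex
                new_dp = {v: set() for v in vertices}
                for u in vertices:
                    for v in adj[u]:
                        for mask in dp[u]:
                            if not mask & (1 << color[v]):
                                new_dp[v].add(mask | (1 << color[v]))
                dp = new_dp
            if any(dp[v] for v in vertices):
                return True    # found a colorful path on k vertices
        return False           # probably no simple path on k vertices

    # Toy 5-cycle; it contains simple paths on up to 5 vertices.
    cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    print(has_k_path(cycle, 4))    # True (with high probability)

Because every color appears at most once on a colorful path, the path is automatically simple, which is what makes the dynamic program over 2^k color subsets valid despite never tracking visited vertices explicitly.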
2:30 Break
2:45 Large Graph Mining – Patterns, Tools, and Cascade Analysis
Christos Faloutsos, Professor of Computer Science, Carnegie Mellon University

Download Presentation: Faloutsos_Sherlock

What do graphs look like? How do they evolve over time? How can we handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, along with some recent observations on real graphs (e.g., “eigenSpokes”). For tools, we present “OddBall” for discovering anomalies and patterns, as well as an overview of the PEGASUS system, which is designed for handling billion-node graphs. Finally, for cascades and propagation, we present results on epidemic thresholds as well as fast immunization algorithms.
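One well-known result from this line of work ties the epidemic threshold to the graph's spectrum: in an SIS-style model with infection rate beta and cure rate delta, an epidemic dies out when beta/delta falls below 1/lambda_1, the inverse of the largest eigenvalue of the adjacency matrix. The sketch below checks that condition numerically; the graph and rates are illustrative assumptions, not data from the talk.

    # Epidemic-threshold check on a toy graph; numbers are assumptions.
    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)   # toy undirected graph

    lambda_1 = max(np.linalg.eigvalsh(A))       # largest eigenvalue (symmetric A)

    beta, delta = 0.1, 0.5                      # assumed infection / cure rates
    if beta / delta < 1.0 / lambda_1:
        print(f"below threshold (1/lambda_1 = {1/lambda_1:.3f}): epidemic dies out")
    else:
        print(f"above threshold (1/lambda_1 = {1/lambda_1:.3f}): epidemic can persist")

The same eigenvalue view motivates fast immunization strategies: removing the nodes that most reduce lambda_1 raises the threshold the most.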
3:15 Building the Right Tools for Graph Analytics – How the uRiKA Appliance Enables Rapid, Iterative Discovery
Jim Harrell, Vice President, Engineering, YarcData
A high-level technical discussion of the uRiKA components that make the integrated appliance well suited to large-scale graph analytic tasks.
3:45 Discussion
4:00 Closing Comments

Registration is open

To register, complete the form at the upper right. All fields are required.

Questions

If you have any questions, please contact PSC at 412-268-4960.
