Sherlock is a novel and powerful system for turning big data into meaningful information, particularly for problems involving large graphs or complex networks. Sherlock is also well-suited to irregular computations that would be latency-bound on conventional architectures, such as agent-based models and operations on large, sparse systems.
Funded through NSF’s Strategic Technologies for Cyberinfrastructure (STCI) program, Sherlock is an experimental system through which the research community can gain experience with its architecture while tackling exciting research challenges.
Sherlock is a YarcData uRiKA™ (Universal RDF Integration Knowledge Appliance) data appliance with PSC enhancements. It enables large-scale, rapid graph analytics through massive multithreading, a shared address space, sophisticated memory optimizations, a productive user environment, and support for heterogeneous applications.
Sherlock consists of both YarcData Graph Analytics Platform (formerly known as next-generation Cray XMT™) nodes and Cray XT5 nodes with standard x86 processors.
Sherlock contains 32 YarcData Graph Analytics Platform nodes, each containing 2 Threadstorm 4.0 (TS4) processors, a SeaStar 2 (SS2) interconnect ASIC, and 32 GB of RAM. Aggregate shared memory is 1 TB, which can accommodate a graph of approximately 10 billion edges. The TS4 processors and SS2 interconnect contain complementary hardware advances specifically for working with graph data. These include support for 128 hardware threads per processor (to mask latency), extended memory semantics, a system-wide shared address space, and sophisticated optimizations to prevent “hotspots” involving contention for data.
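The aggregate figures above follow directly from the per-node numbers. The short calculation below reproduces them, assuming binary units (1 TB = 1024 GB). The per-edge figure is only the memory budget implied by the quoted totals, not a statement about how the system actually stores a graph.

```python
# Figures taken from the hardware description above.
NODES = 32
PROCESSORS_PER_NODE = 2
THREADS_PER_PROCESSOR = 128
RAM_PER_NODE_GB = 32
GRAPH_EDGES = 10_000_000_000  # "approximately 10 billion edges"

# Assumption: binary units (1 TB = 1024 GB, 1 GB = 2**30 bytes).
total_memory_gb = NODES * RAM_PER_NODE_GB
total_memory_tb = total_memory_gb / 1024
hardware_threads = NODES * PROCESSORS_PER_NODE * THREADS_PER_PROCESSOR
bytes_per_edge = total_memory_gb * 2**30 / GRAPH_EDGES

print(f"aggregate memory: {total_memory_gb} GB = {total_memory_tb:.0f} TB")
print(f"hardware threads: {hardware_threads}")
print(f"implied memory budget per edge: about {bytes_per_edge:.0f} bytes")
```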
PSC has customized Sherlock with additional Cray XT5 nodes having standard x86 processors (AMD Opteron) to add valuable support for heterogeneous applications that use the Threadstorm nodes as graph accelerators. This heterogeneous capability will enable an even broader class of applications, including genomics, astrophysics, and structural analyses of complex networks.
Other x86 nodes serve login, filesystem, database, and system management functions.
PSC’s Data Supercell provides complementary, high-performance access to large datasets for ongoing, collaborative analysis.
There are two main ways of using Sherlock: using the uRiKA™ data appliance for performing complex graph analytics, or running other applications that benefit from the Graph Analytics Platform architecture.
uRiKA™ implements a sophisticated and optimized set of software based on familiar semantic web standards such as RDF and SPARQL. Graph Analytic Application Services (GAAS), built on the WSO2 Web Services Framework, provide data import, communication in a variety of message formats, the Jena framework for query algebra and validity testing, support for SPARQL (a query language; see below), and the main user interface. The Graph Analytics Database (GAD) runs on the Graph Analytics Platform nodes.
RDF, the Resource Description Framework, is a general and expressive way to represent graph data. In RDF, relations are described by “triples”: each triple consists of a subject, a predicate, and an object, and expresses a relationship (the predicate) from one node (the subject) to another node (the object). uRiKA™ extends RDF triples to “quads”, in which the fourth field is a graph identifier, allowing for analyses that span multiple graphs.
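As a minimal sketch of the triple and quad shapes described above, the following Python uses plain named tuples. The `Triple` and `Quad` classes, the `http://example.org/` namespace, and all node names are hypothetical, chosen only for illustration. They do not reflect how uRiKA™ represents data internally.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # the node the relationship starts from
    predicate: str  # the relationship itself
    object: str     # the node (or literal) the relationship points to

class Quad(NamedTuple):
    subject: str
    predicate: str
    object: str
    graph: str      # fourth field: identifies the graph the statement belongs to

EX = "http://example.org/"  # hypothetical namespace

triple = Triple(EX + "alice", EX + "knows", EX + "bob")

# The same kind of statement, now tagged with the graph it belongs to.
quads = [
    Quad(*triple, graph=EX + "graphs/social"),
    Quad(EX + "alice", EX + "worksAt", EX + "psc", graph=EX + "graphs/employment"),
]

# An analysis that spans multiple graphs: which graphs mention alice?
graphs_mentioning_alice = {q.graph for q in quads if q.subject == EX + "alice"}
print(sorted(graphs_mentioning_alice))
```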
In RDF, subjects and predicates are URIs, and objects can be URIs or literals. This has several powerful implications:
- Unlike relational databases, a schema does not have to be defined a priori. New relations can be added as they arise without having to rebuild the graph database. Predicates (relations) are also data.
- There is built-in support for sparsity. Relations that are defined for only certain nodes do not consume memory and time resources as they would in a relational representation.
- Use of URIs provides seamless linking of connected Web-based data.
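The points in the list above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: the `store` dictionary, the `add` helper, and the example URIs are hypothetical and are not part of uRiKA™ or any RDF library.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace

# Triples indexed by predicate. A relation occupies memory only for the
# nodes that actually have it, which is the sparsity point above.
store = defaultdict(set)

def add(subject, predicate, obj):
    """Record one triple; no schema has to exist beforehand."""
    store[predicate].add((subject, obj))

add(EX + "alice", EX + "knows", EX + "bob")
add(EX + "bob", EX + "knows", EX + "carol")

# A new relation appears later: it is simply added, nothing is rebuilt.
# The object here is a literal rather than a URI.
add(EX + "carol", EX + "age", "42")

# Predicates are also data: a predicate URI can itself be a subject.
add(EX + "knows", EX + "label", "is acquainted with")

total_triples = sum(len(pairs) for pairs in store.values())
print(total_triples, "triples across", len(store), "predicates")
```

Only one node carries the hypothetical `age` relation, so only one entry is stored for it. A relational table with an `age` column would instead hold a slot for that column in every row.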
SPARQL, the SPARQL Protocol and RDF Query Language, contains capabilities for querying required and optional graph patterns and supports extensible value testing and constraining queries by source RDF graph. Query results can be RDF graphs or simpler result sets.
For heavily used applications, especially those whose end users prefer not to write SPARQL queries, user interface frameworks such as Cytoscape can be used to deploy convenient, browser-based graphical user interfaces.
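To make the distinction between required and optional graph patterns concrete, here is a small pure-Python sketch that mimics a query with one required pattern and one OPTIONAL pattern. It is a toy matcher over an in-memory set, not a SPARQL engine. The data, the `match` and `query` helpers, and the namespace are all hypothetical. The SPARQL text in the comment shows the query being imitated.

```python
EX = "http://example.org/"  # hypothetical namespace

# The query being imitated:
#   PREFIX ex: <http://example.org/>
#   SELECT ?friend ?email WHERE {
#     ex:alice ex:knows ?friend .
#     OPTIONAL { ?friend ex:email ?email }
#   }

triples = {
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "alice", EX + "knows", EX + "carol"),
    (EX + "bob", EX + "email", "bob@example.org"),
}

def match(pattern, binding):
    """Yield every extension of `binding` under which `pattern` equals a stored triple."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # constant term does not match
        else:
            yield new

def query(required, optional):
    results = []
    for row in match(required, {}):
        extended = list(match(optional, row))
        # OPTIONAL semantics: keep the row even if the optional pattern finds nothing.
        results.extend(extended or [row])
    return results

rows = query(
    required=(EX + "alice", EX + "knows", "?friend"),
    optional=("?friend", EX + "email", "?email"),
)
for row in sorted(rows, key=lambda r: r["?friend"]):
    print(row["?friend"], row.get("?email"))
```

Both friends appear in the result set. The row for `carol` simply has no `?email` binding, because the optional pattern found no match for her.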
Software for Using Graph Analytics Platform Nodes
Other types of applications can be written to run entirely on the Graph Analytics Platform nodes or to run in a heterogeneous fashion between the x86 XT5 nodes and the Graph Analytics Platform nodes. The latter treats the Graph Analytics Platform nodes as a “graph accelerator” and may be beneficial for porting applications having well-defined graph kernels.
The Graph Analytics Platform nodes are programmed using threaded C or C++ with library calls for atomic memory access, synchronization, futures, etc. Complete details are available in the Cray XMT Programming Model and Cray XMT Programming Environment User’s Guide. Also available are the Cray XMT Performance Tools and the Cray XMT Debugger.
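The ideas named above (many threads, atomic updates to shared memory, and futures) can be illustrated by analogy. The sketch below deliberately does not use the Cray XMT library or C/C++. It uses only the Python standard library, with a lock standing in for an atomic memory operation, and a made-up toy edge list. It shows the shape of a threaded graph kernel, not the actual programming interface.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Made-up toy graph: 4 vertices, 6 undirected edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (1, 3)]
degree = [0] * 4          # shared array, visible to every thread
lock = threading.Lock()   # stand-in for a hardware atomic update

def count_degrees(chunk):
    """Add each edge's contribution to the shared degree array."""
    for u, v in chunk:
        with lock:  # makes the read-modify-write on shared data safe
            degree[u] += 1
            degree[v] += 1
    return len(chunk)

# Split the edge list into three interleaved chunks, one per worker thread.
chunks = [edges[i::3] for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Each submit() returns a future; result() waits for that task to finish.
    futures = [pool.submit(count_degrees, chunk) for chunk in chunks]
    processed = sum(f.result() for f in futures)

print("edges processed:", processed)
print("vertex degrees:", degree)
```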
The XT5 nodes, having x86 processors and running SUSE Linux, support a full variety of programming languages including C, C++, Java, Fortran, and scripting languages.
Sherlock runs the Cray Linux Environment (CLE), a distributed operating system with different components for compute and service nodes. The Graph Analytics Platform nodes run the Cray Multi-Threaded Kernel (MTK), and the XT5 and service nodes run SUSE Linux.
Access to Sherlock