Data sets in diverse areas such as biology, engineering, or astronomy routinely reach terabyte scales and are expected to grow to petabytes within the next few years. Typical examples include time-series measurements and multidimensional volumetric data sets. Because of their rapidly increasing size, these data strain storage, transmission, and processing, and will thus become serious bottlenecks for analysis pipelines and collaborative data analysis. New approaches and frameworks are needed to enable timely and cost-effective analyses at next-generation data scales. The Active Data Processing and Transformation File System (ADAPT-FS) combines efficient storage of original data with on-the-fly processing to enable collaborative processing and sharing of scientific data sets at the largest scales with minimal data duplication and latency. Remote access will leverage PSC’s SLASH2 distributed file system and will be extended to provide visualization. ADAPT-FS processing will enable easy access to leading-edge scientific data sets for teaching and training at institutions of all sizes.
ADAPT-FS provides a FUSE-based file system interface, allowing seamless use from programs or web servers. The guiding principles behind ADAPT-FS are to: 1) eliminate unwanted data duplication; 2) minimize data transfer by working directly from original data when possible; 3) minimize delays between data capture and end-user analyses; and 4) provide a flexible workflow that incorporates active computation. This last principle enables improved parallel I/O performance by exploiting untapped CPU and GPU power on the nodes of large data servers. We will provide ADAPT-FS as open source, including a flexible and well-documented API and a plug-in framework that lets users insert their own GPU and CPU codes into the data pipeline to extend and customize its data processing capabilities. Thus, ADAPT-FS will provide a critical technology for the next generation of massive data-intensive processing, allowing efficient and rapid analysis of petabyte-size data sets with minimal data duplication in a collaborative multi-site framework.
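As a rough illustration of the plug-in idea, the sketch below shows how user-supplied transforms might be chained and applied on the fly to a requested range of the original data, so that no transformed copy is ever stored. It is not the ADAPT-FS API: the names `register`, `VirtualView`, `invert`, and `downsample2` are hypothetical, the data live in memory rather than behind a FUSE mount, and the read offsets are assumed to refer to the original, untransformed bytes.

```python
from typing import Callable, Dict, List

# Hypothetical plug-in signature: raw bytes in, transformed bytes out.
Plugin = Callable[[bytes], bytes]

# Hypothetical registry mapping plug-in names to functions.
_REGISTRY: Dict[str, Plugin] = {}


def register(name: str) -> Callable[[Plugin], Plugin]:
    """Decorator that adds a user-supplied transform to the registry."""
    def decorator(func: Plugin) -> Plugin:
        _REGISTRY[name] = func
        return func
    return decorator


@register("invert")
def invert(data: bytes) -> bytes:
    """Invert 8-bit intensities (illustrative CPU transform)."""
    return bytes(255 - b for b in data)


@register("downsample2")
def downsample2(data: bytes) -> bytes:
    """Keep every second sample (illustrative CPU transform)."""
    return data[::2]


class VirtualView:
    """Read-only view that transforms original data at read time.

    Only the requested range of the original is touched, and the
    transformed result is never written back or cached.
    """

    def __init__(self, original: bytes, pipeline: List[str]) -> None:
        self._original = original
        self._pipeline = [_REGISTRY[name] for name in pipeline]

    def read(self, offset: int, size: int) -> bytes:
        # Offsets address the original data (an assumption of this sketch).
        chunk = self._original[offset:offset + size]
        for plugin in self._pipeline:
            chunk = plugin(chunk)
        return chunk


if __name__ == "__main__":
    original = bytes([0, 10, 20, 30, 40, 50])
    view = VirtualView(original, ["downsample2", "invert"])
    print(list(view.read(0, 4)))
```

In a FUSE-based design, a method like `read` would presumably sit behind the file system's read callback, so existing programs could open a virtual file and receive transformed data without knowing a pipeline is involved.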