Active Projects on Neocortex
Aditya Balu, Iowa State University
Neural PDE Solvers on regular and irregular domains
Neural network-based approaches for solving partial differential equations (PDEs) have recently received special attention. While most of these approaches are point-based (implicit neural representation), very few deal with parametric PDEs (i.e., a diverse set of boundary/initial conditions or a family of PDEs). Further, a large majority of neural PDE solvers apply only to rectilinear domains and do not systematically address the imposition of Dirichlet/Neumann boundary conditions over irregular domain boundaries. Over the past couple of years, we have developed a series of neural PDE solvers that use finite element basis functions to obtain the loss function for a particular PDE and can apply boundary conditions naturally. We extended this to more complex and larger domains (megavoxel domains). Further, we recently proposed to neurally solve partial differential equations over domains with irregularly shaped (non-rectilinear) geometric boundaries. The key technical ingredient in realizing this model is a novel approach for identifying the interior and exterior of the computational grid in a differentiable manner. In this proposal, we seek to extend these neural PDE solvers to gigavoxel domains (which cannot be achieved on modern CPU and GPU HPC clusters). We hope to use the Cerebras system to scale this to large-scale problems.
Mark Bower, Yale University
Apply Machine Learning to Predict Antibody Drug Developability
Recently, we have developed the Network Parameter Outlier (NPO) algorithm, which uses graphical clustering methods to cluster biological event data (e.g., action potentials, sharp-wave ripples) in O(n) time, improving on current methods that require O(n log n) time. Biologically relevant events are detected and stored as a graph with connections weighted by corresponding correlation coefficients. A graphical clustering technique (Louvain) partitions the graph into clusters. This process is repeated in a moving-window fashion, with new vertices being added and old vertices being dropped. Cluster labels are passed to succeeding windows, generating a consistent clustering across an arbitrarily large data set. Because the windowed-data graph size remains constant, processing time per window is bounded, which allows the total clustering time to increase linearly with increasing data size. The optimal parameters regarding both computational time and overall performance, however, are unknown due to memory limitations of conventional CPUs. We propose to use the large-memory model of Neocortex to allow graph sizes that are much larger than can be supported by current computing hardware. The algorithm has been wrapped in a Singularity container linked to a MySQL database (also running in a Singularity container) to allow portable computation.
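The moving-window scheme described above can be illustrated with a short Python sketch. It is not the NPO implementation: the partitioner here is a simple connected-components pass standing in for Louvain, edges are decided by a boolean `similar` predicate rather than correlation-weighted connections, and the names `windowed_clustering`, `connected_components`, `window_size`, and `step` are hypothetical. It only shows how labels can be carried from one window to the next while the per-window graph stays a fixed size.

```python
from collections import Counter


def connected_components(nodes, edges):
    """Stand-in partitioner (NOT Louvain): group nodes joined by any edge."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            n = stack.pop()
            members.append(n)
            for m in adjacency[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        clusters.append(members)
    return clusters


def windowed_clustering(events, similar, window_size, step):
    """Cluster a stream window by window, carrying labels forward.

    Assumes step <= window_size so that consecutive windows overlap.
    Each window holds at most window_size events, so the work per window
    is bounded and the total work grows linearly with len(events).
    """
    labels = {}
    next_label = 0
    for start in range(0, len(events), step):
        window = list(range(start, min(start + window_size, len(events))))
        edges = [
            (a, b)
            for pos, a in enumerate(window)
            for b in window[pos + 1:]
            if similar(events[a], events[b])
        ]
        for members in connected_components(window, edges):
            # Reuse the most common label already assigned in earlier windows.
            known = Counter(labels[m] for m in members if m in labels)
            if known:
                label = known.most_common(1)[0][0]
            else:
                label = next_label
                next_label += 1
            for m in members:
                labels.setdefault(m, label)
    return [labels[i] for i in range(len(events))]


# Toy stream: two kinds of events with amplitudes near 1.0 and near 5.0.
events = [1.0, 1.1, 0.9, 5.0, 5.2, 1.05, 5.1, 4.9]
result = windowed_clustering(
    events, similar=lambda a, b: abs(a - b) < 0.5, window_size=4, step=2
)
```

Because each new window overlaps the previous one, events that were already labeled act as anchors, so a cluster keeps the same label for as long as it stays connected to at least one previously seen member.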
Gert Cauwenberghs, University of California San Diego
Neocortex System for simulating energy-efficient large-scale spiking neural networks
The goal of this proposal is to obtain access to the remarkable Cerebras CS-2 and the HPE Superdome servers on the Pittsburgh Supercomputing Center’s Neocortex system. The proposed hardware will be used for cutting-edge AI research, specifically for simulating energy-efficient large-scale spiking neural networks. There has been a paradigm shift in recent years from conventional artificial neural networks (ANNs) to spiking neural networks (SNNs) due to their ability to simulate more complex functionalities and work more efficiently with spatiotemporal data. However, simulating large-scale SNNs with millions of neurons and billions of synapses is a computationally intensive task that requires specialized hardware. The new Cerebras CS-2 system can revolutionize the simulation of large-scale SNNs for research applications in different fields, including healthcare, robotics, and neuroscience.
Jee Choi, University of Oregon
High-Performance Tensor Decomposition on Massively Parallel Data Flow Architecture
Many critical applications, such as data mining, social network analytics, cybersecurity, and healthcare, generate massive amounts of multidimensional data as sparse tensors that can be analyzed quickly and efficiently using tensor decomposition (TD). TD algorithms for high-dimensional sparse data are challenging to execute on emerging parallel architectures due to their low arithmetic intensity, irregular memory access, workload imbalance, and synchronization overhead. In TD algorithms, each non-zero element is associated with N indices in an N-dimensional space, which are used to retrieve rows from source matrices that are used to calculate an update for a destination matrix row. This type of processing is potentially well-suited for data flow architectures, where each non-zero element is passed through a series of data accesses and floating-point operations, and the final result associated with the non-zero is accumulated in a serialized/atomic manner. This is particularly true for streaming data, which flows in continuously over time. However, due to the lack of commercially available data flow processors, only a few studies have been done on FPGAs, with limited performance improvements. The CS-2 system offers large compute performance, coupled with extremely high memory and interconnect bandwidth, allowing our algorithm to achieve an unprecedented level of performance, while simultaneously offering interesting performance challenges associated with its unique data flow architecture. We propose to develop and implement novel parallel algorithms for various TD methods and use them to analyze both static and streaming data.
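The per-non-zero update described above (fetch rows of the source matrices by index, combine them, accumulate into a destination row) matches the kernel commonly called MTTKRP in CP-style decompositions. The following Python sketch shows it for a 3-way tensor in coordinate format. The function name `mttkrp_mode0` and the toy data are assumptions made for illustration; the sketch is sequential and says nothing about how the project maps the computation onto the CS-2.

```python
def mttkrp_mode0(nonzeros, B, C, num_rows):
    """Mode-0 MTTKRP for a sparse 3-way tensor stored as (i, j, k, value).

    For every non-zero, rows B[j] and C[k] are fetched, multiplied
    elementwise, scaled by the value, and accumulated into row i of A.
    """
    rank = len(B[0])
    A = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, value in nonzeros:
        row_b, row_c = B[j], C[k]
        dest = A[i]
        for r in range(rank):
            # Accumulation into a shared destination row: on parallel
            # hardware this is the step that needs serialization or atomics.
            dest[r] += value * row_b[r] * row_c[r]
    return A


# Toy 2 x 2 x 2 tensor with three non-zeros and rank-2 factor matrices.
nonzeros = [(0, 0, 1, 2.0), (1, 1, 0, 3.0), (0, 1, 1, 1.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[1.0, 1.0], [2.0, 0.5]]
A = mttkrp_mode0(nonzeros, B, C, num_rows=2)
```

Each non-zero triggers two row reads and only a few multiply-adds per rank column, which illustrates the low arithmetic intensity and irregular memory access mentioned in the abstract.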
Biprateep Dey, University of Pittsburgh
Making the Largest Map of Our Universe
Maps of our Universe help us study how the contents of the universe evolved with time, how galaxies formed and developed, and pinpoint the location of short-lived astrophysical events like supernovae or sources of gravitational waves. To make such maps, astronomers measure a quantity called redshift, which is a proxy for galaxy distance. However, high-precision measurements of redshifts can only be done for a tiny fraction of the galaxies for which we have images. Because of this limitation, it is necessary to infer redshift information from imaging data alone; the resulting measurements are called photometric redshifts. Precise estimates of photometric redshifts, along with associated uncertainties, are key to astronomical research for the coming decade, since a majority of the data sets that will be available to the astronomy community will come from large-scale imaging-only astronomical surveys and will depend heavily on photometric redshifts to measure galaxy distances. We propose to use one such data set of over 32 million galaxy images, the DESI Legacy Imaging Surveys, to pre-train a vision transformer-based self-supervised model that will learn a latent representation of the galaxy images. We then plan to fine-tune the model using a small number of known galaxy redshifts from the Dark Energy Spectroscopic Instrument (DESI) to finally infer redshifts for the galaxies for which we have only images. This map of our Universe will enable the myriad scientific cases previously mentioned.
Artur Dubrawski, Carnegie Mellon University
Automatic Text Summarization of BioSignals
Automated translation of data to textual narratives is commonly referred to in the literature as data-to-text generation. Here, data refers to an entity that is not exclusively linguistic (e.g., graphs, tables, knowledge bases). While most existing data-to-text approaches deal with tabular data and graphs, few approaches have focused on textual summaries of time-series data (Time-Series Captioning, or Time-Series Summarization). This has largely been due to the lack of paired time-series and text data. Biosignal time-series summarization has been studied in EEG-to-text and ECG-to-text. Prior work advocates for a human-in-the-loop system to generate the text summaries, stating that automated reports are too erroneous. While we note the merit in human-guided text generation, our proposed method requires far less effort from the associated medical professional. Another overarching limitation observed in all data-to-text generation methods is their evaluation metrics. Almost all methods report standard text generation metrics like BLEU and BERTScore that are simply inadequate for evaluating correctness in domains like healthcare. While many methods perform expert (human) evaluation, we note that there are no standard guidelines on how to evaluate medical text reports. Our proposed method addresses all these limitations.
Giulia Fanti, Carnegie Mellon University
Privacy-preserving synthetic data from federated clients
Synthetic data refers to randomized data that is drawn from the same (or a similar) distribution as an underlying ground truth dataset. In recent years, synthetic data has become remarkably high-quality, due to the growing successes of deep generative models. However, deep generative models for synthetic data are typically trained over centralized datasets. In practice, one important use case for synthetic data is to understand data patterns at distributed clients (e.g., in a federated learning (FL) setting). Our goal in this project is to design a method for generating synthetic data at a central server, from the data of many distributed, privacy-conscious clients. We propose to achieve this by first computing a privacy-preserving estimate of the mean and covariance of client embeddings, under a pre-trained embedder, as described in our upcoming paper at the TrustML Workshop at ICLR 2023. Then, we aim to generate synthetic data at the server side that matches the privately estimated embedding distribution of the client data, by decoding a private embedding into a full sentence. In doing so, we wish to explore whether federated client fine-tuning can be eliminated in some cases, in favor of fine-tuning on privately generated synthetic datasets. Our proposed pipeline will require fine-tuning (or possibly retraining) standard large language models, such as BERT or T5, on benchmark datasets.
Yunhe Feng, University of North Texas
Accessing Social Impacts of Emerging Deep Generative Models through Public Big Data
Deep generative models have become one of the most controversial artificial intelligence techniques in academia and industry in recent years due to their misuse and unexpected societal impacts. DeepFake, one of the deep generative models that can be used to create synthetic images and videos about people, has played a significant role in fake news, misinformation, blackmail, and pornography, causing a lot of information chaos on the Internet. Because of these unpredicted risks and potential biases, state-of-the-art text-to-image generative models (e.g., OpenAI’s DALL-E 2 and Google’s Imagen and Parti) have been announced, but they are not accessible to the public. However, independent researchers replicated these announced text-to-image generative models by adopting the published underlying training approaches. They also open-sourced their pre-trained models and provided online trial services, making them available to everyone. Although the models replicated by independent researchers are not as powerful as the authentic ones due to smaller training datasets and limited GPU resources, people enjoy playing with them and sharing their findings online. For example, in June 2022, around 50,000 images were generated per day by the online DALL-E Mini service (a replication of OpenAI’s DALL-E), and DALL-E Mini went viral on social media. In this proposal, we use social media as a lens to investigate and estimate the potential amplified biases, potential anti-society risks, and negative societal impacts and threats of the unreleased generative models by conducting a systematic analysis of the numerous public postings from multiple social media platforms, including Twitter and Reddit. More specifically, we will collect a sizeable multi-modal dataset from social networks containing the generated images and the text used for image generation.
Then, we will load and fine-tune pre-trained image object detection models and large language models (LLMs) to detect offensive elements in these generated images and study the relationship between the input text and the output images. Thus, we will gain a deep understanding of how ordinary people use these powerful generative models and what potential negative impacts the models may bring. We think the WSE storage and GPU computing resources will significantly facilitate our research, both for storing the collected data and for training deep learning models.
Franz Franchetti, Carnegie Mellon University
SPIRAL Code Generation for Neocortex
The SPIRAL system builds on 20 years of research by PIs Franchetti, Moura, Hoe, and Low, and 9 years of commercial R&D in the SPIRAL effort. SPIRAL has demonstrated across a wide range of hardware architectures that it is able to produce software that outperforms the best human programmers, and was designed to automatically deliver the performance of the best hand-tuned code on a new target platform on day one. SPIRAL has successfully targeted single core/multicore/manycore CPUs, GPUs, DSPs, the Cell BE, Xeon PHI, FPGAs, clusters, up to 128k cores on BlueGene/L/P/Q, the 20 K computer, and in pre-silicon settings (IBM Cell BE and BG/Q, and Intel AVX and Xeon PHI). Work in DARPA BRASS is enabling SPIRAL to run as a just-in-time compiler, and we are extending SPIRAL to support CNN/DNN kernels, graph and sparse matrix algorithms in a GraphBLAS related effort. In DARPA PERFECT we used SPIRAL to program and configure the HAMLeT memory side data reorganization unit we developed. Our work on MEALib in PERFECT demonstrates how standard C++ source code can be interpreted as an embedded DSL program, compiled with SPIRAL, and run on advanced memory-side accelerators without any change to the source code, leading to 150x performance gain and 8000x power efficiency gain. SPIRAL is also used as code generation backend in the DOE ExaScale effort FFTX and SpectralPack and the DARPA DPRIVE program to target homomorphic encryption. In this effort we plan to target the WSE system with SPIRAL to explore how to make our code generation technology compatible with the WSE system.
Siddhartha Ghosh, National Center for Atmospheric Research
Exploring Wafer Scale Engine on Fluid Dynamics Simulations for Atmospheric and Other Applications
Numerical weather prediction (NWP) models are often implemented using well-known finite difference or finite volume numerical schemes characterized as low arithmetic intensity algorithms. They are typically limited by memory bandwidth and latency and are parallelized on an x-y (lat-lon) grid, with a small number of vertical levels, on the order of 10. As such, they appear to be a great fit for the WSE architecture. The efforts supported by this allocation request would seek to assess the performance capacity of the WSE architecture for the stencil-based numerical approaches that underpin many NWP codes in existence today.
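As a small illustration of what a stencil-based scheme looks like, the Python sketch below applies a standard 5-point finite difference Laplacian to a 2D grid. It is a generic textbook stencil chosen for illustration, not code from any NWP model; the function name `laplacian_5pt` and the test field are assumptions.

```python
def laplacian_5pt(field, dx=1.0):
    """Apply the 5-point finite difference Laplacian to interior points.

    Boundary points are left at zero; a real model would apply its own
    boundary conditions or halo exchange here.
    """
    ny, nx = len(field), len(field[0])
    out = [[0.0] * nx for _ in range(ny)]
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            out[y][x] = (
                field[y - 1][x] + field[y + 1][x]
                + field[y][x - 1] + field[y][x + 1]
                - 4.0 * field[y][x]
            ) / (dx * dx)
    return out


# f(x, y) = x^2 + y^2 has Laplacian 4 everywhere, and the 5-point
# stencil reproduces this exactly for quadratic fields.
grid = [[float(x * x + y * y) for x in range(5)] for y in range(5)]
lap = laplacian_5pt(grid)
```

Each output point reads five neighboring values and performs only a handful of floating-point operations, which is why such kernels tend to be limited by memory bandwidth rather than by compute.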
Gabriel Gomes, Carnegie Mellon University
Multimodal learning of chemistry-aware molecular representations
Machine learning (ML) is already widely used in predicting molecular properties (ranging from energies to biological activities), designing new molecules, and analyzing entire reactions. Improving the performance of ML models in these tasks is crucial for many applications, including but not limited to drug design, material discovery, and automated synthesis planning. The most important factor in successfully developing such solutions is not the algorithm but rather the initial representation of the molecule. Many solutions, from SMILES strings to molecular graph representations, correctly represent the molecular structure but lack important chemical information, such as information about the electronic structure. This project aims to infuse electronic information into various types of molecular representations by constructing a joint feature space. In the first step, multiple autoregressive models will be pretrained to grasp general trends in molecular structure distributions from large chemical structure datasets. Then, structure encoders from these models will be finetuned to increase the mutual information and create a unified representation across all input data modalities. Finally, these representations will be tested on various downstream tasks. The results of this work will accelerate research in many critical areas, providing a way to infuse molecular electronic properties into multiple types of molecular representations. Requested resources will be primarily used to train Transformer-based language models, yielding good SMILES/SELFIES encoders for the subsequent joint feature space construction process. We also plan to try training graph neural networks on Neocortex infrastructure.
Berkley Gryder, Case Western Reserve University
Testing the Limits of Deep Learning for the Discovery of Covalent Disrupters of Protein-Protein Interactions
P300 is a histone acetyltransferase that acts on a wide range of proteins in the cell. Unlike its homolog CREBBP (CBP), aberrant p300 activity is often implicated as critical in driving dysregulation and disease. In the case of Alveolar Rhabdomyosarcoma (aRMS), recruitment of p300 by the fusion protein PAX3-FOXO1 (P3F) is one of the key causes of the transcriptional activation of core regulatory transcription factors driving disease progression. Prior work has shown the limited utility of catalytic inhibitors in slowing disease progression, as well as the strong impacts of p300 degradation in some diseases. However, very little previous work has explored the prospect of targeting the specific protein-protein interaction implicated in driving aRMS. This proposed work will use deep learning-enhanced docking techniques to probe the interaction space between p300 and P3F, combined with recent advances in machine learning-based small molecule discovery and refinement, to provide a library of promising molecules to disrupt this key interaction. The goals of this work are not only to identify a set of drugs that could be synthesized as first-in-class p300-P3F covalent disrupters, but also to build a cutting-edge pipeline for model-based drug discovery based on structural proteomics and novel chemoinformatics.
Thomas Hales, University of Pittsburgh
Formal Abstracts in Mathematics
A major goal of the international math community is to obtain tools for the automated processing and transformation of mathematical documents. These tools do not currently exist in a satisfactory form and extensive research is oriented towards improving on this. The Formal Abstracts Project aims to provide mathematicians with software tools for stating their research results in a human/machine readable format amenable to formal verification. In order to achieve this goal, the Formal Abstracts Project has recognized the need for (1) a comprehensive vocabulary of mathematics, in order to state research results, and for (2) improved automated reasoning tools to aid in processing and formally verifying those statements. Using the startup allocation #TG-DMS190028, we have (1) applied techniques from Natural Language Processing (NLP) to produce a dataset of labeled definitions extracted from the entire arXiv mathematics corpus, and (2) applied deep learning to accelerate SAT solvers, a type of automated reasoning tool. We consider these successes to be promising first steps towards our ultimate goal. We propose a strategy to continue work in this vein, including a machine learning methodology that uses well-established techniques in NLP. This methodology consists of a detailed strategy to obtain and process the relevant data, and has been thoroughly peer-reviewed by the Mathematical Knowledge Management (MKM) community. Our group has received a major grant from the Alfred P. Sloan Foundation (G-2018-10067) to develop software and services for transforming mathematical results into formally structured data that machines can read, process, search, check, compute with, and learn from as logical statements. This puts our group in an ideal position to create this much needed resource.
Chinmay Hegde, New York University
Towards Deep Vision-Language Models for Ecological Monitoring
A key upcoming challenge in preserving the Earth’s biosphere will be the continuous monitoring of a variety of animal and plant species present in given ecosystems. Artificial Intelligence (AI) can play an important role for solving this challenge, and very recent advances in vision-language models (VLMs) have the promise to build powerful human-interpretable AI models that can be deployed on a wide array of datasets. However, unlike standard deep neural networks, these models are extremely challenging to train on workstations or even small clusters. This Neocortex project will serve two purposes. First, it will help my lab build the first open-source VLMs for image classification and object detection for ecological monitoring. These will be trained on large-scale image datasets (such as iNaturalist) and will specifically be fine-tuned for robust species tracking and monitoring. Second, all software developed within this project will be open-sourced and will serve as stepping stones in the intersection of machine learning and computational sustainability.
Bin Hu, Los Alamos National Laboratory
Explain SARS-CoV-2 Spike Protein Evolution using AI
Viral pathogens target ‘receptor’ proteins on the surface of host cells to initiate infection. Recognition of the receptor is coordinated by a viral surface protein, and the strength of binding, determined by the biochemical properties of the amino acid sequences of the viral and host proteins, often dictates the course of disease. Because of the impact on viral fitness during immune host responses caused by viral infection, there is constant evolutionary selective pressure exerted on these viral surface proteins, where mutations that provide a fitness advantage to the virus will outcompete and spread more rapidly than other variants. Because of their accessibility, viral surface proteins are also common targets for vaccines and therapeutic antibodies. Mutations in viral surface proteins can perturb antibody binding and lead to increased infection or even escape from vaccine and therapeutic regimes. Evidence points to this being the case with some of the more recent lineages of SARS-CoV-2 that continue to spread within the US. We have developed a machine learning (ML) approach to study deep mutational data and resulting phenotypes of receptor binding domain variants of the SARS-CoV-2 Spike protein. This ML model can accurately predict the expression level of viral proteins, including those that display combinatorial mutations, as well as predict their affinity to the human ACE2 receptor. We plan to further develop this model and test several alternative model architectures using natural language processing and graph models to increase the explainability of the model. The ultimate goal of this work is to develop explainable AI methodology for studying viral evolution and associated biothreats.
Robust Fault Detection of Cooling Systems using Multimodal Fusion
Ever-increasing vehicle electrification has led to critical challenges in electronics cooling. High-power pulsed loads may cause faults in the cooling systems (e.g., boiling crisis) that may eventually lead to overheating and device failures. Due to the stochasticity of the cooling process, traditional physics-based thermal models are not capable of handling transient heat loads. Deep learning models have been developed for fault detection during two-phase cooling based on single-channel signals but suffer from low generalizability and interpretability. Addressing this issue requires creative and novel data analytic approaches involving theoretical mathematics. A recent subject that provides a promising approach is topological data analysis (TDA) and its principal tool, persistent homology (PH). The proposed project seeks to develop an interpretable fusion model for two-phase cooling fault detection that leverages multimodal sensor signals from cooling systems (e.g., temperature, pressure, sound, and images), the pre- and post-processing power and internal DL modeling capabilities of TDA and PH, and attention-based interpretation to improve model accuracy, reliability, and interpretability. Multimodal signals from heterogeneous sources will be collected to create a database for two-phase cooling data. A multimodal fusion network will be developed and trained using the database, with integrated TDA/PH capabilities for data compression and feature engineering, and the interpretability of the network will be examined through attention map-based analysis.
Efficient Optimization of Docking Configurations using Sparse Convolutional Neural Networks towards Automating Ultra-Large-Scale Docking Virtual Screens
Large-scale virtual screening campaigns are on the frontline of modern drug discovery. They allow quick in silico selection of the best candidate drug molecules based on the estimated strength of their interaction with a target protein, thus saving time and costs on experimental testing. Screening billions of compounds requires fast evaluation of protein-ligand binding affinity. For example, the DOCK program’s average speed is about 1 compound/sec/core. Such a high speed is achieved by pre-computing the interaction grids in the binding pocket of the protein, which are later used to estimate binding affinities. In order to initiate a large-scale docking campaign, a researcher needs to generate grids that (1) produce correct binding conformations for known ligands and (2) predict higher scores for high-affinity molecules (“actives”) compared to any other randomly chosen compound (“decoys”). Optimization of grids typically requires several weeks of skilled labor by a trained computational chemist, involving the informed variation of several parameters using heuristics, trial-and-error, and iteration. This process can be simplified and accelerated by building a sparse convolutional neural network capable of predicting the optimal parameters for the grid generation process based on the structure of a receptor-ligand complex.
Bikash Kanungo, University of Michigan
A Data Driven Approach to Improved Exchange-Correlation Functionals in DFT
Wavefunction theory (WFT) methods and density functional theory (DFT) constitute the two most widely used ab initio strategies for chemical and materials simulations. The WFT methods, such as configuration interaction (CI), can be tuned to arbitrary accuracy, but scale poorly with the number of electrons. DFT, on the other hand, is highly scalable and allows for a formally exact reduction of the many-electron problem to an effective single-electron problem, called the Kohn-Sham (KS) eigenvalue problem. However, this comes at the cost of making a crucial approximation for the exchange-correlation (XC) potential (or equivalently XC energy), which encapsulates the quantum many-electron interactions as a unique functional of the ground-state electronic density. Furthermore, traditional strategies for obtaining further accuracy in the KS formalism are ambiguous. The goal of this project is to alleviate the shortcomings of existing XC approximations by modeling the XC functional through machine learning, using data from WFT methods. This approach entails two distinct steps. First, we use accurate ground-state densities from WFT methods and perform an inverse DFT calculation to obtain the exact XC potential that yields the WFT density. Subsequently, we use the density and XC potential pairs from multiple atoms and molecules as training data to model the XC functional, that is, the functional dependence between the XC potential (or energy) and the density.
Tushar Krishna, Georgia Institute of Technology
Enabling Training and Inference of Large and Sparse Deep Learning Models
The end of Moore’s Law has created a need for domain-specific hardware accelerators for efficiently running High Performance Computing (HPC) workloads. The Neocortex platform provides access to the Cerebras wafer-scale engine, which is an accelerator that supports dataflow execution. The focus of this proposal is to develop and study efficient algorithms for key linear algebra kernels used in HPC workloads. Specifically, we will target Graph Neural Networks as the target workload, which includes dense and sparse matrix multiplications. The PI is also part of the Department of Energy ARIAA center and will leverage ongoing research in key tensor kernels from the center and identify acceleration mechanisms using Neocortex.
Lei Li, University of California Santa Barbara
Investigating Large Language Models for Protein Sequence Design
Protein engineering has become a crucial research area in the fields of chemistry and biology, with the primary objective being to design proteins with desired properties. One of the significant challenges in protein engineering is designing novel protein sequences with improved properties, such as increased structural stability or enzyme activity. Recently, large language models (LLMs) have become increasingly popular and have achieved state-of-the-art performance on many natural language processing tasks. Given their strong ability to model discrete text sequences, it is natural to ask whether LLMs can be directly adapted to protein sequence design. In this project, we aim to study this problem by guiding the GPT-3 language model to design novel protein sequences with improved properties.
Gongbo Liang, Texas A&M University-San Antonio
Mutation-Based Adversarial Attacks on Neural Text Detectors
Neural text detectors aim to decide the characteristics that distinguish neural (machine-generated) from human texts. To challenge such detectors, adversarial attacks can alter the statistical characteristics of the generated text, making the detection task more and more difficult. Inspired by the advances of mutation analysis in software development and testing, in this paper, we propose character- and word-based mutation operators for generating adversarial samples to attack state-of-the-art neural text detectors. This falls under white-box adversarial attacks. In such attacks, attackers have access to the original text and create mutation instances based on this original text. The ultimate goal is to confuse machine learning models and classifiers and decrease their prediction accuracy. We introduced a general framework for building the character- and word-level mutation operators. Several operators were demonstrated and evaluated using the text captions of the MS COCO2017 dataset and state-of-the-art neural language models. We believe the proposed mutation-based adversarial attacks can be used as a systematic way to evaluate the robustness of any language analysis models.
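To make the idea of character- and word-level mutation operators concrete, here is a minimal Python sketch. The two operators shown (swapping adjacent characters and substituting a word from a lookup table) are illustrative assumptions, not the operators defined by the project; the function names, the substitution table, and the example caption are all hypothetical.

```python
import random


def swap_adjacent_chars(text, rng):
    """Character-level operator: swap two adjacent characters in one word."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    pos = rng.randrange(len(w) - 1)
    words[i] = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
    return " ".join(words)


def replace_word(text, rng, substitutions):
    """Word-level operator: replace one word using a substitution table."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in substitutions]
    if not candidates:
        return text
    i = rng.choice(candidates)
    words[i] = substitutions[words[i].lower()]
    return " ".join(words)


# A seeded generator keeps the mutants reproducible between runs.
rng = random.Random(0)
caption = "a man riding a wave on a surfboard"
char_mutant = swap_adjacent_chars(caption, rng)
word_mutant = replace_word(caption, rng, {"man": "person", "wave": "swell"})
```

Each operator makes a single small edit, so a mutant stays close to the original text; applying many such operators across a caption set yields the adversarial samples that would then be passed to a detector.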
Hualou Liang, Drexel University
Parameter Efficient Fine-tuning for Large Language Models
Large language models have become the mainstay in natural language processing (NLP). These models entail high costs in terms of storage, memory, and computation time, which has motivated a large body of work on model compression to make them smaller and faster to use in real-world applications. One attractive solution to this problem is parameter-efficient fine-tuning, in which only a subset of the model parameters is fine-tuned. However, the question of which subset should be trained to achieve the best result remains unanswered. In this project, we have initially analyzed different components in the commonly used BERT model to see which one undergoes the most change after fine-tuning. We show that the output of LayerNorm changes the most among the model components when fine-tuned on the Microsoft Research Paraphrase Corpus (MRPC), one of the General Language Understanding Evaluation (GLUE) tasks. We further show that fine-tuning only this component can achieve performance competitive with full fine-tuning and other parameter-efficient fine-tuning approaches. Moreover, we use Fisher information to assess which parameters in LayerNorm are most important, in order to involve even fewer parameters in parameter-efficient fine-tuning. After getting the resources, we plan to test our hypothesis on the rest of the GLUE tasks before we apply the model to a real-world application on drug labeling data.
Tengyu Ma, Stanford University
Improving the Reasoning Capabilities of Large Language Models
Reasoning plays a central role in human cognition and human ability to solve problems, make decisions, and think critically. In the field of deep learning, large language models (LLMs) have recently obtained human-level performance on a variety of tasks including translation, question-answering, and summarization. Despite these successes, recent works such as GPT-4 have highlighted that LLMs still perform poorly in reasoning and mathematical problem solving. Current approaches adapt LLMs to math datasets using continual pretraining, gradient-based finetuning, or prompt engineering to elicit step-by-step reasoning. Though these methods have significantly improved LLM performance on tasks such as grade school and high school math, LLMs still make simple arithmetic errors and struggle to write complex proofs. We aim to improve the performance of LLMs on mathematical problem solving by (1) benchmarking LLMs on challenging reasoning tasks to better understand their failures, and by (2) introducing and evaluating novel finetuning-based methods for adapting LLMs to reasoning tasks. We anticipate that a key to our methods will be teaching LLMs how to plan, brainstorm, and backtrack in the process of writing mathematical arguments.
Ryan Mills, University of Michigan
Molecular Mutagenesis by Biological Graphs
Variation in gene expression is a complex process correlated with multiple molecular features, such as chromatin structure, epigenetic marks, gene-gene and protein-protein interactions, as well as post-transcriptional modifications. The assayable molecular contexts of a locus (such as methylation, histone modification, and chromosome conformation) are suggestive, not causal: no single feature is enough to reveal the entirety of genomic interactions. We are developing new methods representing genes as tissue-specific, multilayer heterogeneous networks based on regulatory and interaction assays. The graph structure is then trained to regress quantitative measurements of gene expression through an attention-based graph neural network. Such an approach allows us to mutagenize the features within the structure to query the relative impact of molecular changes on expression at tissue-specific and gene-specific resolution. Our goal is to understand and discover the patterns of molecular combinations that come together and affect the regulation of gene expression, and to determine whether the varying impact of molecular features surrounding a gene creates a type of regulatory language that describes gene function and genomic architecture. To do this, we require advanced GPUs for training large models, because genomics data is notoriously large and heterogeneous. The novel graph structures we are using regularly require more memory than multiple 32GB GPUs can provide.
Large-scale Pre-training for Natural Language to Code Generation
This project aims to create pre-trained models for natural language to code generation, the task of generating programs from natural language descriptions. This has the potential to make programming easier, and perhaps even allow for command and control of computers by non-programmers. Our research team has a large amount of experience in this area, but lacks resources to scale models to very large datasets such as training on the entirety of GitHub, which this proposal aims to address. We also plan to examine novel models for code generation based on non-parametric models, which look up related examples in a training corpus, which is important both for performance and interpretability. All models we develop will be made available open source for the community to use.
Dhabaleswar Panda, The Ohio State University
Exploring large-sample DL training on the CS-2
Efficient DL training on large sample sizes (e.g. long-sequence language modeling and large-image vision modeling) would provide wide-reaching improvements to cutting-edge applications in vision (medical, geospatial, and astronomical imaging) and language (document summarization, extractive Q&A, DNA sequence analysis). However, storing large samples in accelerator memory is challenging for GPU-based HPC systems because of the limited HBM capacity on GPUs. The configurable MemoryX solution provided for Cerebras systems, however, decouples accelerator memory from the accelerator. Therefore, we propose training large transformer-based and CNN-based models on large-sample datasets on the CS-2 architecture. We believe such a pairing will demonstrate the unique strengths of Cerebras hardware on a wide-reaching application domain.
Konasale Prasad, University of Pittsburgh
Discerning the complex pattern of brain networks related to psychotic disorders
Schizophrenia is a severe and chronic brain disorder associated with delusions, hallucinations, disorganized thoughts, and cognitive impairments. Available treatments are symptomatic and do not provide lasting recovery in the majority of persons with schizophrenia. Therefore, better elucidation of the neurobiology of this illness may help design new treatments. Studies to date clearly support that schizophrenia and related psychotic disorders are dysconnection syndromes. Hence, there is tremendous impetus to understand the nature and causes of dysconnectivity. However, current efforts are directed at examining networks built on one modality of data such as diffusion or functional connectivity. Our lab, the CONCEPT Lab, uses multimodal MRI data, e.g., structural, diffusion-weighted, and functional imaging data, to construct multiplex multilayer networks to delineate dysconnectivity related to schizophrenia at different levels. Using this approach, our goal is to understand schizophrenia networks on a nodal and global level as well as at the subject and the group level. Differences between the brain networks of persons with schizophrenia and healthy subjects have been extensively reported. Recent studies have been conducted using graph theoretical approaches. Although this approach provides important leads in elucidating network architecture and provides clues on potential functional impact, it does not provide means to examine the entire graph and characterize the differences. For example, it does not answer questions on whether there are differences in the patterns of network architecture, what features of nodes tend to affect the strength of connections, and whether the edge-centric pattern can help in classifying the networks. Graph Neural Networks (GNNs) are a way to identify graph differences and categorize networks. This set of machine learning approaches can help draw inferences on node-, edge-, and graph-level characteristics.
Using GNNs, graphs and nodes can be classified into groups, and we will be able to make edge, or link, predictions. This will allow us to understand which graph connectivity patterns are unique to patients versus controls on a global and nodal level. Further, we are also interested in finding out if there are subgroups within patients, since it is well known that schizophrenia is a heterogeneous disorder. These efforts will go a long way to help us understand the underlying pathology and how to better treat each network classification.
Bhiksha Raj, Carnegie Mellon University
Unsupervised labelling and learning from large audio datasets
Acoustic scene analysis is fast becoming a standard technology, expected in devices such as smartphones. But the latest solutions are limited by the availability of labelled training data. In this project, we propose to automatically label a very large quantity of audio data, to generate what would currently be the largest dataset for use by the research community. This will, however, require the development of algorithms that can iterate over such large amounts of data and iteratively refine their automatically generated labels. On traditional machine learning hardware such as Graphics Processing Units (GPUs), we expect our approach to take several weeks or more of compute time for a single pass through the dataset, leading to unreasonable latencies in research (and development) time. We believe that the Neocortex system can reduce the iteration time by orders of magnitude, and enable us to optimize our unsupervised inference algorithms and put out labelled data resources that will be of high value, not just to us, but to the research community at large.
Amar Ramapuram, Iowa State University
Physics informed distributed dynamic simulations of large-scale power grids for Optimization and Stability Analysis
The electric power system is the nation’s critical infrastructure. It consists of millions of individual devices that are sparsely interconnected through transmission and distribution lines. As current and power flow only along these lines, the properties of a device, such as its voltage and the current it draws, are directly influenced only by the behavior of the immediately neighboring devices and the properties of the interconnecting lines. Non-convex optimization of power grid operations can be reformulated into a primal-dual dynamical system that has distributed dynamics and whose equilibrium is the optimal solution. Similarly, stability analysis in power grids is performed by simulating the non-linear dynamical equations for various disturbances and observing the evolution of voltages and currents over long time scales. We can also calculate stability metrics using the voltage evolution during the simulations to understand how close the system is to a collapse. Conventional approaches for simulating power grid dynamics have taken advantage of the sparse nature of the power grid using techniques such as sparse solvers. However, these approaches have not utilized the distributed nature of the power grid dynamics, for lack of a computing architecture that can leverage this property. Neocortex fills this void by having sufficient memory to hold the entire state of the system while also having ultrafast communication between neighboring computing cores. We can recast the dynamic simulations into a form where the evolution of a state of a grid component (generator, motor, transmission line, etc.) is based on the various states of the neighboring components. These dynamics can be simulated in a near real-time fashion. We envision that there is likely to be a speedup of ~20x (based on the analysis of the NETL CFD solution using Neocortex) compared to existing approaches for systems with >10k elements.
Amar Ramapuram, Iowa State University
Monitoring and Mitigating Electric Grid Instability due to Renewables Using Neural Networks
The electric power grid is an incredible feat of engineering and is essentially a very large interconnected machine whose operation is dictated by complex non-linear equations over various time scales. The highly non-linear nature of the grid demands that it be operated within a very limited operating range in order to ensure robustness and stability to unexpected events. Conventionally, power system operators have used engineering judgment, offline planning, and operation simulations in conjunction with linear analysis to operate the grid, and have compensated for the non-linearity by setting conservative thresholds. More recently, the increasing adoption of variable, uncertain renewable power sources (wind power, solar, etc.) is challenging some of the core operational assumptions made during offline analysis. In this project, we will address the problem of monitoring and mitigating grid instabilities by leveraging the power of machine learning to learn a function that can predict the margin to instability given load demand and generation injection. This will enable grid operators to monitor the grid in near real-time and will be used to devise control schemes to mitigate an emerging instability in the grid.
Johann Rudi, Virginia Tech
Cerebras Accelerated Deep Neural Networks for Parameter Estimation in Scientific Models
Dynamical and stochastic processes are ubiquitous in scientific applications and these are often governed by parametrized equations that are deterministic differential systems and/or stochastic processes. We recently developed parameter estimation techniques based on deep neural networks to estimate parameters of such systems. Advantages of our approach are fast parameter predictions, which, as a consequence, accelerate applications with real-time and frequent estimation demands. However, the training phase of deep neural networks poses computational bottlenecks preventing a wider applicability of deep learning-based parameter estimation. Our aim is to overcome these limitations with recent advances in hardware and algorithms tailored to neural networks.
Sebastian Scherer, Carnegie Mellon University
Generic Visual Instance Search
Humans can rapidly create a mental picture of a novel object by quickly figuring out its 3D geometry. This geometry helps to structure object search in complex scenes. This project tries to develop methods that can form such mental pictures even if the algorithm has never seen such an object, or any object of the same type. This is a fundamental ability for humans; however, even the latest machine learning and computer vision models still cannot achieve this task. Here we propose to address the 3D zero-shot instance search task. By exploring the possibility of encoding 3D information of a target object, we will develop a set of new models and try to find a way to imitate the above human capabilities. We will approach this task fundamentally from the ground up as a 3D problem and will leverage tools such as photo-realistic image generation, 3D reconstruction, salient object detection, and zero-shot learning. We expect this research to lead to fundamental advances in Generic Visual Instance Search. We anticipate that the amount of data generated by this research will be huge. By the definition of the underlying task, we would like to build a deep-learning model that can take a sequence of images and process them with plain convolutional neural networks and, possibly, visual-transformer-style image encoders and decoders. Considering the sheer amount of data we need to handle and the size of the model, we expect that the capability of Neocortex can really boost our model development and evaluation. Please see the attached PDF file for more information.
Tong Shen, Carnegie Mellon University
Toward Robust Object Tracking with Language and Vision Cues
The emergence of large multi-modal foundation models has altered the paradigm of machine learning and computer vision. Our project aims to create cutting-edge object tracking algorithms, built upon the pioneering foundation models. By integrating multi-modal cues, we will significantly elevate tracking precision and robustness, driving transformative advancements in the field. In this project, we will start from the vision-language foundation models and fine-tune them specifically for object tracking applications. We will also devise novel multi-modal feature fusion modules, employing Transformers to better harness information from the language domain.
Chung Shih, National Energy Technology Laboratory
Wafer Scale Engine, Field Equation, Application Programming Interface (WFA) for Material Development
We propose to expand the Wafer Scale Engine, Field Equation, Application Programming Interface (WFA) by developing new kernels to solve molecular dynamics (MD) and Monte Carlo (MC) problems. Specifically, we propose to implement the spatial-decomposition method (ref 2) for MD simulations on the WSE, and algorithms previously run on Graphics Processing Units (GPUs) (ref 3) for MC simulations. Many kernels will be implemented, such as force, energy, periodic boundary condition, nearest-neighbor link list and link cells, Ewald summation and Wolf’s method for electrostatic interactions, an excluded list to exclude specific atom pairs, and various integrators to solve second-order differential equations. Note that although the same Newtonian equations are to be solved in both CFD and MD simulations, they are solved using completely different methods. In CFD, an Euler-type method is typically used to solve for physical properties (such as temperature, pressure, and fluid velocity) in a fixed space. In contrast, in MD simulations, the Lagrangian method is used to track the trajectory of each particle and compute the ensemble average over particles. For MC simulations, three different moves, that is, thermal move, volume change, and insertion and deletion of molecules, will be implemented on the WSE. We expect that implementing these MD/MC kernels on the WSE will greatly benefit industry and academia, for example by accelerating drug design and facilitating materials development for carbon capture, battery electrolytes, gas sensors, and beyond. Finally, we expect that researchers can utilize these low-level fundamental MD/MC kernels to develop more advanced simulation methods on the WSE. Researchers will use Python to call these kernels to build more advanced simulation methods. This way, they do not need to handle and process complex operations on the WSE.
To start the MC project implementation, we will implement new kernels on WFA to solve the 2-D Ising model problem, which shares many significant features with the more complex atomistic MC simulations, such as the acceptance rule and periodic boundary condition.
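The two features named above, the acceptance rule and the periodic boundary condition, can be seen in a minimal single-spin-flip Metropolis sketch for the 2-D Ising model. This is plain Python for illustration only and is unrelated to the WFA kernels themselves; the lattice size, temperature, coupling constant, and units (Boltzmann constant set to 1) are assumptions.

```python
import math
import random


def neighbor_sum(spins, i, j):
    """Sum of the four nearest neighbors, wrapping at the edges (periodic boundary)."""
    n = len(spins)
    return (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
            + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])


def total_energy(spins, coupling=1.0):
    """E = -J * sum over nearest-neighbor bonds, with each bond counted once."""
    n = len(spins)
    energy = 0.0
    for i in range(n):
        for j in range(n):
            # Count only the right and down bonds to avoid double counting.
            right = spins[i][(j + 1) % n]
            down = spins[(i + 1) % n][j]
            energy -= coupling * spins[i][j] * (right + down)
    return energy


def flip_energy_change(spins, i, j, coupling=1.0):
    """Energy change that flipping spin (i, j) would cause."""
    return 2.0 * coupling * spins[i][j] * neighbor_sum(spins, i, j)


def metropolis_sweep(spins, temperature, rng, coupling=1.0):
    """One sweep: n*n single-spin flip attempts with the Metropolis acceptance rule."""
    n = len(spins)
    accepted = 0
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        delta = flip_energy_change(spins, i, j, coupling)
        # Always accept downhill moves; accept uphill moves with probability exp(-delta / T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            spins[i][j] = -spins[i][j]
            accepted += 1
    return accepted


rng = random.Random(0)
lattice = [[1] * 8 for _ in range(8)]  # 8 x 8 lattice, all spins up
initial_energy = total_energy(lattice)
for _ in range(10):
    metropolis_sweep(lattice, temperature=1.0, rng=rng)
```

Each flip attempt reads only the four neighboring spins, so the update is local. The same acceptance test carries over to atomistic MC moves, where the energy change comes from interatomic potentials instead of spin couplings.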
Chung Shih, National Energy Technology Laboratory
Predicting subsurface CO₂ behaviors with deep neural network and fluid dynamics simulation
The National Energy Technology Laboratory (NETL) leads the US DOE multi-lab SMART (Scientific Machine learning to Accelerate Real-Time decisions) Initiative, one goal of which is near real-time forecasting of CO₂ behavior after injection in subsurface storage reservoirs. One of the SMART tasks is to explore and research advanced AI/ML techniques and computational technologies. In this proposed work, NETL’s SMART team aims to use the CS-2 with the team’s AI/ML models and to leverage NETL’s Wafer scale engine Field equation Application programming interface (WFA) to explore HPC-type solutions for the subsurface. There is a separate PSC proposal from NETL that focuses on maturing the WFA.
Shwetank Singh, Case Western Reserve University
Open Source Smart Watch Seizure Detector
Experiencing a seizure while alone increases morbidity and mortality. Studies based on national population registries have shown 3.5 times higher odds of experiencing a sudden, unexpected death for people with epilepsy who sleep alone compared to those who sleep with a partner. This led to the development of seizure alert devices. Currently, over 20 seizure alert devices are commercially available. All are proprietary, and none has undergone a clinical trial comparing performance in real time to human epilepsy doctors. The need for bespoke hardware with proprietary seizure alert devices advertises the medical condition of the user, which impacts technology adoption. Our work aims to create a clinically validated, open-source seizure alert system that can function using the accelerometers found in smartwatches. This will allow discreet deployment of models validated in a blinded clinical trial for detecting seizures.
Vivek Srikumar, University of Utah
Tensor networks and massively parallel language models on accelerator arrays
The gigantic size of recent transformer language models like GPT-3 is due to the use of large dense weight matrices, too large to fit even on a wafer-scale engine like the Cerebras systems. There has been increasing recent interest in more compact, factored representations of weight matrices based on tensor networks. In this project, we propose to explore two complementary research questions: 1) Can we develop compact and accurate factored language models that can fit within the CS-1 and achieve comparable accuracy to much larger models using the standard transformer architecture? 2) Can we develop effective customized mapping/scheduling techniques to enable high performance on the Cerebras CS-1 for training and inference with factored tensor networks?
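To make the storage argument concrete, the sketch below compares a dense weight matrix with a rank-r two-factor representation, arguably the simplest tensor-network factorization. The abstract does not say which factorizations the project will use, and the layer size and rank here are hypothetical numbers chosen only for illustration.

```python
def dense_params(rows, cols):
    """Parameter count of a dense rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count of W ~ U @ V, with U: rows x rank and V: rank x cols."""
    return rank * (rows + cols)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def factored_matvec(u, v, vector):
    """Compute U @ (V @ x) without ever forming the dense matrix U @ V."""
    return matvec(u, matvec(v, vector))


# Hypothetical layer size and rank, chosen only for illustration.
rows, cols, rank = 4096, 4096, 64
compression = dense_params(rows, cols) / low_rank_params(rows, cols, rank)

# Tiny rank-1 example: U is 2 x 1, V is 1 x 3.
u = [[1.0], [2.0]]
v = [[3.0, 4.0, 5.0]]
y = factored_matvec(u, v, [1.0, 1.0, 1.0])
```

With these numbers the factored layer stores 32 times fewer parameters than the dense one. Whether accuracy survives that reduction is the open question the project poses.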
Dingwen Tao, Washington State University
Accelerating Large-Scale Graph Neural Networks Training on Cerebras Wafer Scale Engine
Graph Neural Networks (GNNs) are a promising approach to efficiently learn from graph data structures, and have shown advantages in many critical applications such as chemical reaction prediction, traffic state prediction, and text classification. This project investigates how to accelerate large-scale GNN training by leveraging Cerebras Sparse Linear Algebra Compute Cores and our proposed new graph re-ordering algorithm. This project is well aligned with our NSF-funded large-scale DNN training project (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2034169).
Dirk Van Essendelft, National Energy Technology Laboratory
Developing field equation application programming interface for fluid dynamics applications
The National Energy Technology Laboratory and Cerebras Systems Inc. are developing a domain-specific Wafer scale engine Field equation Application programming interface (WFA). It is designed to solve field equations on the Wafer Scale Engine (WSE). Initial results showed the outstanding performance of the WFA in speed, power consumption, and carbon footprint. This project aims to further mature the WFA and investigate its applications in computational fluid dynamics and other related fields. With a mature WFA, users will be able to form and solve field equations through an easy-to-use, high-level Python interface while maintaining the high performance of the WSE.
Large Scale Machine Learning Force Fields for Metal Hydride Systems
Machine Learning has enabled the prediction of various material properties, from formation energies and HOMO/LUMO levels to atomic energies and forces. The increasing number of available material and molecule datasets presents an opportunity to train Machine Learning models on datasets larger than those typically used in materials science, including larger sets of descriptors and more model parameters. Computational cost for training typically limits the dataset size and model size. In this work we train Machine Learning models to predict scalar properties of metal hydrides, materials which have been shown to have high-temperature superconducting properties, as well as of molecular datasets such as QM9, which are important in various chemical processing industries. Neocortex will allow us to push the limits of the sizes of training sets and models at record training speeds in an attempt to beat the state-of-the-art accuracy on scalar properties.
Saining Xie, New York University
One Model to Generate Them All: Scaling Multi-modal Diffusion Transformers on CS-2
The research project “One Model to Generate Them All: Scaling Multi-modal Diffusion Transformers on CS-2” aims to develop a new generation of infrastructure for Diffusion Transformers, which achieve outstanding performance on image conditional generation tasks and showcase the remarkable scalability of transformers within the diffusion framework. A major benefit of using a transformer backbone is its adaptability to multi-modal learning, allowing for the standardization and unification of architectural backbones across various domains. The project intends to train multi-modal transformer-based foundation models using CS-2, which could revolutionize the research landscape in this area. The proposed multi-modal diffusion transformers will utilize existing transformer blocks in PyTorch and will be extended to support various modalities such as images, audio, and video by designing unique conditioning modules.
Tao Yang, University of California Santa Barbara
Fast Document Ranking with Transformer-based Neural Models
This proposal studies efficient optimization for transformer-based neural document ranking. Recently, transformer-based neural ranking with deep contextual models has been extensively studied for delivering high relevance scores in top-k search of text documents. The main challenge is that using such a model to rank or re-rank documents is extremely expensive during runtime inference. This proposal focuses on developing efficient solutions to perform transformer-based re-ranking computation during ad-hoc query processing. The evaluation process will use public datasets to assess the effectiveness of the proposed techniques in terms of relevance and efficiency.
Yingzhen Yang, Arizona State University
Model Compression for BERT and Its Applications
Vision Transformer (ViT) demonstrates that BERT-like Transformers for natural language processing can be applied to computer vision tasks and result in state-of-the-art performance. Despite achieving tremendous success, visual transformers demand much more resources than CNNs, making them difficult to deploy on edge devices such as mobile phones and embedded devices. In this proposal, we propose two levels of compression for BERT-based models. At the lower level, we propose to automatically search for the number of heads for each transformer block in a BERT-based model. At the higher level, we propose to search for the architecture of a BERT-based model by deciding the insertion locations of transformer blocks. We propose to compress popular unsupervised pre-trained BERT models for computer vision tasks using the proposed compression methods. We will also combine visual BERT transformers with language BERT transformers for cross-modality learning tasks such as visual question answering and visual reasoning. The proposed compression method can be used to search for a tradeoff between the efficiency of the model and the best interaction and connection manners between the visual and language branches.
Jianyi Zhang, Duke University
Revolutionizing Hardware Design: Harnessing Large Language Models for Automated Verilog RTL Task Completion
The hardware design industry has traditionally required manual processes for the creation and verification of Verilog RTL (register-transfer level) designs. However, with the emergence of large language models, there is growing interest in utilizing these models to automate various tasks in the hardware design process. Our project aims to establish a foundation model for efficient and trustworthy hardware design through the automation of Verilog RTL tasks using large language models. We plan to pre-train a T5-like large language model on a self-curated dataset consisting of approximately 130 million Verilog code samples. We also aim to modify the T5 model architecture to enhance its performance on downstream tasks such as code summarization and refinement. To further optimize the training process, we plan to design an efficient loss function. Our framework will utilize complementary libraries including Transformers, OpenAI, Pickle, and Datasets among others, in addition to the standard PyTorch and TensorFlow distributions. Our approach has the potential to transform the hardware design industry by utilizing advanced language models and cutting-edge computational resources to create more efficient and reliable systems. By establishing a foundation model for efficient and trustworthy hardware design, our project aims to revolutionize the Verilog RTL design process.
Learning the underlying molecular distribution of chemical spaces with large models for the generation and discovery of novel molecules
Learning and revealing the underlying complex molecular distributions of general chemical spaces, or of specialized chemical spaces (for example, those with special properties for drug development and discovery and other essential applications), can be of fundamental theoretical and practical importance. There has been increasing evidence that deep generative models of molecules, trained on relevant datasets, can be used to search through chemical space for novel molecules. It is our hope that, with adequately large datasets and large models, it is possible to develop a generalized model for the latent space representation of the underlying complex molecular distribution of the chemical space in general. Such a model of the general distribution of the chemical space could then be adapted, refined, and specialized for subspaces of desired chemical properties via transfer learning on specialized datasets. Large datasets and large models mandate novel hardware and efficient software algorithms for rapid iterations of the training and tuning processes. In this work we will port, further develop, and scale up a deep generative model, GENTRL, that we have been porting and testing during the early access phase of Neocortex. We will use increasingly large datasets and increasingly complex NN architectures to tune, refine, and benchmark the NN model against existing datasets and molecules that are well characterized experimentally. In addition, we will further develop and apply a novel, recently developed algorithm for optimizing NN parameters, inspired by and based on conservation of energy/cost in the hypersurfaces of the cost functions of the NN model.