Blacklight: Incubating Research in Machine Learning and Natural Language Processing
Blacklight, PSC's SGI Altix UV 1000, on which applications can access up to 16 TB of hardware-enabled coherent shared memory, is enabling groundbreaking research in machine learning (ML), natural language processing (NLP), game-theoretic analysis, and related computer science disciplines. These projects are paving the way toward automated reasoning and unprecedented data analytics.
Very large, hardware-enabled coherent shared memory is especially valuable for developing computer science algorithms (e.g., in machine learning), for several reasons:
- Shared Memory: Being able to access up to 16 TB of shared memory, either from many threads or even from a single thread, frees the programmer from having to distribute data explicitly, as is the case when using MPI on a distributed-memory computer. This is especially valuable for algorithms where data sizes and access patterns are irregular and difficult or impossible to predict, e.g., graph algorithms. Similarly, being able to stage a full dataset into memory and then perform complex operations on it is much more efficient than repeatedly accessing fragments from distributed disks.
- Single-System Image/High Thread Count: Each of Blacklight's two 16 TB, 2048-core partitions runs a single system image (SSI), i.e., a single instance of the SUSE Linux operating system. This allows applications with very high thread counts, easily in the thousands and potentially much higher, to express algorithms conveniently and productively, building on computer science's large existing code base. Algorithms in machine learning and related disciplines are often expressed using POSIX threads (Pthreads) or Java threads.
- High Productivity: The combination of SUSE Linux with being able to access 16 TB of coherent shared memory flexibly from one thread or from thousands of threads enables an unusually wide range of familiar, highly productive languages and analysis tools. These include Java, Python, other scripting languages, R, and Octave, to name a few.
- Scalable Software: Blacklight is currently the only resource on XSEDE that supports GraphLab, a parallel framework for machine learning. Developed at Carnegie Mellon, GraphLab implements machine learning algorithms that are scalable, efficient, and provably correct, and it is used by several of the ML projects at CMU.
- Datasets: Blacklight hosts datasets that are important for machine learning research; for example, ClueWeb09.
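The shared-memory programming model described in the list above can be illustrated with a minimal Python sketch. It is not taken from any Blacklight project: the function name `out_degrees`, the edge-list representation, and the thread count are assumptions chosen for illustration. Every thread reads the same in-memory edge list directly, so the data is never partitioned or sent between processes. In CPython the global interpreter lock limits the speedup of pure-Python threads, so the sketch shows the programming model rather than performance.

```python
from concurrent.futures import ThreadPoolExecutor


def out_degrees(edges, num_nodes, num_threads=4):
    """Count each node's out-degree with threads that all read one shared edge list."""

    def count_range(bounds):
        start, stop = bounds
        local = [0] * num_nodes      # private tally per task, so no locking is needed
        for i in range(start, stop):
            src, _dst = edges[i]     # read straight from the shared list, no copy
            local[src] += 1
        return local

    # Split the index range, not the data: every task sees the whole edge list.
    chunk = max(1, (len(edges) + num_threads - 1) // num_threads)
    ranges = [(s, min(s + chunk, len(edges))) for s in range(0, len(edges), chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(count_range, ranges))

    # Combine the per-task tallies into one result.
    totals = [0] * num_nodes
    for local in partials:
        for node, count in enumerate(local):
            totals[node] += count
    return totals


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 2), (2, 0), (2, 1), (2, 2)]
    print(out_degrees(edges, num_nodes=3, num_threads=2))
```

An MPI version of the same computation would first have to decide which process owns which edges and then exchange partial counts explicitly. Here only the index range is divided among the threads.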
A few examples of ongoing research in machine learning, natural language processing, and related data analytics are as follows:
Figure 1. Smith's group has looked at predicting the number of downloads of papers published at the National Bureau of Economic Research (an influential repository of working papers in economics and finance), based on the paper's content. With "download" information for each paper, they map the requesting IP addresses to regions of the world, and then determine what aspects of a paper's content correlate with its download "popularity". The chart shows the most informative single-word features for each of seven regions of the world.
Research being done by Noah Smith's group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, spans several topics in statistical natural language processing (NLP). Text-driven forecasting is one new application of NLP: given some text, make a concrete prediction about future measurable events in the world. An example is forecasting the impact of a scientific article, using its text content [Yogotama2011]. Impact might be measured as the citation rate or download rate on the web. The group has been exploring a wide range of text features and forecasting methods for this problem, using datasets of articles from two fields (economics and computational linguistics). They have been able to find trends over time and across geography, using download data for articles in the economics literature as in Figure 1. Because their models are linear in human-intelligible text features, they are understandable (the models "speak English"), making these techniques an excellent way to connect with social scientists. Smith says, "The large amounts of shared-memory available on Blacklight have been crucial to our ability to model things like this, since the models are parameterized with very large numbers of features, and training requires iteratively re-estimating the model parameters."
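To make the idea of a model that is linear in word features and trained by iterative re-estimation concrete, here is a minimal sketch. It is not the group's model: the bag-of-words features, the squared-error objective, the stochastic gradient updates, and the toy documents and download counts are all assumptions made for illustration.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words features: each lowercase token maps to its count."""
    return Counter(text.lower().split())


def predict(weights, bias, features):
    """Linear prediction: bias plus the weighted sum of the word counts."""
    return bias + sum(weights.get(word, 0.0) * count for word, count in features.items())


def train_linear(docs, targets, epochs=200, lr=0.05):
    """Fit the weights by repeated passes of stochastic gradient descent on squared error."""
    weights, bias = {}, 0.0
    feats = [featurize(doc) for doc in docs]
    for _ in range(epochs):
        for x, y in zip(feats, targets):
            err = predict(weights, bias, x) - y
            bias -= lr * err
            for word, count in x.items():
                weights[word] = weights.get(word, 0.0) - lr * err * count
    return weights, bias


if __name__ == "__main__":
    # Hypothetical toy data: document text and a made-up download score.
    docs = ["monetary policy inflation", "monetary policy rates",
            "labor wages survey", "labor survey data"]
    downloads = [3.0, 3.0, 1.0, 1.0]
    weights, bias = train_linear(docs, downloads)
    # Because the model is linear, the largest weights can be read off directly.
    for word, w in sorted(weights.items(), key=lambda kv: -kv[1])[:3]:
        print(word, round(w, 2))
```

Each word has a single weight, so the words with the largest weights can be read off directly. With a realistic vocabulary the weight table has a very large number of entries, and every training pass touches it, which is where a large shared memory helps.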
Another major application of NLP is machine translation: turning text in one language into semantically equivalent text in another language. The statistical approach to translation learns from examples of human-translated sentences, inferring a hidden alignment structure between words and phrases. Smith's group reformulated one of the dominant models of alignment as a conditional random field with latent variables. They showed that well-designed features that are unavailable in more traditional generative models can lead to significant increases in translation quality for Czech-English, Chinese-English, and Urdu-English translation. A paper describing this work was accepted for publication by the Association for Computational Linguistics.
The group also considered the related problem of experimental methodology in translation. The dominant optimization routine used to build translation models from data involves randomness, but most experimental research fails to take this randomness into account when testing significance (i.e., when comparing two systems), so the group proposed a computationally intensive solution based on sampling. Using Blacklight to generate a very large number of samples, they demonstrated empirically that estimates obtained with only a few samples (and therefore within range of commodity hardware) can determine with high confidence whether the difference between two experimental conditions is valid.
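A sampling-based significance test of this general kind can be sketched as follows. This is a generic paired approximate-randomization test, not necessarily the procedure the group proposed. The function name, the per-run scores, and the default sample count are assumptions for illustration.

```python
import random


def approx_randomization(scores_a, scores_b, num_samples=10000, seed=0):
    """Estimate a p-value for the mean difference between two paired score lists.

    Each sample randomly swaps the two systems' scores within each pair and checks
    whether the shuffled difference is at least as large as the observed one.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(num_samples):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap the labels of this pair
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (num_samples + 1)


if __name__ == "__main__":
    # Hypothetical scores from 20 paired runs of two systems.
    system_a = [0.30 + 0.01 * i for i in range(20)]
    system_b = [s - 0.05 for s in system_a]
    print(approx_randomization(system_a, system_b, num_samples=2000))
```

Raising `num_samples` tightens the p-value estimate at proportional cost. Comparing a very large sample count against a small one is the kind of experiment that tells you how few samples are enough.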
Leveraging Supercomputing for Large-scale Game-theoretic Analysis
Automatically determining effective strategies in stochastic environments with hidden information is an important and difficult problem. In multiagent systems, the problem is exacerbated because the outcome for each agent depends on the strategies of the other agents. Consequently, each agent must incorporate into its deliberation the expected actions of the other agents. For many such environments, game theory has proven to be an effective tool, both for modeling the situation and for providing prescriptive solutions. In principle, optimal strategies can be computed for sequential imperfect information games. However, the size of the game trees can be enormous, and in order to compute optimal strategies, the entire game tree must be considered at once. Tuomas Sandholm's recent research advances in automated abstraction and equilibrium-finding algorithms have opened up the possibility of solving two-person zero-sum games many orders of magnitude larger than what was previously possible.
The class of sequential imperfect information games includes poker as a special case. Poker games are well-defined environments exhibiting many challenging properties, including adversarial competition, uncertainty (with respect to the cards the opponent currently holds), and stochasticity (with respect to the uncertain future card deals). Poker has been identified as an important testbed for research on these topics. Consequently, many researchers have chosen poker as an application area in which to test new techniques. In particular, Heads-up Limit Texas Hold'em poker, now a benchmark problem in ML, has recently received a large amount of research attention.
The group introduced two techniques for speeding up any gradient-based algorithm for solving sequential two-person zero-sum games of imperfect information. Both techniques decrease the amount of time spent performing the critical matrix-vector product operation needed by gradient-based algorithms. They also specialized their software for running on ccNUMA machines such as Blacklight, an architecture that is becoming ubiquitous in high-performance computing. The two techniques can be used together or separately.
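To show why the matrix-vector product dominates iterative equilibrium computation, the sketch below solves a small two-person zero-sum matrix game. It uses multiplicative-weights self-play, a simpler stand-in chosen for illustration, and not the group's equilibrium-finding algorithm or either of their speed-up techniques. The payoff matrix, step size, and iteration count are assumptions.

```python
import math


def matvec(A, y):
    """Expected payoff of each row against the column player's mixed strategy y."""
    return [sum(a * v for a, v in zip(row, y)) for row in A]


def matvec_transpose(A, x):
    """Expected payoff conceded by each column against the row player's mixed strategy x."""
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def solve_matrix_game(A, iterations=20000, eta=0.01):
    """Approximate equilibrium strategies by averaging multiplicative-weights iterates.

    A[i][j] is the payoff to the row player, who maximizes; the column player minimizes.
    """
    m, n = len(A), len(A[0])
    x, y = [1.0 / m] * m, [1.0 / n] * n
    x_sum, y_sum = [0.0] * m, [0.0] * n
    for _ in range(iterations):
        x_sum = [s + v for s, v in zip(x_sum, x)]
        y_sum = [s + v for s, v in zip(y_sum, y)]
        # The two products below are the only operations that touch the payoff matrix.
        row_gain = matvec(A, y)
        col_loss = matvec_transpose(A, x)
        x = normalize([v * math.exp(eta * g) for v, g in zip(x, row_gain)])
        y = normalize([v * math.exp(-eta * g) for v, g in zip(y, col_loss)])
    return [s / iterations for s in x_sum], [s / iterations for s in y_sum]


if __name__ == "__main__":
    A = [[2.0, -1.0], [-1.0, 1.0]]   # hypothetical 2x2 game with value 0.2
    x, y = solve_matrix_game(A)
    print([round(v, 2) for v in x], [round(v, 2) for v in y])
```

In a game as large as poker, the payoff matrix is far too big for a dense nested list, and each product becomes a traversal of the game tree. That is why reducing the cost of this one operation pays off directly.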
Sandholm's research group computed the strategies for their bots for the AAAI 2010 Annual Computer Poker Competition using their fast-EGT equilibrium-finding algorithm. Playing Heads-up No-Limit Texas Hold'em poker, their program, Tartanian4, won the Total Bankroll category and placed third in the Bankroll Instant Run-off category.
[Yogotama2011] D. Yogatama, M. Heilman, B. O'Connor, C. Dyer, B. Routledge, and N. A. Smith. Predicting a Scientific Community's Response to an Article. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011, http://aclweb.org/anthology-new/W/W11/W11-22.pdf.