Teaching the Machine to See

Blacklight “Trains” Video Search System for Competition Victory

One sign that we live in the era of Big Data is that, as we've moved from Super 8 home movies to ever-present smartphones, we all generate so much video that we seldom watch a given clip more than once. Worse, when we do want to find a clip, it's lost among thousands of others.


Machine intelligence researchers Shoou-I Yu and Lu Jiang, working with colleagues on Alexander Hauptmann's Informedia project at Carnegie Mellon University and at PSC, have developed E-Lamp, a system of "event detectors" designed to search for events in videos without human intervention. Such a detector could help us all keep better tabs on our video-filled electronic lives.

 “When we tried to train the system on our own computer cluster, we were overloading our file system. Blacklight gave us eight times the speed, and we were not breaking the file system.”

—Shoou-I Yu, Carnegie Mellon University 

E-Lamp consists of a series of tools that start with a definition of a kind of event (left), then scan videos for sounds or images associated with that definition (center). The machine cannot know for certain; it can only make statistical guesses. So the final step is to rank the possible "hits," with users providing feedback that helps the system learn (right).
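The three stages in the caption can be sketched as a toy pipeline. Everything here, the function names, the concept scores, and the weighted-sum scoring scheme, is an illustrative assumption, not E-Lamp's actual code:

```python
# Toy sketch of an E-Lamp-style pipeline: define an event as a set of
# concepts, scan videos for those concepts, rank the hits, and nudge
# concept weights with user feedback. Purely illustrative.

def scan(video_concepts, event_def, weights):
    """Score one video: weighted sum of detector scores for the event's concepts."""
    return sum(weights.get(c, 1.0) * video_concepts.get(c, 0.0) for c in event_def)

def rank(videos, event_def, weights):
    """Return video ids sorted from most to least likely to show the event."""
    return sorted(videos, key=lambda vid: scan(videos[vid], event_def, weights),
                  reverse=True)

def feedback(weights, event_def, video_concepts, relevant, lr=0.5):
    """User marks a top hit relevant or irrelevant; adjust concept weights."""
    sign = 1.0 if relevant else -1.0
    for c in event_def:
        weights[c] = weights.get(c, 1.0) + sign * lr * video_concepts.get(c, 0.0)
    return weights

# Hypothetical event definition: concepts a "birthday party" query might rely on.
event = {"cake", "candles", "singing"}
videos = {
    "v1": {"cake": 0.9, "candles": 0.8, "singing": 0.7},  # real birthday party
    "v2": {"cake": 0.8, "tire": 0.9},                     # misleading: baking show
    "v3": {"dog": 0.9},                                   # unrelated
}
order = rank(videos, event, {})
print(order)  # v1 first: it matches the most event concepts
```

Marking a misleading hit as irrelevant lowers the weights of the concepts that fired on it, so its score drops on the next search, the "learning" step in the caption.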

The task of finding a video of a birthday party, for example, is fairly easy for a person. But it's extremely hard for a computer: all the cues a machine might use to spot a video, including color, shape, sounds and even captions, can be misleading. To accomplish this difficult task, the researchers turned to PSC's Data Exacell system to manage a vast volume of data and an XSEDE allocation on PSC's Blacklight supercomputer, which offered a large number of processors and very large memory. The system, with support from XSEDE's Extended Collaborative Support Service and Novel and Innovative Projects Program, allowed them to employ a huge number of potential clues. The team was also able to develop a larger number of "concept detectors"—elements in the E-Lamp system that search for specific things, such as birthday parties. The team increased the input of "training" videos into E-Lamp from 200,000 video shots in its 2013 version to more than 2.7 million shots in the current version.
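A "concept detector" of the kind the team scaled up is, at heart, a binary classifier trained on labeled examples. A minimal sketch, assuming hand-made feature vectors and plain logistic regression rather than the team's actual models:

```python
import math

def train_detector(samples, labels, epochs=200, lr=0.5):
    """Train a tiny logistic-regression concept detector via stochastic
    gradient descent. samples: feature vectors (e.g., color/shape/audio
    scores); labels: 1 if the concept is present in the clip, else 0."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # predicted probability
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def detect(x, w, b):
    """Score a new video's features with the trained detector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Invented toy features: [cake-likeness, candle-likeness, engine-noise]
X = [[0.9, 0.8, 0.0], [0.8, 0.9, 0.1], [0.1, 0.0, 0.9], [0.0, 0.1, 0.8]]
y = [1, 1, 0, 0]  # first two clips show a birthday party
w, b = train_detector(X, y)
```

The article's scaling point maps directly onto this sketch: more training shots means more rows in `X`, and more concept detectors means more such classifiers trained in parallel, which is where the large memory and processor counts of Blacklight came in.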

They also increased the number of concept detectors from 346 to over 3,000. At the National Institute of Standards and Technology's 2014 TREC Video Retrieval Evaluation workshop, E-Lamp outperformed all competing systems in searching for videos based on either queries given to the researchers ahead of time or queries sprung on them at the competition.

“The system builds a model for detecting a concept, then tests that model. Then it builds an improved model and tests that. It asks itself, ‘If I have features that look like this, will they help me do the best job in discriminating videos with dogs from videos without dogs?’” —Alexander Hauptmann, Carnegie Mellon University
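The build-test-improve loop Hauptmann describes can be sketched as greedy feature selection: keep a candidate feature only if adding it raises held-out accuracy at telling "dog" videos from "non-dog" ones. The features, data and accept-anything-that-fires rule below are invented for illustration:

```python
def accuracy(feature_set, videos):
    """Fraction of held-out videos classified correctly by a simple rule:
    predict 'dog' when any selected feature fires in the clip."""
    correct = 0
    for feats, is_dog in videos:
        pred = any(f in feats for f in feature_set)
        correct += (pred == is_dog)
    return correct / len(videos)

def improve(candidates, heldout):
    """Build a model, test it, and keep each candidate feature only if
    the improved model beats the previous one on held-out videos."""
    chosen = set()
    best = accuracy(chosen, heldout)
    for f in candidates:
        trial = chosen | {f}
        score = accuracy(trial, heldout)
        if score > best:              # improved model survives the test
            chosen, best = trial, score
    return chosen, best

# Hypothetical held-out clips: (features that fired, truly contains a dog?)
heldout = [
    ({"bark", "fur"}, True),
    ({"fur", "leash"}, True),
    ({"meow", "fur"}, False),
    ({"engine"}, False),
]
feats, acc = improve(["bark", "fur", "leash", "meow"], heldout)
```

Note how "fur" is rejected even though dogs have fur: it fires on cat videos too, so it fails the discrimination test in Hauptmann's quote, while "bark" and "leash" survive.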


“If a video used to train the system is misleading—for example, a ‘vacation video’ that shows people changing a tire—it can be a disaster for accurately identifying a vacation. Blacklight allowed us to use more sample videos, and when you have more ‘correct’ samples and a smart algorithm, the training is more robust to the misleading samples.”—Lu Jiang, Carnegie Mellon University
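Jiang's point can be seen in a toy calculation: a single mislabeled "vacation" video dominates a tiny training set but barely moves the estimate once correct samples outnumber it. The numbers and the averaging rule are purely illustrative:

```python
def concept_strength(labels):
    """Fraction of training videos tagged 'vacation' that truly show one:
    a crude stand-in for how cleanly the concept gets learned."""
    return sum(labels) / len(labels)

# 1 = correctly labeled vacation video, 0 = mislabeled tire-changing video
small_set = [1, 0]           # one bad sample out of two: signal is 50/50
large_set = [1] * 9 + [0]    # the same bad sample among nine correct ones

print(concept_strength(small_set))  # 0.5: training is badly misled
print(concept_strength(large_set))  # 0.9: the bad sample is diluted
```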