Mind the Gap

MARC Program Helps Minority-Serving Institutions Prepare Students for 21st-Century Biology Careers

August 13, 2013

American biology education risks becoming a two-class system. The top-tier institutions understand that bioinformatics, the use of advanced computing techniques on biological problems, will soon be a job requirement in much of biology, and have expended considerable resources to create bioinformatics classes, degree programs and research centers. Students at institutions without such resources or expertise, on the other hand, are in danger of being left behind.

Last Updated on Monday, 14 October 2013 09:41

CFL Software, PSC Collaborate on Next Generation of Information Searching

July 18, 2013

New software being developed by CFL Software may transform our ability to search for information in text documents as profoundly as search engines improved upon paper library card catalogs. The software, CFL Discover, will search electronic text documents far more completely and accurately than possible with today’s search technologies.

Pittsburgh Supercomputing Center (PSC) is collaborating with CFL as a strategic partner in developing CFL Discover, making the software available to researchers on Sherlock, a modified version of YarcData’s Urika™, a real-time data discovery appliance at the center.

“This is a new venture both in terms of scale and speed in searching for information,” says David Woolls, CEO of CFL Software, which specializes in linguistic document forensics. “In essence, we take over where search engines stop.”

While many users may not be aware of it, search engines don’t completely search all the text in the entire Web — that would take far too long. Instead, they search indexes, keywords, categories and other “metadata” that have been added to those documents. In the case of keywords and categories, that addition has to be made by humans, and so is time-intensive and incomplete. Today’s engines obviously revolutionized our ability to find information, but they are inexact. Many irrelevant sites pop up, and many sites that may be more suitable aren’t captured. In a sense, we all stop when we reach a site that is “good enough” rather than one that’s best for our needs.

“Search engines start with a few words and return a list of documents which contain them,” Woolls adds. “CFL Discover starts with one or more of those documents and reads them for you, shows you the terminology that is shared and gives immediate access to the passages of particular interest to you.”

The program uses YarcData’s industry-standard SPARQL query language and RDF (Resource Description Framework) to search entire texts for meaningful connections between the words in a search query and related language in other texts. This kind of “graph search” enables someone searching for information to find relevant connections that they may not have thought of. The program is written in Java, so is platform independent and can work on anything from a standard PC to a Java-capable supercomputer. (While most supercomputers can’t run Java, two at PSC — Sherlock and Blacklight — do, providing valuable support for research communities that primarily use Java for data analytics.) The choice of platform and computer is solely dependent on the volume and speed of response required.
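The idea of a graph search over RDF-style data can be illustrated with a minimal sketch. It is not CFL Discover's implementation: the triples, the `mentions` predicate, and the function names are all hypothetical, and a tiny in-memory list stands in for a real RDF store queried with SPARQL. The sketch shows only how documents can be linked through the terms they share rather than through hand-assigned keywords.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; not CFL Discover's real schema.
TRIPLES = [
    ("doc:patent1", "mentions", "term:turbine"),
    ("doc:patent1", "mentions", "term:blade"),
    ("doc:wiki1", "mentions", "term:turbine"),
    ("doc:wiki1", "mentions", "term:wind"),
    ("doc:wiki2", "mentions", "term:wind"),
    ("doc:wiki3", "mentions", "term:enzyme"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def related_documents(triples, start):
    """Map every other document to the set of terms it shares with `start`.

    Roughly what a SPARQL query of this shape would express (assumed prefix ex:):
      SELECT ?other ?term WHERE {
        ex:start ex:mentions ?term .
        ?other   ex:mentions ?term .
        FILTER(?other != ex:start)
      }
    """
    shared = defaultdict(set)
    for _, _, term in match(triples, s=start, p="mentions"):
        for other, _, _ in match(triples, p="mentions", o=term):
            if other != start:
                shared[other].add(term)
    return dict(shared)


if __name__ == "__main__":
    print(related_documents(TRIPLES, "doc:patent1"))
```

Starting from one document instead of a few keywords mirrors the workflow Woolls describes; a production system would hold the triples in an RDF store rather than a Python list.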

“It’s less like searching for a needle in a haystack than searching for a needle in a needlestack,” says Arvind Parthasarathi, President, YarcData. The advantage of CFL Discover is that it allows related groups of documents to be rapidly identified, not on the basis of pre-determined keywords and categories, but purely on the similarity of the content. This in turn allows the rapid creation of new combined databases from a collection of existing databases. For example, when searching Wikipedia, entering the title of an article causes CFL Discover to read the database, returning a comprehensive list of potentially interesting articles related to the whole content. And because the framework is RDF, searches of other RDF collections can be readily performed. The principles on which the program works allow it to be used in many different languages, including Arabic, Chinese, Thai and Finnish, which appear to be very disparate to the human eye.

“The structures and sequences inherent to individual documents are all that are needed to encode them,” Parthasarathi says. “New material is easily added to existing stores and is immediately available for use by the search queries.”
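Grouping documents by the similarity of their content, rather than by pre-assigned keywords, can be sketched as follows. The article does not say which similarity measure CFL Discover uses, so this example substitutes plain Jaccard overlap of word sets as a stand-in; the function names and sample texts are hypothetical, and the simple tokenizer handles only unaccented Latin letters and digits, not the other languages mentioned above.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens (Latin letters and digits only)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: shared tokens / all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_by_similarity(query_doc, corpus):
    """Rank corpus documents (name -> text) by similarity to the query document."""
    scored = [(name, jaccard(query_doc, text)) for name, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "x": "wind turbine blade",
        "y": "enzyme folding",
    }
    for name, score in rank_by_similarity("turbine blade design", corpus):
        print(name, round(score, 2))
```

Because the score depends only on the texts themselves, a newly added document can be compared immediately without anyone tagging it first.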

CFL Software has carried out proof-of-concept studies using CFL Discover to search the description sections of U.S. Patent Office records and legal documents, as well as Wikipedia. The collaboration with PSC will employ the program on PSC’s Sherlock, which is optimized to search extremely large and complex bodies of information with open-ended queries. The new work will explore a substantial portion of the U.S. Patent database, in addition to the full data of Wikipedia in more depth.

“PSC’s role in the partnership is to couple the unique analytic capability of Sherlock running CFL Discover with hosting massive datasets on PSC’s Data Supercell to expand text analytics to unprecedented, interdisciplinary use cases,” says Nick Nystrom, PSC’s director of strategic applications. “Response time is critical for exploring big data, and Sherlock with CFL Discover will provide rapid analyses of unstructured text data larger than can be done on any platform currently available to U.S. researchers.”

“We see high value for a wide range of research and societal applications,” Nystrom adds. Examples include analyzing recent events from news and social media sources, extracting deeper insights from sets of publications, and enabling computational history and culturomics — the quantitative study of cultural phenomena by analyzing large volumes of written records. “Application of high-performance analytics is new to these and similar fields, and will catalyze new ways of leveraging unstructured text data.”

Last Updated on Thursday, 18 July 2013 10:23

Blacklight Research Spurs Change in Stock Exchange Rules

July 15, 2013

Findings on the effects of “odd lot” trades on the financial markets, using computations on PSC’s Blacklight, have led the New York Stock Exchange, the Nasdaq Stock Market and the Financial Industry Regulatory Authority Inc. to redefine how the industry tracks small stock trades. The new rules will be enacted in October.

Previously, odd lots (trades of fewer than 100 shares) did not have to be reported to regulators. The rationale was that these trades involved small investors who were unlikely to affect the larger market significantly. But recent volatility in the markets, driven by automated small trades that occur far faster than any human can think, called that assumption into question.

In an upcoming paper in The Journal of Finance, Mao Ye and Chen Yao of the University of Illinois at Urbana-Champaign and Maureen O’Hara of Cornell University report that odd lots are playing an increasingly important role in the wider behavior of the markets. The researchers used Blacklight and the San Diego Supercomputer Center’s Gordon to analyze market data for the effects of odd-lot trading.

“For every 100 trades of Google, 52 to 53 of them” are in the form of odd lots, Ye observes. “There are more missing trades than trades you can see. In terms of volume, more than 20 percent of the trading volume [among all stocks] is missing” in the official count.
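The two figures Ye cites, the share of trades and the share of volume that are odd lots, are different ratios, and a short sketch makes the distinction concrete. The trade sizes below are invented for illustration and are not the researchers' data. The code takes an odd lot to be a trade of fewer than 100 shares (one standard round lot), and the function name is hypothetical.

```python
ROUND_LOT = 100  # shares in a standard round lot


def odd_lot_shares(trade_sizes, round_lot=ROUND_LOT):
    """Return (fraction of trades, fraction of share volume) that are odd lots."""
    if not trade_sizes:
        raise ValueError("trade_sizes must not be empty")
    odd = [size for size in trade_sizes if size < round_lot]
    trade_fraction = len(odd) / len(trade_sizes)
    volume_fraction = sum(odd) / sum(trade_sizes)
    return trade_fraction, volume_fraction


if __name__ == "__main__":
    # Invented sample: three odd-lot trades and two round-lot-or-larger trades.
    sizes = [10, 50, 100, 200, 40]
    by_count, by_volume = odd_lot_shares(sizes)
    print(f"odd lots: {by_count:.0%} of trades, {by_volume:.0%} of volume")
```

In this made-up sample, odd lots are most of the trades but a much smaller part of the volume, the same pattern as the 52-to-53 percent of trades and 20 percent of volume quoted above.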

The widely held suspicion is that the largest and most sophisticated traders are using automated trading in odd lots to hide their activities from other traders. In any case, the researchers showed that including the odd lots significantly alters our understanding of the markets. Partly in response to this research, in June 2013 the market authorities agreed to a plan to require all trades, of as few as one share, to be reported.

“In the U.S., they care a lot about the transparency of the market,” Ye explains. The new rule change will remove “a kind of darkness we cannot see and that we never realized was there.”

PSC covered the group’s work in detail in a recent article.

Last Updated on Monday, 15 July 2013 08:09

Subscriptions: You can receive PSC news releases via e-mail. Send a blank e-mail to psc-wire-join@psc.edu.