The University of Maryland (UMD) is part of a multi-institutional team tasked with building a powerful set of language technologies that can unlock information that has previously been unsearchable, and thus unfindable.
The four-year project, funded by a $14.4M grant from the Intelligence Advanced Research Projects Activity (IARPA), is expected to produce a language processing system that allows a user to type in a query in English and have information returned in English—even if the content is only available in a lesser-known language like Croatian.
The project involves faculty, postdocs and students from Maryland, Columbia University, Yale University, the University of Cambridge, and the University of Edinburgh. Columbia is the lead institution, with Kathleen McKeown, the founding director of Columbia’s Data Science Institute, serving as principal investigator.
The interdisciplinary research—already underway—includes experts in natural language processing, speech processing, and information retrieval.
“Today’s internet brings us closer together than ever before, but the diversity and richness of human language remains a challenge,” says Douglas Oard, a professor at the College of Information Studies (Maryland’s iSchool), who is heading up the UMD research team. “Computers can be trained to transform human language in many useful ways, but today that training process is still too expensive to affordably be applied to all the world’s languages, and too dependent on the artisanal skills of a small number of experts.”
Joining Oard at Maryland are Philip Resnik (professor, linguistics), Marine Carpuat (assistant professor, computer science), and Hal Daumé (professor, computer science and Language Science Center). All four faculty have appointments in the University of Maryland Institute for Advanced Computer Studies (UMIACS), where they work together in the Computational Linguistics and Information Processing (CLIP) Laboratory, one of 16 centers and labs in UMIACS.
The system they are building, called SCRIPTS (System for Cross Language Information Processing, Translation and Summarization), will take advantage of the latest advances in computing technologies. These include machine-learning algorithms that can sift through large amounts of human language, looking for commonalities in syntax and semantics.
When completed, SCRIPTS will be able to transcribe speech from multiple sources such as videos, news broadcasts and some types of social media. It will also process text documents like newspapers, reports and social media posts.
The system will use multiple strategies, such as matching an English query against translated documents and then summarizing the result. It will also be able to search and summarize directly in the foreign language, and then translate the selected summaries into English.
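The two strategies described above can be illustrated with a toy sketch. Everything in it is hypothetical and chosen only for illustration: the tiny Swahili word list, the sample documents, and the function names are not part of SCRIPTS. The word-by-word lookup stands in for machine translation, and truncation stands in for summarization. Translating the query for the second strategy is also an assumption about how a search "directly in the foreign language" might be set up.

```python
# Toy illustration only. The word list, documents, and function names are
# hypothetical and are not taken from SCRIPTS.

# Hypothetical Swahili-to-English word list (stand-in for machine translation).
SW_TO_EN = {
    "maji": "water", "safi": "clean", "shule": "school",
    "mpya": "new", "kijiji": "village",
}
EN_TO_SW = {en: sw for sw, en in SW_TO_EN.items()}

# Hypothetical foreign-language document collection.
DOCS = {
    "doc1": "kijiji maji safi",
    "doc2": "shule mpya kijiji",
}


def translate(text, table):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.lower().split())


def search(query, docs):
    """Return ids of documents that contain every query term, sorted by id."""
    terms = set(query.lower().split())
    return sorted(d for d, text in docs.items() if terms <= set(text.lower().split()))


def summarize(text, max_words=2):
    """Stand-in summarizer: keep only the first max_words words."""
    return " ".join(text.split()[:max_words])


def translate_then_search(query_en):
    """Strategy 1: translate the documents, match the English query, summarize."""
    docs_en = {d: translate(text, SW_TO_EN) for d, text in DOCS.items()}
    return {d: summarize(docs_en[d]) for d in search(query_en, docs_en)}


def search_then_translate(query_en):
    """Strategy 2: search and summarize in the source language, then translate."""
    query_sw = translate(query_en, EN_TO_SW)  # assumption: the query is translated first
    hits = search(query_sw, DOCS)
    return {d: translate(summarize(DOCS[d]), SW_TO_EN) for d in hits}


if __name__ == "__main__":
    print(translate_then_search("clean water"))
    print(search_then_translate("clean water"))
```

In this toy setting both routes return the same result. In a real system they can disagree, because translation errors enter at different points: in the documents for the first strategy, and in the query and the final summaries for the second.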
“The collection and analysis of information required to accomplish a specific intelligence task has increasingly become a multilingual venture,” says Carl Rubino, who is leading IARPA’s MATERIAL program. (MATERIAL stands for Machine Translation for English Retrieval of Information in Any Language.)
For most languages, Rubino says, there are very few automated tools for cross-lingual data mining and analysis. “MATERIAL aims to investigate how current language processing technologies can most efficiently be developed and integrated to respond to specific information needs against multilingual speech and text data,” he says.
As it now stands, analysts must wade through multilingual document collections manually or use computers that are unable to translate languages that have a small digital footprint, known as “low-resource languages,” into English. In addition, many current systems don’t provide accurate translations of these low-resource languages.
For example, Tagalog and Swahili, languages spoken by millions of people in the Philippines and East Africa, respectively, have far less digital content on which systems can be trained.
And if the language is originally retrieved from a news broadcast or other audio source, its pronunciation may not translate well to English, or there may be variable pronunciations for certain words, says Oard, who is an expert in cross-language retrieval.
“We’ve [already] built machines that learn from examples, but for these low-resource languages, we just don’t have enough examples,” he says.
This is where new technology will come into play. Using sophisticated “deep learning” systems, the SCRIPTS team will begin to compile documents in several low-resource languages that have been selected by IARPA as representative examples. They’ll develop new algorithms to analyze language patterns such as sentence structure and morphology, that is, how words are formed and how they relate to other words in the same language.
Deep learning-based translation systems under development at Maryland will take limited amounts of information from the low-resource languages, churn it with other language-related data from better-resourced languages, and come up with powerful new tools that will allow for the manipulation and transformation of content in those languages.
“In order for us to be able to do this kind of work, we need the ability to build new computing infrastructures that weren’t the same ones people were using as recently as five years ago,” says Carpuat, an expert in multilingual text analysis who is working on machine translation capabilities for SCRIPTS.
Perhaps of greatest significance, the researchers say, is that SCRIPTS is designed to incorporate four key areas of language processing (speech recognition, machine translation, cross-language retrieval, and information summarization) into one robust platform.
“Translation, retrieval, and summarization are all areas that CLIP has previously excelled in,” says Resnik, a computational linguist who is the current director of the CLIP lab. “But these tasks all needed to be done within separate systems. Now—with the use of deep learning neural networks—it allows us to combine functions and do a single ‘training’ of the system across multiple functions quickly and efficiently.”
Resnik says that in addition to the four UMD faculty, CLIP has added a postdoc and a research staff member to work on the IARPA project. There are also five UMD doctoral students involved with the research.
Looking ahead, the CLIP lab faculty envision even more powerful computing systems being used to assist with multilingual information management.
“Computational methods evolve rapidly,” says Oard, who notes that the Maryland team is already working across a full range of modern computing architectures—from high-performance computing, to the latest distributed processing systems, to deep learning clusters.
In the future, he adds, the researchers might even consider the next-generation quantum computing techniques being developed at UMD.
“We work together with sponsors like IARPA to leverage these technologies in the service of our society, to help transform the way we all can take best advantage of the increasingly information-abundant world in which we live,” Oard says.
About CLIP: The Computational Linguistics and Information Processing (CLIP) Laboratory at the University of Maryland is engaged in designing algorithms and building systems that allow computers to effectively and efficiently perform language-related tasks. CLIP is one of 16 labs and centers in UMIACS.
About IARPA: Launched in 2006, the Intelligence Advanced Research Projects Activity invests in high-risk, high-payoff research programs that address some of the most difficult scientific challenges faced by the U.S. intelligence community.