AI & RoboticsNews

MIT and IBM develop AI that recommends documents based on topic

Even the best text-parsing recommendation algorithms can be stymied by data sets of a certain size. In an effort to deliver faster, better classification performance than the bulk of existing methods, a team at the MIT-IBM Watson AI Lab and MIT’s Geometric Data Processing Group devised a technique that combines popular AI tools including embeddings and optimal transport. They say that their approach can scan millions of possibilities given only the historical preferences of a person, or the preferences of a group of people.

“There’s a ton of text on the internet,” said lead author on the research and MIT assistant professor Justin Solomon in a statement. “Anything to help cut through all that material is extremely useful.”

To this end, Solomon and colleagues’ algorithm summarizes collections of text into topics based on commonly used words in the collection. Next, it divides each text into its five to 15 most important topics, ranking each by its importance to the text overall. Embeddings, numerical representations of data (in this case words), help quantify the similarity among words, while optimal transport calculates the most efficient way of moving objects (or data points) among multiple destinations.
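Computing an optimal-transport cost between two topic distributions amounts to solving a small linear program. The sketch below illustrates the general idea with SciPy; the topic weights, cost matrix, and function name are invented for illustration and are not taken from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def optimal_transport_cost(p, q, C):
    """Minimum cost of moving distribution p onto distribution q,
    where C[i, j] is the cost of moving one unit of mass from
    source topic i to target topic j."""
    n, m = C.shape
    # Equality constraints on the flattened transport plan T (n*m vars):
    # each row of T must sum to p[i], each column to q[j].
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # row-sum constraint for source i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # column-sum constraint for target j
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Two documents as weighted topic distributions (hypothetical weights),
# with a cost matrix of pairwise topic distances, e.g. from embeddings.
p = np.array([0.6, 0.4])           # doc A: weights over 2 topics
q = np.array([0.5, 0.3, 0.2])      # doc B: weights over 3 topics
C = np.array([[0.1, 0.8, 0.9],     # cost of moving mass between topics
              [0.7, 0.2, 0.4]])
print(optimal_transport_cost(p, q, C))  # minimum transport cost, ≈ 0.24
```

A lower cost means the two documents' topic mixes can be aligned cheaply, i.e. the documents are thematically similar.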

The embeddings make it possible to leverage optimal transport twice — first to compare topics within the collection and then to measure how closely common themes overlap. This works especially well when scanning large collections of books and documents, according to the researchers; in an evaluation involving 1,720 pairs of titles in the Gutenberg Project data set, the algorithm managed to compare all of them in one second, or more than 800 times faster than the next-best method.
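The two-level use of optimal transport can be sketched with a toy example: word-level transport (over embedding distances) prices each pair of topics, and topic-level transport then compares documents. All embeddings, topics, and helper names below are invented for illustration:

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost(p, q, C):
    """Exact optimal-transport cost between discrete distributions p and q."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

# Hypothetical 2-D word embeddings.
emb = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.9, 0.1]),
       "bond": np.array([0.0, 1.0]), "stock": np.array([0.1, 0.9])}

# Topics as distributions over words.
topics = {"pets":    {"dog": 0.5, "cat": 0.5},
          "finance": {"bond": 0.5, "stock": 0.5}}

def topic_cost(t1, t2):
    """Level 1: transport over word embeddings prices a pair of topics."""
    w1, w2 = list(topics[t1]), list(topics[t2])
    C = np.array([[np.linalg.norm(emb[a] - emb[b]) for b in w2] for a in w1])
    return ot_cost(np.array([topics[t1][w] for w in w1]),
                   np.array([topics[t2][w] for w in w2]), C)

def doc_distance(d1, d2):
    """Level 2: transport over topic distributions, using level-1 costs."""
    t1, t2 = list(d1), list(d2)
    C = np.array([[topic_cost(a, b) for b in t2] for a in t1])
    return ot_cost(np.array([d1[t] for t in t1]),
                   np.array([d2[t] for t in t2]), C)

doc_a = {"pets": 0.8, "finance": 0.2}
doc_b = {"pets": 0.3, "finance": 0.7}
print(doc_distance(doc_a, doc_a))  # ~0: identical topic mixes
print(doc_distance(doc_a, doc_b))  # positive: differing topic mixes
```

Because topic-to-topic costs are computed once and reused across every document pair, comparing two documents only requires solving a transport problem over their handful of topics, which is what makes the approach fast on large collections.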

Moreover, the algorithm sorts documents better than rival methods, for example grouping books in the Gutenberg data set by author and product reviews on Amazon by department. It’s also more explainable in that it provides lists of topics, enabling users to better understand why it’s recommending a given document.

The researchers leave to future work developing an end-to-end training technique that optimizes the embedding, topic models, and optimal transport jointly as opposed to separately, as with the current implementation. They also hope to apply their approach to larger data sets, and to investigate applications to the modeling of images or three-dimensional data.

“[Our algorithm] appears to capture differences in the same way a person asked to compare two documents would: by breaking down each document into easy to understand concepts, and then comparing the concepts,” wrote Solomon and coauthors in a paper summarizing their work. “[W]ord embeddings provide global semantic language information, while … topic models provide corpus-specific topics and topic distributions. Empirically these combine to give superior performance on various metric-based tasks.”


Author: Kyle Wiggers
Source: Venturebeat
