Data Scientists Plan to Improve Machine Learning Algorithms for Grouping Objects

Computer science researchers Andrew McCallum and Akshay Krishnamurthy have received a four-year, $1.1 million grant from the National Science Foundation to study new ways to use machine learning and algorithms for automatically grouping massive numbers of objects based on similarities, also known as extreme clustering.

Krishnamurthy is currently on leave from campus while at Microsoft Research.

McCallum is director of the Center for Data Science, director of the Information Extraction and Synthesis Laboratory and a Distinguished Professor in the College of Information and Computer Sciences. As he explains, there are 4 million photos of animals online at Flickr; of those, 2.6 million portray a dog, of which 302,000 are Labrador retrievers. To make these images more useful and accessible to users, it helps to group them by specific attributes such as coat color, size and age. Doing this for all 2.6 million dog photos is where extreme clustering comes in: it offers a way to assign a massive number of objects to different groups, he says.

McCallum and Krishnamurthy’s research will develop new machine learning algorithms for extreme clustering that scale to both massive numbers of objects and massive numbers of clusters. They plan to demonstrate the algorithms’ applicability on multiple datasets, such as chemical compounds for materials science discovery, single-cell genome data, and bibliographic databases with ambiguous author names.

According to doctoral students Nicholas Monath and Ariel Kobren, who work closely with McCallum on this project, advances in clustering are expected to lead to more precise analysis of large datasets, improving users’ ability to discover trends in a wide variety of domains such as medicine, academic publishing and social networks. Clustering also should lead to better tools and applications for users interacting with big datasets.

This research builds on McCallum, Monath and colleagues’ award-winning work on rapidly removing inventor ambiguity from patent records, which in 2016 won an international competition sponsored by the U.S. Patent and Trademark Office. Data queries there sometimes required time-consuming manual intervention because many inventors may have filed patents under the same or similar names. The UMass Amherst algorithm made 225 years of patent and trademark data more available to the public, innovators, businesses and policy makers.

McCallum has published more than 250 papers in many areas of artificial intelligence, including natural language processing, machine learning, data mining and reinforcement learning, and his work has received more than 60,000 citations.