Massive Data

Designing solutions for large-scale graphs

“Many of the graphs that we need to process these days are massive and are constantly changing. This necessitates new approaches to the design and analysis of algorithms for these graphs.”
- Andrew McGregor

With the exponential growth of big data across far-ranging applications comes the need to design and develop algorithms that can handle it. UMass Amherst computer scientist Andrew McGregor, an algorithm designer and expert in processing massive data sets and data streams, clustering, and information theory, is working to address the particular challenge of large-scale graphs.

Many types of data are naturally represented as graphs, from the transportation networks behind Google Maps to social networks such as Facebook and Twitter, and protein-protein interaction networks in biology. Implicit in the design of many algorithms is the assumption that the graphs are static and small enough to fit in the main memory of a single machine. In many applications, however, these assumptions are no longer reasonable.

“Many of the graphs that we need to process these days are massive and are constantly changing,” says McGregor. “For example, the web graph has over ten billion nodes and hyperlinks are constantly being added or removed. This necessitates new approaches to the design and analysis of algorithms for these graphs.”

McGregor explains that algorithms whose running time is even quadratic in the size of the graph can often no longer be considered practical. Algorithms may also need to handle data that is distributed across numerous machines or arrives as a data stream. McGregor and his research group have been investigating various aspects of the problem.
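
To make the data-stream setting concrete, here is a minimal Python sketch (an illustration of the standard semi-streaming approach, not code from McGregor's group): the edges of a graph arrive one at a time, and a union-find structure with one entry per node is enough to count connected components in a single pass, without ever storing the edges themselves.

```python
# Counting connected components over a stream of edges.
# Memory is proportional to the number of nodes, not the number of edges.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def count_components(num_nodes, edge_stream):
    """One pass over the edge stream; each edge is seen once and discarded."""
    uf = UnionFind(num_nodes)
    for u, v in edge_stream:
        uf.union(u, v)
    return len({uf.find(x) for x in range(num_nodes)})


# Toy example: 5 nodes, edges streamed in arbitrary order -> 2 components.
print(count_components(5, iter([(0, 1), (1, 2), (3, 4)])))
```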

One approach to dealing with a massive graph is to first compress it in a way that approximately preserves the graph’s relevant properties. The idea is analogous to using MP3 and JPEG files on a cell phone rather than the original songs and images. While some of the information is lost in the translation, the aim is to make the difference imperceptible to the user.

It is typically faster to process a compressed graph than an uncompressed one simply because the input is smaller. The compressed graph may also fit in main memory, thus avoiding input/output bottlenecks, and even if the graph data is stored across many machines, compressing the input reduces communication overhead.
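
One textbook example of compressing a graph while approximately preserving its properties is a greedy "spanner": an edge is dropped whenever its endpoints are already close in the compressed graph, so every distance is preserved up to a chosen stretch factor. The Python sketch below is a standard illustration of that idea, not the specific compression studied by McGregor's group.

```python
# Greedy spanner with stretch 3: every distance in the compressed graph
# is at most 3x the corresponding distance in the original graph.
from collections import deque

def bounded_distance(adj, src, dst, limit):
    """BFS distance from src to dst, giving up beyond `limit` hops."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == limit:
            continue
        for nxt in adj[node]:
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return limit + 1  # "far enough" sentinel


def greedy_spanner(num_nodes, edges, stretch=3):
    adj = {v: set() for v in range(num_nodes)}  # adjacency of the spanner so far
    kept = []
    for u, v in edges:
        # Keep the edge only if its endpoints are not already close.
        if bounded_distance(adj, u, v, stretch) > stretch:
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept


# Toy example: a 4-cycle plus a chord; two of the five edges are dropped
# because their endpoints are already within 3 hops in the spanner.
print(greedy_spanner(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))
```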

A popular technique for compressing some types of data is to use “random projections”. For example, it is possible to represent a collection of text documents as vectors where the entries encode the frequency of the different words in the corresponding document. The similarity of two documents can then be measured in terms of the distance between the corresponding vectors. By first applying random projections to the vectors, it is possible to dramatically reduce their dimension without significantly perturbing the distance between any pair of vectors.
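
Here is a minimal Python sketch of that idea, using a made-up term-frequency matrix whose size and contents are chosen purely for illustration: each document vector is multiplied by the same random matrix, which with high probability roughly preserves pairwise distances while shrinking the dimension dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, sketch_dim, num_docs = 10_000, 200, 3
# Toy "documents": each entry is a word count for one vocabulary item.
docs = rng.poisson(0.01, size=(num_docs, vocab_size)).astype(float)

# Random projection matrix, scaled so squared norms are preserved in expectation.
projection = rng.normal(0.0, 1.0 / np.sqrt(sketch_dim), size=(vocab_size, sketch_dim))
sketches = docs @ projection  # each document now lives in 200 dimensions

for i in range(num_docs):
    for j in range(i + 1, num_docs):
        original = np.linalg.norm(docs[i] - docs[j])
        compressed = np.linalg.norm(sketches[i] - sketches[j])
        print(f"docs {i},{j}: original distance {original:.2f}, "
              f"after projection {compressed:.2f}")
```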

It initially seemed unlikely that the random projections technique, also known as sketching, would be of use for resolving graph problems. In particular, the solution to a graph problem can be sensitive to slight changes in the input.

“We actually spent a considerable amount of time failing to prove impossibility results,” admits McGregor, “and I remember very clearly the moment I realized there was a good reason: because it was, in fact, possible.”

The solution devised by McGregor and his colleagues combines ideas from spectral graph theory with a new sampling technique developed in the context of data stream processing. McGregor’s research group has been working on extending these results to other classic graph problems. Applications include faster data structures for dynamic graphs, verification of outsourced graph processing, and new “sliding window” algorithms that ensure no outdated information corrupts the graph being processed.
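
The full construction involves more machinery, but the toy Python snippet below illustrates the kind of cancellation that makes sketching graphs possible at all: if each node keeps a compact summary of its incident edges, the summaries of any group of nodes can be combined so that edges inside the group cancel out, leaving information only about the edges that leave the group. The integer encoding used here is invented purely for illustration and is not the data structure from the actual research.

```python
from functools import reduce

def edge_id(u, v):
    """Encode an undirected edge as a single integer identifier."""
    u, v = min(u, v), max(u, v)
    return (u << 20) | v  # assumes node ids fit in 20 bits

def node_summaries(num_nodes, edges):
    """Each node stores the XOR of the identifiers of its incident edges."""
    summary = [0] * num_nodes
    for u, v in edges:
        summary[u] ^= edge_id(u, v)
        summary[v] ^= edge_id(u, v)
    return summary

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]  # a triangle plus a pendant edge
summary = node_summaries(4, edges)

# Combine the summaries of the triangle {0, 1, 2}: its three internal edges
# each appear twice and cancel, and the single edge crossing to node 3 is
# recovered exactly.
combined = reduce(lambda a, b: a ^ b, (summary[v] for v in [0, 1, 2]))
print(combined == edge_id(2, 3))  # True
```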

McGregor is a member of the theory group in the School of Computer Science, which uses mathematical techniques to study problems across computer science, working on topics ranging from network algorithms to computational complexity theory.

Adapted from Significant Bits news (Computer Science)