UMass Amherst Researcher Leads a Breakthrough in Diagnosing Software Failures

AMHERST, Mass. – Generations of computer practitioners have dreamed of never having to debug software failures. Debugging has become even more daunting in the multi-core era, since concurrency bugs depend heavily on specific task interleavings that are extremely difficult to reproduce. For such reasons, programmers generally spend 75% of their working time on debugging, equating to more than 1,500 hours a year. Fortunately, a research team led by Tongping Liu of the University of Massachusetts Amherst has designed and implemented a novel tool that can automatically diagnose software issues within seconds. The tool is expected to put an end to hard-to-debug issues, especially for software running in deployed environments.

“No matter how much technology advances, software failure is always a concern,” said Liu, assistant professor of electrical and computer engineering at UMass. “Our novel diagnosis system, Watcher, can pinpoint root causes of program failures within the failing process, or ‘in-situ,’ eliminating privacy concerns. Watcher only reports failure reasons to programmers, instead of sharing the whole memory coredump.”

Watcher combines identical record-and-replay, binary analysis, dynamic analysis, and hardware support to perform the diagnosis without human involvement. Research describing the Watcher system was recently published in the Proceedings of the Association for Computing Machinery on Programming Languages (OOPSLA’20), a top conference in the programming languages field. Liu led the research team, which included scientists from Purdue University, the University of Texas at San Antonio, and the University of Illinois at Urbana-Champaign. Liu started this project while working at the University of Texas at San Antonio, together with his Ph.D. student Hongyu Liu, who was part of the team.

Software failures can affect many aspects of everyday life, from financial transactions to the systems used by retail stores and large institutions. Liu says that Watcher can be easily deployed without requiring custom hardware, a custom operating system, program modification, or recompilation. The research team evaluated Watcher on 24 program failures in real-world deployed software, including large-scale applications. The team’s results show that Watcher can accurately identify the root causes in only a few seconds.

Liu, who studies software systems and security, says that software always contains latent bugs. Although software testing helps identify these bugs, developers are often pressured to rush software to market without comprehensive testing. It’s also impossible, he says, to eliminate all bugs in large software via testing, especially concurrency bugs. As a result, bugs often escape in-house testing undetected and lurk in production, where they cause system crashes, program hangs, or security breaches.

Watcher enables software developers to diagnose software failures within the failing process. It is built on top of the team’s own record-and-replay system (iReplayer, published at PLDI’18), which divides the whole execution into multiple epochs (i.e., intervals of execution) and supports endless re-executions of the last epoch. More importantly, iReplayer ensures identical re-executions even for parallel applications. That is, the same instruction in the re-execution performs the same read/write from/to the same memory unit/register as in the original execution. This provides a strong basis for failure diagnosis.
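The epoch idea can be illustrated with a toy sketch. This is not iReplayer's interface: the `EpochReplayer` class, its method names, and the `step` function are hypothetical, and real record-and-replay works on threads, system calls, and memory rather than on a Python dictionary. The sketch only shows the principle of checkpointing at an epoch boundary, logging nondeterministic inputs, and re-executing the last epoch with identical results.

```python
import random
from copy import deepcopy


class EpochReplayer:
    """Toy epoch-based record-and-replay (hypothetical, for illustration only)."""

    def __init__(self, state):
        self.state = state
        self.checkpoint = deepcopy(state)  # state at the start of the current epoch
        self.log = []                      # nondeterministic inputs seen this epoch

    def begin_epoch(self):
        """Start a new epoch: snapshot the state and discard the old log."""
        self.checkpoint = deepcopy(self.state)
        self.log = []

    def run(self, step, source):
        """Original execution: take an input from `source`, log it, apply it."""
        value = source()
        self.log.append(value)
        step(self.state, value)

    def replay(self, step):
        """Re-execute the last epoch from its checkpoint using the logged inputs."""
        state = deepcopy(self.checkpoint)
        for value in self.log:
            step(state, value)
        return state


def step(state, value):
    # The "program": its result depends on a nondeterministic input.
    state["total"] += value
    state["history"].append(value)


replayer = EpochReplayer({"total": 0, "history": []})
replayer.run(step, lambda: random.randint(0, 9))  # first epoch
replayer.begin_epoch()                            # last epoch starts here
for _ in range(3):
    replayer.run(step, lambda: random.randint(0, 9))

# The last epoch can be re-executed any number of times with the same result.
assert replayer.replay(step) == replayer.state
assert replayer.replay(step) == replayer.state
```

Because every re-execution consumes the logged inputs instead of fresh random values, each replay ends in exactly the state the original run reached, which is the property a diagnosis pass relies on.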

The basic idea of Watcher can be seen in the above figure. Watcher focuses on explicit program crashes that generate explicit failure signals, which trigger on-demand diagnosis within re-executions. Watcher is based on a key observation: many program crashes are caused by writing an incorrect or invalid value to a memory unit. Based on this, Watcher proposes a hybrid analysis. Upon a crash, Watcher first identifies the relevant memory unit that contributes to the crash. It uses binary analysis to identify all possibilities, and then confirms the actual one with dynamic analysis, with the assistance of debugging points called breakpoints. In software development, a breakpoint is an intentional stopping or pausing place in a program, put in place for debugging purposes.
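The split between a static pass that lists possibilities and a dynamic pass that confirms them can be sketched as follows. The function name, the instruction addresses, and the trace are all made up for illustration; Watcher itself sets real breakpoints in a re-executed process rather than scanning a list.

```python
def confirm_candidates(static_candidates, replay_trace):
    """Keep only the statically possible instructions that really executed.

    static_candidates -- addresses a static (binary) analysis says *could* matter
    replay_trace      -- addresses in execution order during one re-execution,
                         ending at the crash
    """
    breakpoints = set(static_candidates)  # one breakpoint per candidate
    hits = []
    for addr in replay_trace:
        if addr in breakpoints:           # the breakpoint fires
            hits.append(addr)
    return hits


# Hypothetical data: three candidates, of which only two execute before the crash.
candidates = [0x401000, 0x401020, 0x401040]
trace = [0x400FF0, 0x401000, 0x401010, 0x401020, 0x401050]

confirmed = confirm_candidates(candidates, trace)
assert confirmed == [0x401000, 0x401020]
```

The static pass is allowed to over-approximate, since any candidate that never fires during the re-execution (here `0x401040`) is simply dropped.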

Watcher places breakpoints on all possible instructions, and then employs the “last-win” rule to remove irrelevant instructions/branches. After identifying the memory unit, Watcher tracks the origin of the erroneous value in that memory unit with another type of debugging point, the watchpoint. It installs watchpoints on the specific memory unit so that it can collect all relevant instructions that read or write this memory unit. With static analysis and the “last-win” rule, it can then determine the failing instruction. Watcher is able to trace the erroneous value transitively and report the root cause chain that crashes a program.
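One way to picture the “last-win” rule and the transitive tracing is the sketch below. It assumes a simplified model in which each watchpoint hit is recorded as a `Write` with a destination and an optional source memory unit; the record type, the instruction labels, and the example trace are hypothetical and are not taken from Watcher.

```python
from typing import NamedTuple, Optional


class Write(NamedTuple):
    instr: str           # label of the writing instruction (hypothetical)
    dest: str            # memory unit that was written
    src: Optional[str]   # memory unit the value came from, or None for a constant


def last_write(trace, unit, before):
    """Index of the last write to `unit` strictly before position `before`."""
    for i in range(before - 1, -1, -1):
        if trace[i].dest == unit:
            return i
    return None


def root_cause_chain(trace, crashing_unit):
    """Follow the erroneous value backwards, keeping only the last writer each time."""
    chain = []
    unit, before = crashing_unit, len(trace)
    while unit is not None:
        i = last_write(trace, unit, before)
        if i is None:
            break
        chain.append(trace[i].instr)
        unit, before = trace[i].src, i  # continue with the value's source
    return chain


# Hypothetical trace: the crash dereferences `ptr`, which was overwritten with NULL.
trace = [
    Write("I1: buf = malloc()", "buf", None),
    Write("I2: tmp = NULL", "tmp", None),
    Write("I3: ptr = buf", "ptr", "buf"),
    Write("I4: ptr = tmp", "ptr", "tmp"),
]

assert root_cause_chain(trace, "ptr") == ["I4: ptr = tmp", "I2: tmp = NULL"]
```

In this model, `I3` is discarded because `I4` wrote to `ptr` later, and the chain then continues through `tmp` to the instruction that introduced the bad value.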

Watcher also employs software breakpoints to reduce the number of re-executions and to go beyond the limited number of debug registers that most modern hardware provides. When tracing hardware, such as Intel Processor Trace, is available, Watcher can employ it to further reduce the number of re-executions and to eliminate false positives caused by control-flow hijacks, in which execution is redirected to a target location that cannot be reached in a normal execution.

One of the advantages of Watcher is that it can be used in all phases of software development, staging, deployment, and production. For instance, Watcher can be extremely useful for identifying program failures in autonomous-driving systems, which cannot transfer a whole memory image because of wireless communication or telecommunication bottlenecks. Instead, Watcher identifies the root causes of failures and transmits only the failure statement to programmers. This is possible because Watcher proposes the first in-situ diagnosis that can identify software failures within the failing process, overcoming multiple issues of offline analysis or static analysis.

Watcher: In-Situ Failure Diagnosis for In-Production Software YouTube Video