Life, Death, and the Critical Transition:
Finding Liveness Bugs in Systems Code
Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat
University of California San Diego
{ckillian,jwanderson,jhala,vahdat}@cs.ucsd.edu

Abstract

Modern software model checkers find safety violations: breaches where the system has entered some bad state. For many environments, however, particularly complex concurrent and distributed systems, we argue that liveness properties are both more natural to specify and more important to check. Liveness conditions specify desirable system conditions in the limit, with the expectation that they will be temporarily violated, perhaps as a result of failure or during system initialization.

Existing software model checkers cannot verify liveness because doing so requires finding an infinite execution that never satisfies one or more liveness properties. We present algorithms to find liveness violations with high probability and the critical transition that moves the system from an indeterminate state, where liveness can still be achieved despite a temporary violation, to a dead state, where it becomes impossible to ever achieve liveness. We present MaceMC, a software model checker that implements our algorithms and finds complex liveness errors in implementations of PASTRY, a reliable transport protocol, and an overlay tree.

1 Introduction

There has been substantial recent progress in employing model checking to find bugs in unmodified systems implementations [15, 27]. Model checking can improve code quality by finding bugs that violate system-level correctness properties, by systematically analyzing carefully constructed executions of the system. These runs correspond to the exhaustive exploration of all possible interleavings of events and other sources of non-determinism, such as thread scheduling, message transmission, and I/O latency, up to a certain maximum number of events.

Existing software model checkers find violations of safety conditions. For example, a developer may specify that a pointer should never be dereferenced after it is freed, or that a node's routing table should never accept advertised routes that contain the node itself, to avoid loops. Thus, finding bugs with model checking currently requires the programmer to have intimate knowledge of the low-level actions or conditions that could result in system failure. We contend that for complex systems the desirable behaviors of the system may be specified more easily than identifying everything that could go wrong. Of course, specifying both desirable conditions and safety assertions is valuable; however, current model checkers do not have any mechanism for verifying whether desirable system properties can be achieved. Examples of such properties include: i) a reliable transport eventually delivers all messages even in the face of network losses and delays, ii) all nodes eventually join an overlay, and iii) a distributed tree partitioned into two halves can eventually merge. These global liveness requirements specify that, in the limit, the system should achieve some particular condition.

Unfortunately, identifying liveness violations poses a much greater challenge than finding safety violations. For safety properties, simply finding a state that violates the given condition proves the violation, whereas temporary violations of liveness conditions are expected and quite common. For instance, flagging a liveness violation in a temporarily partitioned network is premature. Rather, the model checker must verify that the system has entered a dead state where it becomes impossible to ever recover to a live state. Model checkers that operate over finite-state specifications locate liveness violations by finding a cyclic path without any live states. This approach is infeasible when checking real systems (that have infinite state spaces) because such repeating sequences typically occur beyond the horizon that can be exhaustively searched (a few dozen system events), and because complex system interactions (e.g., a timer unrelated to an error firing periodically) mean that a repeating sequence may not be found even if a search to sufficient depth were possible.

The primary contribution of this work is a model checker and set of algorithms capable of finding both safety and liveness violations in systems implementations.
To find liveness violations, we first perform an exhaustive search to some maximum depth (limited by CPU time, memory, or both). From this periphery we then execute strategically chosen random walks to sample the space beyond that which can be exhaustively searched, in order to locate potential liveness violations. From each such initial state we perform additional sampling to ensure a low rate of false positives in the set of violations reported to the developer.
Next, we address the difficult task of finding the actual system error that causes a particular liveness violation. Unfortunately, the initial state where the system became non-live very likely has little to do with the error; rather, a much later operation moves the system into a dead state from which recovery becomes impossible. To limit the tedium of wading through hundreds of states to find the error, we developed: an algorithm to isolate the critical transition (the point after which the system can never recover), which is likely to be the source of the error; an interactive debugging tool, MDB, that allows forward and backward stepping through global events and per-node state inspection; and automatically generated event graphs to visualize system behavior.
We have built a fully functional model checker, MaceMC, for analyzing safety and liveness properties of unmodified, complex systems implementations. We have run MaceMC on a number of systems, including a reliable transport, an implementation of PASTRY [30], and an overlay tree. All of these systems are mature pieces of code with the easy bugs previously eliminated through live deployment. We have used MaceMC to find 33 bugs across these systems. After developing MDB and event graph visualization, we reduced the human time required to find and fix liveness violations from a few hours to 10–20 minutes.

2 Background

Model checkers find bugs by isolating executions that violate some desired property. Previous efforts to model check systems implementations have found bugs corresponding to violations of safety properties, such as null-pointer dereferences and deadlocks. Safety properties specify undesirable conditions that must never occur, rather than specifying what the system ought to be doing.

Thus, while acknowledging that safety violations provide valuable checks for known error conditions, we argue that finding liveness violations is more useful and more fundamental to building robust systems. That is, liveness properties often more easily and succinctly capture the global, system-spanning correctness requirements for systems in steady state; systems may diverge temporarily, but they should achieve the desirable conditions after the transient condition subsides.

To illustrate the difficulty of locating liveness violations, consider the example in Figure 1, corresponding to model checking the state space of a system. This example uses a liveness predicate P as well as some safety conditions. The live states are consistent, i.e., satisfy P. Indeterminate states do not satisfy P but may after some further execution. Dead states do not satisfy P and can never in the future recover to a live state. Finally, unsafe regions violate some safety property; a system should never enter an unsafe region, and it is usually easy to verify whether it has done so. In this figure, the model checker performs a bounded depth-first search (BDFS) starting from some initial state at the center of the figure. The checker could flag the indeterminate states at the periphery of its exhaustive search as potential violations. Unfortunately, most of the states at the periphery of this search are indeterminate, as it often takes the system many more steps than the maximum search depth to satisfy the liveness condition.

Instead of checking that the system eventually reaches a live state, we could restrict ourselves to checking time-bounded liveness properties, wherein we require that a live state be reached within some bounded number of transitions. For our target systems, the number of transitions required to reach the live state is often quite large. For example, many messages may need to be exchanged before all nodes join an overlay. Thus, for a bounded liveness violation to correspond to a real liveness violation, the time bound must be large enough (often thousands of transitions even for distributed systems consisting of two nodes) to give the system sufficient opportunity to reach a live state and to prevent a large rate of false positives corresponding to paths not given sufficient opportunity to achieve liveness. Of course, it is infeasible to exhaustively search to these required depths.

To further highlight the relationship between safety and liveness violations, consider the following real example. A liveness property for an overlay tree may state that eventually all nodes form a spanning tree. This sufficiently general property applies to a variety of tree construction protocols. However, the liveness condition that the nodes form a tree may be violated for long durations, e.g., at system startup and as a result of failure, and thus cannot be expressed as a safety property. Any safety checker [15, 27] would have to flag states not satisfying the liveness condition as violations, which would result in a flood of false positives, as the majority of those states would be able to recover to a live state.

We ran our model checker on an implementation of an overlay tree to find violations of this liveness property. We found a path to a dead state characterized by two disjoint trees that have no hope of eventually forming a spanning tree. This error resulted from a recovery timer firing during the unanticipated condition that the node was busy joining a potential parent, after which the timer remained unscheduled. The unscheduled recovery timer left no mechanism to merge the disjoint trees. After understanding this error, we realized a new per-node safety property: the recovery timer must always be scheduled when the node is in a non-initial state. Using this safety condition, the model checker located two additional errors.
In our experience, fixing a global liveness violation often suggests additional per-node safety conditions. These safety conditions help isolate yet more errors, but typically would not be obvious a priori. We believe that an attractive model for identifying errors starts with writing high-level liveness properties for desirable system states in the limit. Errors resulting from these violations often provide the insight necessary to add safety properties. While liveness properties are more convenient to specify, finding safety violations is typically easier, because safety properties often do not involve complex global state and because the error is usually close to the operation that triggers a safety violation.
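To make the distinction concrete, the following sketch (ours, not code from the systems above; GlobalState and NodeState are hypothetical stand-ins for the checker's view of the system) expresses the overlay tree example as one global liveness predicate plus the per-node safety predicate learned from it:

#include <vector>

// Hypothetical state types standing in for the checker's global view.
struct NodeState {
  bool inInitialState;          // node has not yet joined
  bool recoveryTimerScheduled;  // recovery timer currently scheduled
};
struct GlobalState {
  std::vector<NodeState> nodes;
  bool formsSpanningTree() const; // assumed topology check
};

// Liveness: eventually, all nodes form a single spanning tree.
// Temporary violations (startup, failures) are expected.
bool live(const GlobalState& g) { return g.formsSpanningTree(); }

// Safety, suggested by the liveness bug above: the recovery timer must
// be scheduled whenever a node has left its initial state.
bool safe(const NodeState& n) {
  return n.inInitialState || n.recoveryTimerScheduled;
}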
While previous work shows how to model check liveness properties of mathematical system descriptions (typically compiled into finite-state models), we develop new techniques to find and fix liveness errors in the implementations. System implementations contain not only flaws in high-level algorithms or protocols, but also subtle errors due to specific implementation choices. A software model checker capable of finding both safety and liveness violations requires addressing several key challenges. We outline these challenges here before discussing our solutions in subsequent sections.
How to systematically explore all executions of implementations? The model checker must read a system implementation and systematically explore all executions.
The system to be analyzed spans multiple nodes executing concurrently and communicating via messages sent over a network. Each node executes an entire computation stack that interacts with a networking layer below and an application layer above. We must isolate all points where the code makes a choice based on a non-deterministic or random input, and systematically explore “what might have been” had the system made different choices. For distributed systems, sources of non-determinism include different ways in which: i) scheduling affects the order nodes execute, ii) applications interact with the system, iii) the physical network delays or drops messages, iv) scheduled timers fire, v) nodes fail, vi) the system makes random number requests, and vii) all of the above may be interleaved. We require a way to expose to the checker the sources of non-determinism, and a technique to systematically explore the multiple potential resolutions to individual sources of non-determinism, and thus, the ways the system may execute.
How to mitigate state explosion? By searching all resolutions to sources of non-determinism, a model checker systematically explores different executions corresponding to the temporal interleaving of events. The space of all possible event orderings grows exponentially with the depth of the execution and the number of nodes in the system. This state-explosion problem can severely limit the applicability of model checking to short executions of small distributed systems. However, many subtle problems only manifest after the system has run sufficiently long to enter "steady" state (e.g., after all nodes have initially joined). Finding such bugs deep in the state space requires techniques to mitigate the state-explosion problem and to prune either uninteresting sequences or sequences that have already been explored.

Figure 1: Exploring a system's state space for safety and liveness violations.
How to tell the time? Distributed systems make extensive use of time, for example, to determine the latency between two nodes (timestamps), to tell if another node is still alive (timeouts), to delay performing certain tasks (timers), or as an easy source of monotone values (ordering). A model checker cannot use the actual implementations of the time system calls when analyzing the code, because doing so would make replaying an error path impossible: subsequent runs would return different values. Furthermore, by using actual time, the model checker would be restricted to exploring only states corresponding to the particular values of time in a given run.

3 Model Checking with MaceMC

MaceMC runs code written in the MACE programming environment [2]. We briefly outline the aspects of MACE relevant to model checking. MACE introduces syntax to structure each node as a state machine with transitions corresponding to events such as message reception, timers firing, etc. Most of a MACE implementation consists of C++ code in appropriately identified code blocks describing system state variables and event handlers. The MACE compiler outputs C++ code ready to run across the Internet. MaceMC model checks this resulting C++ code. Our approach should thus be general to any system implementation and could be incorporated into other software model checkers. In this paper, we focus on applying model checking to distributed systems, where complex protocols and asynchronous environments make model checking, especially for liveness properties, particularly valuable. Relative to state-of-the-art model checkers [15, 27], leveraging MACE code principally gives us the advantage of eliminating the need to modify source code to isolate the execution of the system, e.g., to control network communication events, to reason about time, and to handle other sources of potential input to the system. We designed MACE with model checking in mind, including programming constructs that allow MACE-implemented systems to be model checked without modification, dramatically improving the ability of the typical programmer to leverage the benefits of model checking. Note, however, that our techniques and algorithms are general to a variety of software model checkers.

One benefit of M ACE is that the programmer need not painstakingly isolate the sources of non-determinism and replace them with these special calls. M ACE provides language support and common libraries for network communication and scheduling timers, two of the principal sources of non-deterministic asynchrony in distributed systems. For example, when checking a UDP message queue to determine the next event, there may be a variety of message orderings possible that could affect the behavior of the system. Our simulated network layer calls
Choose with the number of available messages as input and uses the returned value to decide which message to process next. Our modular decomposition moreover allows M ACE MC to investigate how the system would behave on networks where, for example, many packets are lost or reordered without having to find a real network with those properties. To explore the robustness of the target code, M ACE MC may choose to drop certain messages or reset socket connections with some given probability.
3.1 Systematically Exploring Executions
A simple driver application that initializes the system and carries out some task comprises the final component
We take an approach similar to Verisoft [15] and
CMC [27] where we use a driver application suitable for necessary to model check the system. In our experience, simulation and load different simulator-specific libraries, simple tasks, such as joining an overlay tree and sending specifically for random number generation, timer schedul- a message from the root to all participants, are sufficient ing, and message transport, so that our model checker can to uncover a significant number of bugs because of the explore a variety of event interleavings for a particular variety of corner cases and failure conditions exercised by system configuration and input condition. Possible event the model checker. interleavings include the relative ordering of: i) executions
Algorithm 1 MaceMC Simulator of individual nodes in the system, ii) application events,
Reset event simulators iii) message delivery, and iv) timers and other schedulReset Nodes ing events relative to message delivery. To exhaustively
Initialize Choose with next sequence explore different event interleavings of the system, we for step = 0 to maxstep do must ensure that the model checker controls all sources
Signal divergence monitor thread of non-determinism in the system. Hence, we ensure that readyEvents = set of pending node, event pairs the system calls Choose, the standard M ACE interface if readyEvents is empty then to randomness, when making a non-deterministic choice. break; The result from Choose (described below) determines node, event = readyEvents[Choose(|readyEvents|)] the next branch in the execution. The model checker exeSimulate event on node cutes the actual system code to the next non-deterministic if a safety property is not true then event decision point, and by iteratively feeding the system return S AFETY V IOLATION F OUND different sequences of values at the choice points, it can else if all liveness conditions satisfied then exhaustively explore executions corresponding to differbreak ent event interleavings in a depth first manner. else if a duplicate state-hash reached then
When running across a real network, Choose simply break makes a call to drand48(). While the model checker if step > maxstep then performs a search, Choose deterministically returns the return I NDETERMINATE S TATE F OUND next number in the search sequence by picking a number from the input range. The Choose implementations used by the model checker internally record the sequence
M ACE MC searches execution paths in a loop that eiof values returned during the search and random walk. ther runs until it finds an error or has searched a specified
By saving this sequence, M ACE MC can track which se- number of paths. Each iteration of the search loop exequences to subsequently explore and can replay a given cutes the pseudo-code in Algorithm 1. Each system node execution path by pre-loading Choose to return a speci- runs as a separate object in the single model checker profied number sequence. cess. M ACE programs are fundamentally event-driven, as4

sociating a block of code with each event. Event handling code must run to completion without blocking. For instance, an application wishing to send a message would queue the message with the M ACE transport rather than making a blocking socket call. Similarly, the M ACE transport periodically selects available sockets to ensure non-blocking I/O calls. Thus, at each simulator step, after calling Choose to select a node, event pair from the set of nodes with pending events, M ACE MC executes the code associated with the selected event on the selected node. The event dispatch loop operates in the same manner when running the code live: competing threads determine the next pending event by acquiring a lock and invoke the appropriate handler.
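As a concrete illustration of this record-and-replay discipline, the following is a minimal sketch of a Choose implementation (the class and member names are ours, not MaceMC's), assuming the checker re-runs the system from its initial state for each path:

#include <cstdlib>
#include <vector>

// Minimal sketch of a record/replay Choose. In search mode, fresh
// decisions would come from systematic enumeration; shown here is the
// random walk mode, where fresh decisions are random.
class ChoiceSequence {
  std::vector<int> prefix;    // decisions pre-loaded for replay
  std::vector<int> recorded;  // every decision made on this path
  size_t next = 0;            // position within the replay prefix

public:
  explicit ChoiceSequence(std::vector<int> replay = {})
      : prefix(std::move(replay)) {}

  // Return a value in [0, n): replay the prefix first, then go random.
  int Choose(int n) {
    int v = (next < prefix.size()) ? prefix[next++] : std::rand() % n;
    recorded.push_back(v);  // the full sequence identifies this path
    return v;
  }

  const std::vector<int>& path() const { return recorded; }
};

Because every source of non-determinism flows through Choose, replaying the recorded sequence deterministically reproduces the same execution, which is the property the stateless search in §3.2 depends on.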
A buggy MACE event handler may not complete in a timely manner. For example, it may deadlock or enter an infinite loop. MaceMC sets a separate timer before invoking each event handler. If the timer fires before the event handler completes, we signal a divergence: a liveness violation where the system enters a state in which it cannot make forward progress because a single node remains stuck in an event handler.
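A divergence monitor of this kind might be sketched as follows (the function name and timeout handling are our simplifications; MaceMC's actual monitor thread is signaled once per simulator step, as in Algorithm 1):

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical divergence watchdog: the simulator runs the event
// handler while a watchdog thread waits on a timer.
void runHandlerWithWatchdog(const std::function<void()>& handler,
                            std::chrono::milliseconds limit) {
  std::mutex m;
  std::condition_variable cv;
  std::atomic<bool> done{false};

  std::thread watchdog([&] {
    std::unique_lock<std::mutex> lk(m);
    if (!cv.wait_for(lk, limit, [&] { return done.load(); })) {
      // Timer fired before the handler completed: flag a divergence.
      std::fprintf(stderr, "divergence: event handler stuck\n");
    }
  });

  handler();  // may loop forever in a buggy protocol implementation
  done = true;
  cv.notify_one();
  watchdog.join();
}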

3.2 Mitigating State Explosion

Structured Transitions. The event-driven, non-blocking nature of MACE code significantly simplifies the task of model checking a MACE implementation and improves the effectiveness of the technique. In the worst case, a model checker would have to check all possible interleavings of the assembler instructions across nodes with pending events, which would make it impractical to explore more than a few hundred lines of code across a small number of nodes. Non-blocking, atomic event handlers in MACE allow MaceMC to use these code blocks as the fundamental unit of execution. Once a given code block runs to completion, we return control to MaceMC. At this point, MaceMC checks for violations of any safety or liveness conditions based on global system state.

Checking for a safety property violation simply entails checking whether the state fails to meet the safety condition. In §4 we describe how MaceMC finds liveness violations. Upon finding a violation, MaceMC outputs the sequence of events that led to the violation and then terminates. The programmer can then map this sequence to an ordered list of events across all nodes in the system to diagnose the bug. In §5 we describe MDB, a tool that allows the programmer to step through the execution of the distributed system and to visualize the interaction of system-wide events.

State Hashing. When the code associated with a particular event handler completes without a violation, MaceMC calculates a hash of the resulting system state. This state consists of the concatenation of the values of all per-node state variables and the contents of all pending, system-wide events. The programmer may optionally annotate MACE code to ignore the value of state variables believed not to contribute meaningfully to the uniqueness of global system state. MaceMC checks the hash of a newly-entered state against all previously seen state hashes. If it finds a duplicate hash, it breaks out of the current path execution and begins the next path.
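In sketch form, assuming a serializeGlobalState() helper (our invention) that concatenates the per-node state variables and pending events described above, duplicate detection reduces to a hash-set membership test:

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

std::string serializeGlobalState();  // assumed to exist elsewhere

// Hypothetical sketch of duplicate-state detection via state hashing.
class StateHasher {
  std::unordered_set<uint64_t> seen;  // hashes of all explored states

public:
  // Returns true if this global state was already visited, in which
  // case the checker abandons the current path and starts the next.
  bool isDuplicate() {
    uint64_t h = std::hash<std::string>{}(serializeGlobalState());
    return !seen.insert(h).second;  // failed insert => duplicate hash
  }
};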
Stateless Search. Once MaceMC encounters a duplicate state or reaches its target maximum depth, it performs backtracking to exhaustively search all possible event orderings depth first. CMC stores the necessary program heap, stack, and register state for each intermediate state so that it can restore the program to an intermediate point for backtracking. MaceMC, similar to the approach taken by Verisoft, performs backtracking by re-executing the system with the sequence of choices (returned by Choose) used to reach an earlier state, before choosing a different, previously unchosen number at the appropriate point. For example, to backtrack from the system state characterized by the sequence ⟨0, 4, 0⟩ to a subsequent system state characterized by choosing the sequence ⟨0, 4, 1⟩, MaceMC reruns the system from its initial state, re-executing the event handlers that correspond to choosing events 0 and 4 before moving to a different portion of the state space by choosing the event associated with value 1.

For re-execution-based backtracking to work, the global state of the model-checked system must be uniquely determined by the sequence of numbers chosen by MaceMC to pick pending events. At any point, rerunning the system from its initial state with the same sequence of chosen numbers should reach the same deterministic global state. This approach has the disadvantage of additional CPU overhead for re-executing previously explored system states. However, it has the advantage of being simpler to implement and not requiring the model checker to store all of the state (stack, heap, registers) necessary to restore the program to an intermediate point.
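The resulting search can be sketched as a depth-first enumeration of decision sequences, assuming a hypothetical helper numOptionsAfter() that re-executes the system from its initial state, replays the given sequence, and reports the fan-out at the next choice point; duplicate-state pruning and property checks are omitted for brevity:

#include <vector>

// Assumed helper: re-run the system, replay seq, and return how many
// options exist at the next decision point (0 if the path has ended).
int numOptionsAfter(const std::vector<int>& seq);

// Stateless depth-first search by re-execution: each recursive call
// extends the decision sequence, e.g. <0,4,0> and then <0,4,1>.
void dfs(std::vector<int>& seq, size_t maxDepth) {
  if (seq.size() == maxDepth) return;  // depth bound reached
  int options = numOptionsAfter(seq);  // one full re-execution
  for (int v = 0; v < options; ++v) {
    seq.push_back(v);    // descend into the v-th choice
    dfs(seq, maxDepth);  // re-executes this prefix on each call
    seq.pop_back();      // backtrack by shrinking the sequence
  }
}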
We have found trading additional CPU for memory in this manner to be worthwhile. Anecdotally, CMC runs out of memory before it runs out of CPU time [27]. MaceMC uses only tens to hundreds of megabytes of memory to explore tens of millions of paths. CPU time has not yet proven to be a limitation in isolating bugs with MaceMC.

3.3 Managing Time
Model checking concurrent systems requires paying special attention to how system behavior might be affected by time. Hence, MaceMC requires a mechanism to represent time abstractly enough to permit exhaustive exploration, yet concretely enough to only analyze feasible executions. In addition, MaceMC requires that executions be deterministically replayable by supplying an identical sequence of chosen numbers for all non-deterministic operations, including calls to gettimeofday.
Existing software model checkers [15, 27] return a random, constant, or monotonically increasing value for all calls to gettimeofday. Unfortunately, these approaches result in both incorrect pruning of valid states and unnecessary exploration of states that cannot occur in practice. One straightforward and correct option would replace all calls to gettimeofday with a call to Choose to generate an arbitrary 64-bit integer. With this approach, however, the size of the state space would explode early in the search, making bug finding impossible.
We observed that most systems tend to use time in one of two ways: i) to manage the passage of real time, e.g., to compare two timestamps to determine whether a timeout should occur, or ii) to export the equivalent of monotonically increasing sequence numbers, e.g., to uniquely order messages originating from a single node. Given these observations, we address the problem of managing time at the language level by introducing two new MACE object primitives: MaceTime and MonotoneTime. The programmer uses these constructs both to obtain and to compare time values. When running across a real network, both objects are wrappers around gettimeofday. However, the model checker uses special handling for these constructs. MaceMC treats every comparison between MaceTime objects as a choice point and explores both true and false branches. For MonotoneTime objects, MaceMC maintains sufficient state to order values returned by successive calls, enabling it to perform the comparisons precisely and thus explore only feasible executions.
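The following sketch suggests how the two primitives might behave under the model checker (type and member names are ours; the real MaceTime and MonotoneTime are MACE language constructs):

#include <cstdint>

extern int Choose(int n);  // the checker's non-determinism interface

// Sketch: under the checker, each MaceTime comparison is a choice
// point, so both outcomes of the comparison are explored.
struct SimMaceTime {
  friend bool operator<(const SimMaceTime&, const SimMaceTime&) {
    return Choose(2) == 1;  // explore both true and false branches
  }
};

// Sketch: MonotoneTime values are ordered by a counter, so comparisons
// are precise and only feasible executions are explored.
struct SimMonotoneTime {
  uint64_t stamp;
  static SimMonotoneTime now() {
    static uint64_t counter = 0;  // successive reads strictly increase
    return SimMonotoneTime{counter++};
  }
  friend bool operator<(SimMonotoneTime a, SimMonotoneTime b) {
    return a.stamp < b.stamp;
  }
};

// When running live, both types would simply wrap gettimeofday().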

4 Liveness Violations

Many high-level properties of distributed systems are liveness properties that stipulate eventual desirable and consistent system configurations. Finding liveness violations poses the challenge that any single state may temporarily be inconsistent; the system may spend some number of steps repairing itself to a desirable configuration.

The classical definition of a liveness violation is an infinite execution along which the system never reaches a desired consistent state. Typically, we restrict violations to those infinite sequences that satisfy fairness constraints, which stipulate, for example, that along the execution the network does not drop all packets, or that a node gets scheduled infinitely often. To find such violations, we would have to run our checker to infinity, because the standard method of finding non-live fair cycles, applicable in the finite-state case, does not apply to infinite-state systems implementations. Faced with this challenge, one insight behind our work is that we can restrict our attention to the subset of liveness violations where the system enters a dead state from which no subsequent bounded execution satisfies liveness conditions. Our experience shows that there are many interesting liveness bugs in this subset and that the rate of false positives in reported liveness violations is low (no reported false positives thus far).

Thus, we find liveness violations by finding executions that reach a dead state. Unfortunately, it remains difficult to tell whether a state is dead: most indeterminate states where the liveness condition does not hold eventually converge to a live state. Figure 2 illustrates the depth necessary to reach the first live state along an execution. We ran MaceMC over three systems using random walks to sample the state space beyond an exhaustive search to 15 steps. The figure plots the fraction of paths that reached the first live state at a given depth. MaceMC found that only 60% of paths reached a live state for MaceTransport after considering 50 steps (the edge of what can be exhaustively searched using state-of-the-art model checkers), less than 1% of paths for RandTree, and none of the paths for PASTRY.

Figure 2: CDF of simulator steps to a live state at a search depth of 15, for MaceTransport, RandTree, and Pastry. Note that this only includes those paths which were not duplicates at that search depth, and so likely does not reflect the distribution of all paths at this depth.

The remainder of this section details our algorithm for finding liveness violations. At a high level, we exhaustively search up to some depth and then determine which of the indeterminate states at the periphery of the search are dead. From each indeterminate state, MaceMC performs a random walk to a much higher maximum depth (e.g., tens to hundreds of thousands of events). If the system recovers from the liveness violation before reaching the maximum depth, we conclude that the indeterminate state would eventually recover and mark it as live.
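In outline, and assuming a hypothetical runRandomWalkFrom() helper that replays a periphery state's decision sequence and then walks randomly, this classification step looks like:

#include <vector>

// Assumed helper: replay 'path' to reach the periphery state, then take
// random steps up to dmax further events; returns true as soon as all
// liveness conditions hold.
bool runRandomWalkFrom(const std::vector<int>& path, int dmax);

enum class Verdict { Live, SuspectedDead };

// A periphery state is reported as a potential violation only if k
// independent walks all fail to reach liveness.
Verdict classify(const std::vector<int>& path, int dmax, int k) {
  for (int i = 0; i < k; ++i)
    if (runRandomWalkFrom(path, dmax))
      return Verdict::Live;        // recovered: indeterminate, not dead
  return Verdict::SuspectedDead;   // candidate liveness violation
}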

4.1 Finding the Critical Transition

If MaceMC reaches the maximum random walk depth without entering a live state, then the path meets one of two cases:

C1 The path in fact hit a dead state, and the system will never recover. The path should be brought to the attention of the programmer to locate and fix the error.

C2 MaceMC did not perform a random walk long enough to find an event ordering that would allow the system to recover, i.e., MaceMC must search deeper into the state space to disambiguate between an indeterminate, dead, and live state.

Before discussing how we distinguish between the two cases, let us assume that we know MaceMC has found a dead state (C1). The programmer now faces the daunting and time-consuming task of wading through a path consisting of tens of thousands of events to isolate the protocol or implementation error that permanently transitioned the system to a dead state. Recall that while the system may enter an indeterminate state early, typically a much later critical transition finally pushes the system into a dead state. After attempting this process manually for a number of errors, we set out to develop an algorithm to automatically locate the critical transition. Importantly, this same procedure also identifies, with high probability, the second case that requires more investigation and the third case that represents a false positive.

Algorithm 2 shows our two-phase method for isolating the critical transition. It takes as input the path p from the initial random walk, which from step dinit onwards never reached a live state even after executing to the maximum depth dmax. The function Recovers(p, i, k) performs k random walks starting from the ith state on the path p out to the depth dmax and returns TRUE if any of these walks hit a live state, indicating that the ith state eventually recovers (and hence is not dead), and FALSE otherwise, indicating that (with high probability) the ith state is dead.

In the first phase, MaceMC doubles dcurr until Recovers indicates that dcurr is dead. dcurr then places an upper bound on the critical transition, and the known live state dprev serves as a lower bound. In the second phase, MaceMC performs a binary search using Recovers to find the critical transition as the first dead state dcrit between dprev and dcurr (with high probability). If we perform k random walks from each state along the path, then the above procedure takes time O(k · dmax · log dcrit).

Algorithm 2 FindCriticalTransition
  Input: Path p non-live from step dinit to length dmax
  Input: Number of random walks k
  Output: Critical transition dcrit, Case C1 or C2
  {Phase 1: Exponential Search}
  dcurr = dinit/2
  flag = FALSE
  repeat
    dprev = dcurr
    dcurr = 2 × dcurr
    flag = Recovers(p, dcurr, k)
  until not flag or dcurr > dmax/2
  if dcurr > dmax/2 then return (dcurr, C2)
  if dcurr = dinit then return (dcurr, C2)
  {Phase 2: Binary Search}
  loop
    dcrit = (dprev + dcurr)/2
    if dcrit = dcurr then return (dcrit, C1)
    if Recovers(p, dcrit, k) then dprev = dcrit
    else dcurr = dcrit

Figure 3: Illustration of the FindCriticalTransition algorithm with k = 3. The solid line at the bottom is the path p, which never reached a live state from steps dinit through dmax. Steps E1–E5 correspond to states picked in the exponential search phase, and steps B1–B3 show states picked during the binary search phase. From each picked state, we perform up to k random walks to a depth of dmax; a walk terminating before dmax reached a live state, shown as a black box. We mark the sample critical transition with the gray bar at Step 12.

Figure 3 depicts the operation of the algorithm. Searches that terminate before reaching dmax in a live state indicate that the critical transition must lie further down the target path. The critical transition took place between B3 and B1 in this example. A random walk from B3 encountered a live state, whereas all random walks from B1 terminated without achieving liveness.

In addition to the full path that left the system in a dead state at dcrit, we also present to the programmer the event sequence that shares the longest common prefix with the dead path but wound up in a live state (e.g., a random walk from B3 that achieved liveness in Figure 3). In our experience, the combination of knowing the critical transition and comparing it to a similar path that achieves liveness is invaluable in finding the actual error.

Two interesting corner cases arise in the FindCriticalTransition algorithm. The first case occurs when dcurr > dmax/2. In this case, because the critical transition does not appear early enough, the system was not given enough opportunity to recover
during the random walk. If there was a critical transition, it must happen after dmax/4. To be confident that the path in fact hit a dead state, we want to ensure that the system is given at least 3/4 × dmax steps to recover; we therefore deduce that case C2 holds, raise dmax, and repeat the random walks. In the second case, we find no live paths even when dcurr = dinit: either the critical transition is at dinit (the initial indeterminate state), or, more likely, we did not set dmax high enough. The programmer can typically determine with ease whether the system condition at dinit contains a bug. If not, once again we conclude that case C2 holds, and raise dmax and repeat.
If FindCriticalTransition returns the pair (dcrit, C1), the programmer knows that the path with high likelihood contains a liveness violation and that the critical transition is at position dcrit. Thus, the same technique that assists the programmer in isolating the event that likely caused the liveness violation simultaneously assists in eliminating most false positives. While our approach is probabilistic, to date we have had no false positives and have always successfully isolated the critical transition.
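For concreteness, Algorithm 2 might be rendered in C++ roughly as follows (Path and Recovers are assumed helpers; we use ceiling division for the binary-search midpoint so the loop terminates):

#include <utility>

struct Path;  // opaque handle for the stored decision sequence
bool Recovers(const Path& p, int i, int k);  // assumed: k walks from state i
enum class Case { C1, C2 };

std::pair<int, Case> findCriticalTransition(const Path& p, int dinit,
                                            int dmax, int k) {
  // Phase 1: exponential search for the first depth that fails to recover.
  int dcurr = dinit / 2, dprev = dcurr;
  bool recovers;
  do {
    dprev = dcurr;
    dcurr = 2 * dcurr;
    recovers = Recovers(p, dcurr, k);
  } while (recovers && dcurr <= dmax / 2);
  if (dcurr > dmax / 2) return {dcurr, Case::C2};  // walk too short: raise dmax
  if (dcurr == dinit)   return {dcurr, Case::C2};  // no live walks even at dinit

  // Phase 2: binary search between the last recovering state (dprev)
  // and the first non-recovering state (dcurr).
  for (;;) {
    int dcrit = (dprev + dcurr + 1) / 2;  // ceiling, so the interval shrinks
    if (dcrit == dcurr) return {dcrit, Case::C1};  // first dead state found
    if (Recovers(p, dcrit, k)) dprev = dcrit;
    else                       dcurr = dcrit;
  }
}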

4.2 Random Walks

The function Recovers performs long random walks to ascertain whether a given indeterminate state can in fact recover to a live state. If none of the k walks reaches a live state, we conclude that the initial state is dead. The intuition behind this conclusion is that in any robust system, if recovery is possible, then the recovery will complete in a relatively short sequence of operations. By performing a long random walk, we give the system ample opportunity to perform the short sequence of events leading to recovery. In the course of a long walk, with high probability any such sequence would be executed; thus, if the system never becomes live, we deduce it is likely that no such sequence exists, i.e., we started with or hit a dead state.

To perform this walk, we use the procedure MaceMC Simulator (Algorithm 1) in a mode where the Choose function returns a randomly chosen input value. We found that choosing among the set of all possible actions with equal probability had two undesirable consequences. First, the returned error paths had unlikely event sequences that obfuscated the real cause of the violation. For example, we had a sequence where the same timer fired seven times in a row with no intervening events, a sequence that could never occur in reality. Second, these sequences of unlikely events slowed the system's progress, and the random walks took much longer to reach live states. We responded by setting dmax to a large value to ensure that we had allowed the system enough time to recover, which in turn slowed FindCriticalTransition by at least a factor of ten in practice.

To avoid these consequences, we modified Choose to take a set of weights corresponding to the rough likelihood of each event occurring in practice; Choose then returns an event chosen randomly with the corresponding probability. For example, we may prioritize application events higher than message arrivals, and message arrivals higher than timers firing. In this way, we bias the event sequences searched in the random walk, with the hope of reaching a live state sooner, if possible, and making the error paths easier to understand. We have found that our biasing reduces the dmax required to reach a live state by a factor of 10 to 100 and correspondingly reduces the search time to find bugs.
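A biased Choose of this kind can be sketched with a discrete distribution over programmer-supplied weights (the function is illustrative; the weights in the comment are hypothetical):

#include <random>
#include <vector>

// Sketch of a weighted Choose for biased random walks. The weights are
// rough relative likelihoods supplied by the programmer, e.g. 5 for an
// application event, 3 for a message delivery, 1 for a timer firing.
int chooseWeighted(const std::vector<double>& weights, std::mt19937& rng) {
  // discrete_distribution normalizes the weights internally.
  std::discrete_distribution<int> dist(weights.begin(), weights.end());
  return dist(rng);  // index of the pending event to simulate next
}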
Resolving whether an indeterminate state is in fact dead requires allowing sufficient time to determine whether the system will eventually resolve the violation or whether it is stuck in a dead state. Figure 4 shows that the average depth before reaching a live state from an indeterminate state using unbiased random walks is larger than with appropriately biased walks. This figure plots the cumulative distribution of steps to live paths for a simple 2-node overlay tree (see §6 for protocol details).

Figure 4: Compared performance of random walks with and without biasing (fraction of paths vs. simulator steps to liveness; RandTree, 8 nodes, with failures, search depth 15).

Biasing the random walks toward common sequences may run counter to the intuition that a key advantage of model checking is the ability to push the system into corner conditions difficult to predict or reason about. However, recall that we run random walks only after performing exhaustive searches to a certain depth. Thus, the states reached by the periphery of the exhaustive search encompass many of these tricky corner cases, and the system has already started on a path leading to, or has even entered, a dead state. One downside to this approach is that it leaves the programmer to set the relative probabilistic weights for competing event types. These settings may be difficult to match to realistic deployments and in fact would change from one system to another. In our experience, however, there has been a straightforward rough weighting of the relative probability of certain events. Further, the reductions in average depth before transitioning to a live state and the simplicity of the suspect paths have been worthwhile. If setting the weights proves challenging for a particular system, it is straightforward to run MaceMC with unbiased random walks.

Further, a particular weight assignment may in fact bias the simulator away from a sequence of events, however unlikely, that would be able to transition the system back to a live state. In this case, MaceMC would report false positives because it would confirm the presence of a liveness violation that could have been invalidated by pure random walks. However, by setting dmax to be large enough, these false positives can be eliminated.

Fairness. Liveness properties are traditionally checked over fair system executions. For example, in a messaging system where we wish to check that all messages eventually get ACKed, the liveness property may be violated along an execution in which the receiving node never gets scheduled, and thus never gets to send its ACK, or along another execution where the network drops every ACK packet. Classically [9], the user prohibits the model checker from considering such executions by specifying fairness constraints that force the model checker to consider only those executions where each node gets to execute infinitely often and messages are not dropped infinitely often. These constraints critically impact the model checker's ability to avoid false positives, but are hard to specify for complex systems. Furthermore, the algorithms for pruning unfair executions only work in the finite-state case.

Our random walk-based approach provides a key benefit by eliminating the need to explicitly specify the fairness constraints for liveness properties to hold, but nevertheless, with high probability, restricts the checker to consider only those executions that meet the fairness constraints. We achieve this by running the walk long enough to ensure that the probability of an unfair execution drops to zero, as over time all possible transitions get selected often enough with high probability. In other words, in a long walk, the probability that the receiver node never gets selected drops to zero, as does the probability that all sent ACKs are lost. As a result, the random walks ensure that MaceMC finds violations that satisfy the implicit fairness constraints and thus correspond to genuine liveness bugs.

4.3 Prefix-based Search

Starting the search from some initial global state suffers from the disadvantage that MaceMC may not search significantly past the point where all nodes initially join the system. Further, failures in these initial events do not have the opportunity to exercise "bad" system conditions, because these failures simply look like an aborted join operation (e.g., resulting from dropped messages) followed by a retry.

To find violations in steady-state system operation, we take the following approach. We run MaceMC sufficiently long to output a number of live paths of sufficient length, i.e., paths where all liveness conditions have been satisfied, all nodes have joined, and the system has entered steady-state operation. Often, these paths require tens to hundreds of initial events to satisfy liveness conditions. To speed the process of reaching an initial good state when replaying a prefix prior to searching, we may optionally disable node failures to more quickly reach the end of the appropriate prefix. We then proceed as normal from this prefix, with exhaustive searches for safety violations followed by random walks from the perimeter to verify and isolate liveness violations. In the search phase, we re-enable node failures, which allows us to consider the effect of such interactions during normal system operation. In §6 we describe a bug found using prefix-based search.

5 MaceMC Debugger

The MaceMC Debugger (MDB) assists the programmer in diagnosing and locating bugs through event graph visualizations and interactive log analyses. While searching for violations, MaceMC runs with all system logging disabled for maximum efficiency. Once it discovers a violation, MaceMC automatically replays the path with full logging enabled. The resulting log consists of annotations: i) written by the programmer, ii) generated automatically by the MACE compiler marking the beginning and end of each transition, iii) produced by the simulator runtime libraries, such as timer scheduling and message queuing and delivery, and iv) generated by the simulator to track the progress of the run, including random number requests and results, the node simulated at each step, and the state of the entire system after each step. For our runs, logs can consist of millions of entries comprising hundreds to thousands of megabytes of data.

While MaceMC automatically finds the critical transition, the developer must still understand the sequence of events that left the system in a particular state. This previously involved manual inspection of the log files and hand-drawn sketches of evolving system state. To simplify this process, we built MDB to support interactive analysis and visualization of system state across individual nodes and transitions, with the added benefit of both forward and backward stepping through a given run. MDB allows the programmer to: i) perform single-step system execution both forward and backward, ii) jump to a particular step,

iii) select a specific node and step through events only for that node, iv) list all the steps where a particular event occurred, v) filter the log based on regular expressions, and vi) diff the states between two steps, or between the same step across different runs, by comparing against a second, similar log file. To further assist in the debugging process, MDB also generates an event graph highlighting the communication among nodes during the erroneous path. It orders the graph by nodes on the x-axis and simulator steps on the y-axis. Each entry in the graph represents a simulated event, including the transition call stack and all message fields. Directional arrows represent message transmissions, and other visual cues highlight dropped messages, node failures, etc.

To demonstrate the utility of our debugging tools for diagnosing and fixing errors, we consider a case study with a relatively simple bug in MaceTransport, a reliable, in-order message delivery service with duplicate suppression and TCP-friendly congestion control built over UDP. Unlike TCP, it is fundamentally message- rather than stream-oriented, making it a better match to a number of higher-level application semantics. To obtain lower-latency communication, MaceTransport avoids a three-way handshake to establish initial sequence numbers. A key high-level liveness property for MaceTransport is that every receiver should eventually ACK every message sent on all its open connections.

MaceMC found an execution where a sender attempts to send two messages to a receiver. MaceTransport assigns each message a sequence number. Figure 5 shows a pictorial version of the event graphs automatically generated by MDB; the actual event graph is text-based for convenience and contains more detail. MDB generates similar event graphs for any target log file.

Figure 5: Automatically generated event graph for the MaceTransport liveness bug.

In Step 1, the sender sends a data packet with sequence number 2001 and the SYN flag set. In Step 2, the retransmission timer causes the connection to close, and MaceTransport signals an error to the application. The application responds by attempting to resend the packet, and MaceTransport attempts to open a new connection with sequence number 6001. At this point, the old "SYN 2001" as well as the new "SYN 6001" packets are in flight. In Step 3, the network delivers the packet for the new 6001 connection, and the receiver replies by sending an "ACK 6001" message. In Step 4, the network delivers the out-of-order "SYN 2001" message, and the receiver responds by closing the connection on 6001, thinking it to be stale, and opening a new incoming connection for 2001.

Unfortunately, in Step 5, the critical transition, the sender receives the "ACK 6001." Believing the 6000-sequence connection to be established, the sender transmits "DATA 6002," at odds with the receiver's view. From here on, the system is dead: the receiver keeps ignoring the "DATA 6002" packet, sending ACKs for the 2001 connection instead, while the sender continues to retransmit the "DATA 6002" packet, believing it to be the sequence number for the established connection.

We illustrate a portion of an MDB session analyzing this bug in Figure 6. We load MDB with the error log, jump to the critical transition step (5), and diff the state with the live path sharing the longest common prefix, as output by MaceMC while searching for the critical transition (see §4). The excerpt shows the state for the sender node. The key insight from this output is that in the live execution (lines indicated with +), the retransmission timer is scheduled with "SYN 6001", meaning that the packet could be retransmitted and the receiver could become resynchronized with the sender. Comparing the differences with the dead execution (lines indicated with −), where 6001 has been removed from the inflight map and timer because of the ACK, the cause of the bug becomes clear. We fix the problem by attaching a monotonically increasing identifier to the SYN packets, implemented using a timestamp. When the receiver gets the "SYN 2001" message out of order, it deduces from the timestamp that the message is stale and ignores it, allowing it to subsequently ACK the "DATA 6002" message.

6 Experiences

We have used MaceMC to find safety and liveness bugs in a variety of systems implemented in MACE, including a reliable transport protocol, an overlay tree, and PASTRY.

$ ./mdb error.log
(mdb 0) j 5
(mdb 5) filediff live.log
...
localaddress=2.0.0.1:10201
out=[
− OutgoingConnection(1.0.0.1:10201, connection=ConnectionInfo(cwnd=2, packetsSent=2, acksReceived=1, packetsRetransmitted=0),
−   inflight=[ 6002 → MessageInfo(seq=6002, syn=0, retries=0, timeout=true) ],
−   rtbuf=[ ], sendbuf=[ ], curseq=6002, dupacks=0, last=6001)
+ OutgoingConnection(1.0.0.1:10201, connection=ConnectionInfo(cwnd=1, packetsSent=1, acksReceived=0, packetsRetransmitted=0),
+   inflight=[ 6001 → MessageInfo(seq=6001, syn=1, retries=0, timeout=true) ],
+   rtbuf=[ ], sendbuf=[ MessageInfo(seq=6002, syn=0, timer=0, retries=0, timeout=true) ], curseq=6002, dupacks=0, last=0)
]
in=[ ]
−timer([dest=1.0.0.1:10201, msg=MessageInfo(seq=6002, syn=0, retries=0, timeout=true)])
+timer([dest=1.0.0.1:10201, msg=MessageInfo(seq=6001, syn=1, retries=0, timeout=true)])
...

Figure 6: MDB session. Lines with differences are marked (− indicates the error log, + the live log). Node 0 has IP address 1.0.0.1 and Node 1 has 2.0.0.1.
System          Bugs   Liveness   Safety   LOC
MaceTransport     11       5         6     585/3200
RandTree          17      12         5     309/2000
Pastry             5       5         0     621/3300
Overcast           1       1         0     422/2400
BrokenTree         2       1         1     101/1300
Simulator          3       0         3     --
Totals            39      24        15     --

Table 1: Summary of bugs found for each system. LOC = lines of code and reflects both the MACE code size and the generated C++ code size.

Table 1 summarizes the bugs found with MaceMC to date. This includes 39 bugs found in 4 systems, 1 toy protocol, and simulator components. Spanning the three systems, the 39 bugs across 2000 lines of MACE code correspond to one bug for every 50 lines of code. MaceMC actually checks the generated C++ code, corresponding to one bug for every 250 lines of code. In the only comparable check of a complex distributed system, CMC found approximately one bug for every 300 lines of code in three versions of the AODV routing protocol [27]. Interestingly, more than 50% of the bugs found by CMC were memory handling errors (22/40 according to Table 4 of [27]), and all were safety violations. The fact that MaceMC finds nearly the same rate of errors while focusing on an entirely different class of liveness errors demonstrates the complementary nature of the bugs found by checking for safety versus liveness violations.

MaceMC found several deep bugs in each system that caused violations of high-level liveness properties. All of these violations are beyond the scope of existing software model checkers because the errors manifested themselves at depths far beyond what can be exhaustively searched. Beginning with an exhaustive search, MaceMC unearths potential dead states, and long biased random walks allow MaceMC to successfully distinguish the dead states from the indeterminate ones that could still recover.

MaceMC proved to be extremely valuable for finding bugs corresponding to corner cases due to unexpected event interleavings, e.g., where messages were dropped or delivered out of order, or where nodes failed at inconvenient times. These conditions are hard for developers both to reason about and to conveniently and exhaustively test on LANs. They do occur in practice when running across the wide area, but remain rare and difficult to reproduce. All of the tested systems are mature pieces of code run across real networks, for years in the case of RandTree. In a number of cases, we had seen symptoms of these bugs when running across PlanetLab but were unable to reproduce them because of the complex sequence of event orderings required to trigger the error.

PASTRY is a key-based routing protocol [30] where nodes form a ring with pointers to their successor and predecessor. PASTRY routes messages between nodes in log n hops (where n is the number of nodes in the network) by maintaining for each node a leafset of other nodes closest to it in the key space, and a routeset corresponding to a routing table of other strategically chosen nodes. In PASTRY, a node joins the network by locating its nearest neighbor in the key space. This neighbor sends the joining node sufficient information to insert itself into the overlay ring. PASTRY nodes remove other nodes from their leafset and routeset upon detecting a failure. Nodes maintain active connections to nodes in their leafset for quick failure detection. A lightweight periodic probing protocol detects failures of routeset nodes.

Property. An important correctness requirement is that all nodes should eventually be connected in a single ring. This is captured by requiring that the system eventually reach a state where, for each node, the set of nodes reached by transitively following the "successor" pointer includes all nodes in the system.
Violation. Using the prefix search mechanism described in §4.3, MaceMC found an execution where node A was unable to ever re-join the ring. We would not have been able to find this bug without our prefix technique, because it only manifests after the system has initially established liveness, and it is impractical to exhaustively search to the point where the system first achieves liveness, even for two-node PASTRY rings. At the end of the live prefix, the nodes were connected in a ring and A belonged to C's leafset. In our search from this prefix, A fails just after the ring forms. However, before C can detect this failure, it sends B a "LeafSetInfo" message informing B of the latter's proximity to A. This causes B to optimistically add A to its routeset. Upon restarting, A tries to rejoin the network. To do so, A attempts to find its nearest neighbor by sending a "Join" message to B. At this point, B uses its incorrectly refreshed routeset table to route the "Join" message back to A, as no node is closer to A's node identifier than A itself. Moreover, as A is now live and communicating over the network, B detects no further socket failures of A, which prevents it from removing A from its routeset. From this point on, A's "Join" messages are forever sent back to itself, preventing it from successfully joining the ring.
Bug. For this violation, the critical transition found by MACEMC involved B receiving the invalid "LeafsetInfo" message from C, sent before C detected A's failure.
This error was both sufficiently obscure and difficult to fix (attempted fixes triggered other errors flagged by MACEMC) that we decided to check the reference FREEPASTRY implementation [1] for their approach. Based on our manual inspection, their code suffers from the same problem. We have not verified the error with the authors. However, the following log entry in the latest version of the code (1.4.3) suggests that FREEPASTRY likely has a similar problem: "Dropped JoinRequest on rapid rejoin problem – There was a problem with nodes not being able quickly to rejoin if they used the same NodeId. Didn't find the cause of this bug, but can no longer reproduce."
This violation shows MACEMC's ability to find not just implementation issues but cases corresponding to underspecification in a protocol. The fact that two independent implementations of the same protocol omit handling a rather obscure corner case speaks to the benefits of model checking protocol implementations for liveness. Finally, there is a safety property that could have detected the same error (e.g., a join request should never be routed to a node that is not a member of the ring). However, we feel that the high-level liveness property is more natural, able to capture a variety of specific error conditions, and more likely to be expressed independently by the developer.
Fix. As an initial fix for the problem, we performed special processing for Join messages. In particular, we disallowed an exact match from ever satisfying routing for a Join message. The goal of course is to prevent the message from being sent to the node initiating the join in the first place. However, re-running MACEMC using the same prefix search quickly found a second case where two nodes had reset, and each one's "Join" message was routed to the other. No unjoined node may bootstrap a joining node because this leads to other liveness violations previously discovered by the model checker. In this case, the critical transition involves receiving an update message causing the two nodes to be added to the routing table of other live nodes.
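A minimal sketch of this first fix, assuming hypothetical routing helpers (nextHop, nextClosestHop, and the Join tag are illustrative names, not the actual MACE PASTRY code):

    #include <cstdint>

    enum class MsgType { Join, Data };
    struct NodeId { uint64_t key; };

    // Hypothetical helpers standing in for the PASTRY routing table:
    NodeId nextHop(NodeId key);         // node responsible for 'key'
    NodeId nextClosestHop(NodeId key);  // best hop excluding an exact match

    // The initial fix sketched: an exact identifier match may satisfy routing
    // for ordinary messages, but never for a Join, so a "Join" message cannot
    // be routed straight back to the (restarted) node that initiated it.
    NodeId routeFor(MsgType type, NodeId key) {
        NodeId hop = nextHop(key);
        if (type == MsgType::Join && hop.key == key.key) {
            hop = nextClosestHop(key);  // refuse the exact match
        }
        return hop;
    }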
RANDTREE is an implementation of a random overlay tree with a maximum out-degree that is resilient to node failures and can recover from network partitions. This tree forms the backbone for a number of higher-level aggregation and gossip services, including our implementations of Bullet [24] and RanSub [23]. We have run RANDTREE across emulated and real wide-area networks for three years, working out most of the initial protocol errors in the process. Thus, we consider the system we started with for model checking to be relatively mature and robust.

RANDTREE nodes connect to the network by sending a "Join" message to a bootstrap node, who in turn forwards the request up the tree to the root. Each node then forwards the request randomly down the tree to find a node with available capacity to take on a new child. The new parent adds the requesting node to its child-set and creates a new TCP connection to the child. A "JoinReply" message from parent to child confirms the new relationship.
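The join-forwarding logic described above might look roughly as follows. This is a sketch under stated assumptions: handleJoin, maxChildren, and the messaging helpers are illustrative names, not the actual RANDTREE source:

    #include <cstddef>
    #include <cstdlib>
    #include <iterator>
    #include <set>

    struct Address {
        unsigned ip;
        bool operator<(const Address& o) const { return ip < o.ip; }
    };

    struct TreeNode {
        std::set<Address> children;
        std::size_t maxChildren = 12;  // illustrative out-degree bound

        void sendJoinReply(Address to);                 // hypothetical helper
        void forwardJoin(Address next, Address joiner); // hypothetical helper

        // On receiving a Join for 'joiner': accept if there is capacity,
        // otherwise push the request randomly down the tree.
        void handleJoin(Address joiner) {
            if (children.size() < maxChildren) {
                children.insert(joiner);  // the new parent records the child
                sendJoinReply(joiner);    // confirm the relationship
            } else {
                auto it = children.begin();
                std::advance(it, std::rand() % children.size());
                forwardJoin(*it, joiner); // random downward walk
            }
        }
    };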
Property. A critical high-level liveness property for RANDTREE (and other overlay tree implementations) is that all nodes should eventually become part of a single spanning tree.
Violation. MACEMC found a liveness violation after several thousand steps. Here, the system permanently enters a dead configuration where two nodes A, D have a node C in their child-set, even though C's parent pointer refers to D. Along the violating execution, C initially tries to join the tree under B, which forwards the request to A. After accepting C as a child, A adds it to its child-set, establishes a connection to C, and sends a "JoinReply" message. While the connection establishment (TCP SYN) is in flight, C experiences a node reset, losing all state. Upon return, C receives the connection and "JoinReply" from A, but ignores it. Node C then attempts to join the tree but this time is routed to D, who accepts C as a child and adds it to its child-set. Node A assumes that if the TCP socket to C does not break, the child has received the "JoinReply" message, and therefore does not perform any recovery. Thus, C forever remains in the child-sets of A and D.
Bug. The critical transition for this execution is the step where C receives the "JoinReply" from A. MDB reveals that upon receiving the message, C ignores the message completely, without sending a "Remove" message to A. Along the longest live alternate path found from the state prior to the critical transition, we find that instead of receiving A's join reply message, C gets a request from the higher-level application asking it to join the overlay network, which causes C to transition into a "joining" mode from its previous "init" mode. In this alternate path, C subsequently receives A's "JoinReply" message, and correctly handles it by sending A a "Remove" message. Thus, we deduced that the bug was in C's ignoring of "JoinReply" messages when it is in the "init" mode. We fix the problem by ensuring that a "Remove" reply is sent in this mode as well.

7 Bugs

We have applied MACEMC to five different systems and have found errors in all of them. The errors we describe are quite complex, and indicate both bugs in the implementations of the systems as well as holes in the specification of some well-known protocols like PASTRY. We now briefly describe each of these protocols and the bugs that MACEMC has found in them.

7.1 BrokenTree

The BROKENTREE protocol was devised as a gross simplification of the original OVERCAST [22] protocol. When published, this protocol's description did not account for the possibility of multiple nodes nearby in the overlay reconfiguring at the same time, and ending up forming a loop by simultaneously selecting parents in a cycle.

In this simplification, nodes join by sending a message to the root of the tree. The root is taken to be the node passed in by the application on the request to join the overlay (joinOverlay()), which it does after initializing the service (maceInit()). Upon being asked to forward the join message, any node will either accept the sender as a child or, if it has too many children, pick a random node to forward the join request to.

Additionally, all nodes periodically send a reorganization message to all of their children, containing a list of the children (their siblings) and its parent (their grandparent). When receiving this message, nodes randomly decide to reorganize. When they decide to reorganize, they randomly pick from their brothers and grandparent to re-join under. (This process mimics the probe and selection process used in real systems to reorganize into a tree optimized in some way.)

When testing this system, we ran it only testing for the safety property of no loops, and without allowing socket errors. We only ran it with 3 nodes. Two bugs were found in this toy system before we moved on to other systems:

BT1 The first bug found was a surprise. The routing API MACE uses (based on the API proposed by Dabek et al. [13]) requires that all nodes processing the message have their forward handlers called, including the source and destination. The forward handler on the local node was not checking to see if it was the sender of the message; as a result, all nodes were forming self-loops in the tree, and the system was therefore unable to form a tree.

BT2 This was the bug we expected to find. After initially forming the tree, the root sends a reorganization message to its children. Two children each decide to join under the other, and because each accepts the other while still under the root, a loop is formed. Note that while this simple case might have been avoided by not accepting as a child a node whom you have asked to join under, the same bug would appear with a longer loop and 4 nodes.

7.2 Overcast

The MACE implementation of the OVERCAST protocol follows most of the basic algorithm from BROKENTREE, with a few exceptions. For example, in OVERCAST, bandwidth probes are conducted in the form of sending a number of messages to another node to time how long it takes. Next, nodes move based on the result of the probe, where the criterion for moving is that a node rejoins under its grandparent if it gets sufficiently more bandwidth from the grandparent than it does from its current parent. Otherwise, the node rejoins under one of its siblings as long as the overall throughput from the root does not decrease sufficiently. Finally, to avoid the kind of bug found in BT2, the current OVERCAST implementation uses a kind of distributed lock: a node locks anyone it is going to probe, preventing them from probing other nodes or moving in the tree, which prevents the possibility of nodes forming a join cycle.

For the purposes of checking the implementation using MACEMC, the decision algorithms were modified to not look at the time. The probes, however, were still conducted as in the real system – just the times themselves were not used. Additionally, the state variables which were used to store the times of the probes were given a nodump attribute to tell the compiler to leave them out of state-space serialization and hashing.

In a similar setup to the BROKENTREE run, using 3 nodes and with socket errors turned on, MACEMC found a bug related to the new lock fix for the original bug. This bug was not a distributed loop (safety) property failure, but rather a failure to form a spanning tree.

    Feature/Bug           BT1           BT2
    Basic
      time                56ms          6.5h
      search depth        10            20
      actual depth        11            21
      paths searched      87            15M
    Hash 1 (/Hash 2)
      time                63ms/36ms     30s/4.5s
      search depth        10            20
      actual depth        11/9          21
      paths searched      55/27         18K/2.5K
    RW (/Bias)
      time                15ms          0.37s/1.9s
      search depth        5             10
      actual depth        14/22         25/42
      paths searched      0             445/2.6K
    RW & H1 (/H2)
      time                16ms          0.16s
      search depth        5             10
      actual depth        14            41/67
      paths searched      0             120/55
    RW & B & H1 (/H2)
      time                16ms          0.25s/0.14s
      search depth        5             10
      actual depth        22            27/68
      paths searched      0             213/78

Table 2: Model checking statistics for the two BrokenTree bugs found. Basic represents the basic state-space-searching compiler. 'RW' stands for random walk, 'B' for bias, 'H1' for collision-resistant state-space hashing, and 'H2' for collision-prone state-space hashing. Paths longer than the search depth without random walks indicate the application requested a random number when at the search depth, and a random number was provided to get back to the simulator loop. The results indicate that both hashing strategies can significantly improve the time it takes to find bugs, though random walks are frequently more effective at finding the bug sooner. The biased walks have longer paths and search times due to their tendency to avoid bugs. However, the biased paths are easier to understand and reason about.

    Feature             Unbiased RW   Biased RW
    No Hashing
      time                76s           51s
      search depth        15            10
      paths searched      16K           13K
    Hash 1
      time                29s           11s
      search depth        20            20
      paths searched      10K           5K
    Hash 2
      time                13s           46s
      search depth        20            25
      paths searched      5K            16K

Table 3: Model checking statistics for the Overcast bug found. 'RW' stands for random walk, 'Hash 1' for collision-resistant state-space hashing, and 'Hash 2' for collision-prone state-space hashing. Model checking configurations without random walks are not included here, since this bug is only seen through liveness testing. Additionally, actual depth is not shown, since all cases reached the simulator's maximum number of steps (in this case, about 25K random numbers were generated to reach 10K simulator steps). The results indicate that both hashing functions and both biased and unbiased random walks may be useful in different circumstances.

OV1 This bug is a result of the distributed lock code added to prevent loops. In this code, there is no timeout of the lock for the nodes being locked by another node considering moving, and no handling for socket errors or notice of buffer-full errors. As a result, if there is a socket failure or buffer-full error causing the loss of the unlock message, or if the node holding the lock were to fail outright, the locked nodes would forever remain locked. The liveness check in the model checker catches this by a path whose random walk doesn't terminate, even given ample time.

7.3 RandTree

RANDTREE is another tree implementation available in MACE. Unlike OVERCAST, which tries to form a bandwidth-optimized tree, RANDTREE simply builds any random tree. Its key features include robustness to failures, including failure of the root of the tree, and the ability to recover from network partitions by periodic probing of nodes passed into the tree on joinOverlay().

Unlike OVERCAST and BROKENTREE, it is perfectly valid for RANDTREE to temporarily form multiple different trees, since the root is elected rather than specified, and trees are expected to merge over time. Accordingly, a set of weaker, more generic tree checks were required for checking RANDTREE. Loops are still invalid, unless they are loops of a single node (roots of trees form self-loops), and it is still the case that, eventually, a single spanning tree should form.

The join procedure for RANDTREE is to contact a member of the peer set (if that times out, round-robin to the next member of the peer set). When contacted, a node will pass along a join request to the root of the tree. The root of the tree then either accepts the new child, or pushes it down if it doesn't have room. As a special case, if the joining node has a lower IP address than the root, the root will rejoin under that node instead, making it the new root. This mechanism is designed to let nodes deterministically agree on a root node.

Periodically, every node will probe members of its peer set (in round-robin fashion) to make sure they are in a tree which shares the same root. If the roots do not match, corrective action is taken, asking the larger-IP root to join under the smaller-IP root.

RT1 This bug was a divergence. A node's join timer fired, which normally causes it to initiate a new join request to the next member of the peer set; however, it skips over any members of the peer set which are its children. If the node has already accepted its entire peer set as children, but receives a join request from another node with a lower IP address, by the protocol it will attempt to join under that node. If that join times out, it then enters an infinite loop trying to find a suitable member of the peer set, but finding none, because it had already accepted all specified peers as children.
As a fix, in the join deliver handler, in addition to sending a join message to the new node, RANDTREE was modified to add that new node to the peer set.
RT2 When first written, a RANDTREE node, when asked for its parent, would return the null address if it was the root of the tree. Some time later, we modified the API to require that tree roots return their own address as their parent (a self-loop), to distinguish them from nodes which are in the process of joining and therefore have no parent. As a result, RANDTREE appeared to never form a spanning tree, since no node ever claimed to be the root by that API call.
RT3 The recovery timer (which is what periodically checks to make sure all potential peers share the same root, to merge trees) was being manually rescheduled when it fired. However, in the MACE specification for RANDTREE, the timer firing would only be processed when the node was in the joined state. So if the timer ever fired when the node was in the joining state, it would never again be scheduled. As a result, when peers connect to each other and form distinct trees, they would never merge, and RANDTREE would not form a single spanning tree.
RT4 When the root of a tree changes, a new-root message is propagated down the tree to inform peers of the new root. This root value is then used when comparing with probes from other nodes to see if they are members of the same tree. However, in cases where the join timer fired, and as a result multiple peers (temporarily) accepted the same node as a child, that child node may receive a new-root message from a node other than the one which would eventually become its "official" parent. No check was being performed on the new-root message, so the child would update its root field accordingly. As a result, when later being probed by members of the other tree, it would mistakenly believe they were all part of the same tree, preventing proper recovery.

RT5-9 Five more instances of RT1 were discovered. These were other cases where a RANDTREE node decided to join a node other than those listed in its peer set, and when they were all children, it reverted to the same infinite loop. Upon finding the second path with the same root cause, a find was done on the code to fix the remaining four cases without finding them independently.
RT10 If a node initially accepts another as a child, and then the child later asks to be removed, but the parent node receives the remove message while it is busy joining under another node, it will ignore the remove message. In this case, though the child node will no longer recognize it as a parent – the parent node will forever believe it is a child (short of a network failure or node departure), causing an inconsistent network view.
RT11 RT11 is a safety property violation. The property is that, for each node, either the node is in the RandTree init state, or the recovery timer is scheduled (exclusively). When a node has had maceInit called on it, and receives a probe message from another node, it may attempt to join that node, transitioning to the joining state. This effectively prevents the recovery timer from ever being scheduled. The fix was to ignore probe messages until a node is no longer in the init state.
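A sketch of this safety predicate, assuming illustrative accessors (state() and recoveryTimerScheduled() are hypothetical names; real MACE properties are written in its specification language rather than C++):

    enum class NodeState { Init, Joining, Joined };

    struct RandTreeNode {
        NodeState state() const;              // hypothetical accessor
        bool recoveryTimerScheduled() const;  // hypothetical accessor
    };

    // Safety property checked on every explored state: a node is either
    // still in init, or its recovery timer is scheduled -- exclusively.
    bool timersPropertyHolds(const RandTreeNode& n) {
        bool inInit = (n.state() == NodeState::Init);
        bool timerSet = n.recoveryTimerScheduled();
        return inInit != timerSet;  // exclusive-or of the two conditions
    }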
RT12 RT12 is a liveness bug. It was caused by a remove message being dropped by the sending transport because the transport buffer was full. Though the send error was immediately reported by the transport, no reaction was implemented or taken by the sending node. As a result, the recovery message was never received, and the child state never matched the parent state.

In isolating this bug, the most useful tool was the LastNail approach. It correctly reported that the problem was step 68. In the end, I had to go to the log, rather than the event graph, to discover the problem, because the graph did not report the send error and displayed only the attempt to send the remove message. This will change, and we will add message ids to the simulated network to give better feedback, both in tools and logging, about the status and outcome of each and every message handed to the network.
Two possible fixes for RandTree present themselves: (1) on an enqueue failure, call a method to close and reset the connection, removing the peer from our own state; (2) queue the send for later sending.
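The first option might look roughly like the following sketch (sendOrReset, closeAndReset, and removePeer are illustrative names, not the actual RANDTREE or transport API):

    #include <string>

    struct Transport {
        // Hypothetical enqueue: returns false when the send buffer is full.
        bool enqueue(const std::string& dest, const std::string& msg);
        void closeAndReset(const std::string& dest);
    };

    struct RandTreeService {
        Transport transport;
        void removePeer(const std::string& dest);  // drop peer from local state

        // Fix (1): instead of silently losing a message on a full buffer,
        // tear down the connection and forget the peer, so that the normal
        // failure-recovery path kicks in.
        void sendOrReset(const std::string& dest, const std::string& msg) {
            if (!transport.enqueue(dest, msg)) {
                transport.closeAndReset(dest);
                removePeer(dest);
            }
        }
    };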
The search for this bug was executed using our automated tool for running the search, followed by graph generation and last-nail search where appropriate. The search phase took 9.5 hours, searching 2.85 million paths, ending at a depth of 35. Last nail ran in 1.5 minutes.
Also noteworthy: though the last nail was random number 158 at a simulator step of 68, the first 35 random numbers were a significant contributor to finding this bug, since they caused a large number of messages to be queued between the source and destination, which at a higher depth caused the dropped message. I believe this would have been much harder to find without using exhaustive searching, because random walks will tend not to explore this kind of extremity as reliably.
RT13 RT13 is a new RandTree bug exercised by node failures. The specific liveness bug occurs when a node A sends a join message to another node B, who forwards the join to node C (since in RandTree, joins all get forwarded to the root of the tree). Then, before node C responds to node A, node A reboots. The socket from node A to B is broken, but that does not help node C, who has not yet opened its socket to A. If the timing is such that the response from node C to node A arrives between A being initialized and having joinOverlay called, the message is ignored, as A is in the init state. In this case, A joins a fourth node D, and the child pointer from C to A is never fixed. It is not clear whether the presence of the fourth node is necessary.
The bug was found in path number 383365, working on a search depth of 15 (about the 280000th path searched at depth 15). Last nail correctly identified the last nail as position 18, in which the joinReply from C to A is delivered. In the closest live path, instead, joinOverlay is called on A, preventing the problem.
RT14 RT14 is a safety bug. Specifically, it is a violation of the timers property, which asserts that at all times, if a node is not in the init state, its recovery timer is scheduled. As with RT11, this bug occurs as a result of transitioning out of the init state based on receiving a message. In this case, it was not the probe message, but instead the probe join message, which is sent to a root node to inform it of another root node to merge two trees. This error is not seen without node failures because a probe join message would only be sent about a live node. A node failure resets this, and so the bug is now found.
This bug was found as path number 1350198, at a search depth of 15, but an actual random depth of 22 (simulator depth of 21). Step 22/21 is also the LastNail step (though LastNail is not functional for property violations), because if instead joinOverlay had been called before receiving the message, things would have worked fine.
As a reflection, the timers property grew out of liveness bug RT3 and the observation that not having the recovery timer scheduled will cause liveness bugs when nodes form distinct trees and do not probe across trees to make sure they get merged.

RT15-17 RT15 is another safety bug. In this case, after forming a full tree of 6 nodes, the root node dies, followed by node 3. When node 2 gets the network error from node 0, it tries to join node 1 (the new root by minimum IP address). However, before node 1 responds, the join timer fires on node 2. It then tries to join node 3 (because it is looking round-robin in its joinSet); the newly restarted node 3 responds to node 2 with a response of FORCE ROOT, which means that node 3 will soon be joining under node 2, as 3 was a root but 2 has a smaller address. Node 2 asserts that the msg.root matches node 2's root value, which fails, because node 2 believes another node is the root. (Specifically, it believes node 0 is still the root, even though it is joining under node 1. I am not sure whether that itself is a bug – it's a highly complicated subject.) As an attempted fix, I am setting the root back to self on a join timer firing – since that falls back to old join methods, I decided it was appropriate at that time to reset the root back to self.

RT15 was found using the following mechanism.
First, I ran the model checker on RandTree with PRINT_SEARCH_PREFIX set to 1 and MAX_PATHS set to 20. This caused the modelchecker to output 20 path prefixes which led to live states. I then restarted the modelchecker, using prefix1.path as the search prefix. This searched 535 paths before finding this scenario, which required 4 nodes, two of which died (or at least one died, and the other suffered a socket error).
Once I had the error path, our tools created error.log and error.graph. I used mdb and the event graph to figure out why the assertion had fired. The event graph was the primary tool in observing what happened, while mdb was used to confirm details of theories formed in my head, to make sure I understood specifically what was happening. The whole process took less than 30 minutes.

RT16 and RT17 are also safety bugs, in the vein of RT15. The fix in RT15 is not sufficient to prevent cases where we get a force-root message and the roots do not match. Instead, we now reset the root any time our parent dies. This may prevent some shortcuts during tree recovery, but should be safer. In this 6-node case, several nodes died, and a node which did not die sent a message to another node, which eventually got forwarded through another couple of hops to a newly restarted node with a higher address, and a root force message was sent to the joiner. One key in this scenario is that the root itself has failed. This bug was found using the same prefix as RT15, but was found as path 38365, still at the search depth of 5. The appearance of these two bugs also suggests that we should write a safety property that ties together the joining state to the state of the root pointer.

7.4 Pastry
PASTRY [30] is a distributed algorithm for node location and overlay routing. Every node joining the overlay becomes responsible for receiving and handling messages destined for a range of a 160-bit integer space, localized around an identifier chosen for the node.¹ The node location and routing are determined using a decentralized protocol where each node keeps track of a logarithmic subset of the total set of nodes in the system.
Joining requires contacting a bootstrap peer and routing a set of messages towards the node presently managing the identifier space you are to take over, with responses to bootstrap the node's own state. Joining the overlay is only complete when the node receives routing table entries for each row of its routing table, and the leafset from its immediate neighbor. After the node has finished joining, it informs members of its leafset about its presence, and enters the normal operating mode of the protocol, wherein it performs periodic probes and updates to ensure that every node's routing table is correct.
MACEMC has found 5 bugs in our PASTRY implementation thus far:
PASTRY 1 The first PASTRY bug found was a combination of several factors, which led to the third node of the simulated 3 nodes being unable to join. First, a joining node sends a message to a bootstrap node, who forwards it on towards the present owner of the address. In addition to forwarding it on, each node along the path sends the joinee the rows of its routing table corresponding to the prefix it shares in common with the joinee. The final destination node sends a copy of all remaining routing table rows, as well as the leafset.

No reliability is applied to messages being sent; instead it is left to the joining node to time out the join and re-initiate. When it does so, the joining node sends the first row it needs, to prevent duplicate sending of rows already received. However, a bug in the forwarding node caused it to forward the wrong start row, instead always forwarding the number 1. Coupled with the fact that 40 row messages need to be received, and that a typical node will only buffer 20 messages at a time, each time the destination node received the Join message, it began with row 1, failing by row 21. As a result, the joining node could never successfully join the overlay network.
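The row bookkeeping at fault can be pictured with a small sketch (the JoinMsg layout, startRow field, and helpers are illustrative assumptions, not the actual implementation): the retried Join must carry the first row still missing, and every forwarder must propagate that value rather than a constant.

    #include <cstdint>

    // Illustrative message layout, not the actual PASTRY implementation.
    struct JoinMsg {
        uint64_t joinee;    // address of the joining node
        uint32_t startRow;  // first routing-table row the joinee still needs
    };

    uint64_t nextHop(uint64_t key);         // hypothetical routing helper
    void send(uint64_t dest, JoinMsg msg);  // hypothetical transport helper

    // Forwarding a retried Join: the buggy version rebuilt the message with
    // startRow = 1, so a joinee that already held rows 1..20 was resent the
    // same 20 rows each time and never progressed past its 20-message buffer.
    void forwardJoin(const JoinMsg& msg, uint64_t key) {
        JoinMsg out = msg;           // keep the joinee's startRow intact
        send(nextHop(key), out);
    }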
¹The identifier is frequently chosen by taking the SHA1 hash of the IP address or hostname, but the method of selecting node identifiers isn't so important as the fact that they are generally "evenly" dispersed in the 160-bit address space.


PASTRY 2 The second PASTRY bug illustrates a case where the PASTRY specification does not fully explain how to handle the situation. When a node briefly becomes disconnected from its peers (or more generally, a network partition), and all sockets between the nodes break, on each side of the partition all references to the other side are removed, with the intention of reestablishing peering relationships through standard periodic maintenance. However, in the case of a network partition, each partition “forgets” completely about the other one, and periodic maintenance does not resolve it.
As a result, in a 3-node system, MACEMC found a case where all three nodes form PASTRY rings of size 1.
PASTRY 3 This third bug in Pastry was discovered in a system of 8 nodes, setting the leafset size to 4. It was a liveness failure – namely that the leafset relationship was not reflexive. The last nail utility pinned it down to step 355, in which node A received an inform message from node B. However, upon adding node B to A’s leafset, Lmax, which is supposed to be the hash id of the furthest node away in the right half of the leafset, was improperly set to the other node in the right half of the leafset. This occurred because of a bug in MaceKeyDiff’s ≤ operator, which was returning the wrong value.
Specifically, if the difference between the two keys was greater than half of the address space, it was incorrectly treated as an overflow. As it turns out, this same bug had appeared in basic testing (without the modelchecker), and had been fixed for the < operator, but not applied to the other.
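The flavor of the bug can be shown with a sketch of a wrap-around key-difference comparison. This is one way such a bug can arise, using a simplified 64-bit stand-in for the 160-bit MaceKeyDiff with illustrative names: treating any difference larger than half the key space as an overflow makes <= disagree with <.

    #include <cstdint>

    // Simplified stand-in for MaceKeyDiff over a 2^64 circular key space.
    struct KeyDiff {
        uint64_t d;  // distance measured clockwise around the ring
    };

    const uint64_t HALF_SPACE = UINT64_C(1) << 63;

    // Buggy comparison (illustrative): a difference larger than half the
    // address space is misread as overflow, inverting the result.
    bool lessEqBuggy(KeyDiff a, KeyDiff b) {
        if (a.d >= HALF_SPACE || b.d >= HALF_SPACE) {
            return b.d <= a.d;  // wrong: flips the answer for large diffs
        }
        return a.d <= b.d;
    }

    // Corrected comparison: the raw distances compare directly.
    bool lessEq(KeyDiff a, KeyDiff b) {
        return a.d <= b.d;
    }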
Though last nail correctly returned the right step, it took a bit to determine the problem. This was aided by looking at the event graph and MDB, as well as directly viewing the logs. The need to look at the logs would be alleviated by making it possible to compare the state more conveniently. Another thing which will be fixed is to make the debugger able to display and filter more than it used to – for example, to compare all leafsets across nodes within a step. (Presently this involves a fair amount of scrolling.)
The process of finding the bug went like this:

1. Last nail is run automatically on any non-live path.

2. The event graph is automatically generated for both the error path and the live path found.

3. The event graph summarizes the property failure, and can be used to see at a glance the bad step (as well as what worked).

4. Switch to mdb, where I skip ahead toward the end to see the details of the failed property and find the nodes which failed it. I note that one of the two nodes happens to be the one event 355 (from last nail) occurred on.

5. Going back to step 355, I see that it is the receipt of an inform message. I suspect that maybe the problem is these two nodes got confused about each other, and one doesn't share the other. Looking back at a later state, I realize that the sender was not one of the two nodes with a problem, though it does still appear in the other node's leafset.

6. I decide I now need to figure out the correct leafset. I go to the log (future mdb) to see all nodes' values of myhash, and sort them manually. Upon further inspection, all nodes have their correct leafset, except the one event 355 occurs on. I conclude something must have happened in step 355 which locked that node into its leafset. I use mdb to reinspect 355, to see both the logs for that event, as well as the state of the node at the end.

7. I briefly go back to the logfile, where it is easier to compare the difference in this node's state between two steps (354 and 355). Not seeing anything particularly out-of-the-ordinary in this exercise, I look at the state of the node at 355 in more detail. This is the point where I notice that the Lmax and Lmin state variables are strangely asymmetric – Lmin is the further away of two nodes, but Lmax is the closer of them. Inspecting the code, I determine this to be a bug.

8. Looking at the logs, I start tracing the code path executed by recompute leaf bounds, the function which is supposed to update Lmax. I conclude the most likely scenario is that something must be broken about comparing MaceKeyDiff objects. This code is not overly well tested, so that makes sense. Careful inspection (and the logs already generated by the run) allows me to determine the problem – that the comparison is incorrect for these values due to an incorrect belief about overflow.

9. Finally, analysis of the suspected bug matches what actually happens, affirming the work done thus far. Specifically, with Lmax containing the wrong value, the larger element of the leafset is never replaced, because no node is closer than Lmax, and if it's further than Lmax it isn't considered. This would cause the leafset to be permanently incorrect. Considering its severity, it can only happen when a node's initial leafset is pretty bad – its nearest neighbor, and one further than half the keyspace away. It is therefore less likely in scenarios with more nodes and with larger leafsets. However, the bug in MaceKeyDiff could have many other subtle effects, which this fixes.
Stats for this bug: it was found in path 5 (the first 4 of which were abandoned as duplicates), therefore at a search depth of 5. The last nail step was 355, with the last nail random position at 373.
PASTRY 4-5 This bug in Pastry was another liveness bug. It was found by applying the prefix search to Pastry. This involves first using the modelchecker to generate live sequences for paths, and then starting the search from that point. This allows the modelchecker to get past the joining phase of the protocol and search from there. Since it takes an 8-node Pastry ring on the order of 700-800 steps to reach liveness, this allows it to search where exhaustive search would not reach.

The specific bug found was one where not all nodes end up joining (note that this implied something happened to un-join one or more of them, since the search began from a known live state). The path found was one where node 5, after restarting, cannot rejoin: after node 0 (the bootstrap node) sees the socket error from node 5, but before 5's second join attempt begins, it receives state from another node (in this case a leafset info message) which mentions node 5. Though 5 will not be added to node 0's leafset, it is added to node 0's routing table. Later, when node 5's join is received by node 0, node 0 forwards it back to node 5, preventing it from being able to join.
There seem to be two potential fixes for this bug. The first is to avoid adding nodes to a routing table without opening a socket to them and probing their liveness. In practice, however, this would be prohibitively expensive. Instead, we will likely take the second approach, which involves having a node verify that a joining node is not in its routing state when a join message is received. However, having further considered this approach, it will fail in cases where the join timer fired and additional join message(s) are submitted from the source node; we would not want to risk the possibility that a join removes the state that an earlier join set up. Therefore, we are pursuing a third possibility: when routing a join message, we avoid routing it via the routing table to the destination node. (I further wonder if a leafset could somehow get state for a joining node – we will let the modelchecker continue to look for this scenario.)

Stats for this bug: it was found in path number 48 of the prefix search. The search path began at step 744, and the last nail (the receipt of the leafset info message which caused node 5 to be re-added to node 0's routing table) was in step 771, with a search depth of only 5. It is unknown whether FreePastry suffers from this bug (it may be possible to check if the FreePastry code protects against this). The specification of Pastry does not sufficiently describe this scenario. It is also possible that Bamboo suffers a similar fate.

A further effect of this bug is a new assertion added to the Pastry code which asserts that you do not receive a join message that you sent. It still would not point to the true last nail step, but would have shortened the overall search effort.
Pastry5: After applying the proposed fix for this bug, the modelchecker was re-run, and found an instance where the fix caused a new problem in which nodes could not re-join after failure. In this case, two nodes failed simultaneously. Though their socket errors remove them from node 0's state, subsequent periodic updates cause them to be re-added to node 0's routing table. Although the new code prevents their join messages from being routed back to them, in this case each one's join message is forwarded to the other (also unjoined) node. Unfortunately, a node cannot process a join message while it itself is joining.

Pastry5 was found using the same search prefix as before. It was found on path number 1952, at a search depth of 5. The last nail was position 821 (receipt of a leafset info message), after the prefix finished at position 744 and the nodes restarted in positions 748 and 749.

As yet, a solution has not been devised for Pastry5.

7.5 MaceTransport

MACETRANSPORT is a MACE (user-level) transport service providing reliable, in-order message delivery with duplicate suppression and TCP-friendly congestion control. MACETRANSPORT is implemented using a basic unreliable UDP transport.
MACETRANSPORT 1 MACEMC has found its first MaceTransport bug. The bug caused the same message to be delivered to the receiver twice. In more detail: 1) The source sends the first message in a new connection (syn flag is set). 2) The receiver receives and delivers it and sends an ack. 3) Before the source receives the ack, the retransmission timer fires, so the source retransmits the message. 4) The receiver receives the duplicate. Because the syn flag is set, the receiver concludes that the source died and came back, so it erases all state associated with the connection, creates new connection state, and then delivers the duplicate.

The fix that I have implemented stores the initial sequence number for the connection, and if the transport receives another message with this sequence plus the syn flag, it treats that message as a duplicate.
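A sketch of that fix, with illustrative names (Connection, initialSeq, and the packet layout are assumptions, not the actual MACETRANSPORT code):

    #include <cstdint>

    struct Packet {
        uint64_t seq;
        bool syn;
    };

    struct Connection {
        uint64_t initialSeq;  // remembered first sequence number of the connection

        // Receiver side: a retransmitted first packet carries the syn flag and
        // the remembered initial sequence number; treat it as a duplicate
        // instead of concluding the source died and resetting connection state.
        bool isDuplicateSyn(const Packet& p) const {
            return p.syn && p.seq == initialSeq;
        }
    };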
MACETRANSPORT 2 MACEMC found its second MaceTransport bug today. The bug manifested itself as a failed map lookup, where state that had been deleted was referenced. The problem was that the retransmission timer was not canceled for all inflight messages associated with a connection that was being closed. When a connection is closed, all of its state is erased. A new connection to the same destination was then opened. When the retransmission timer fired for an old message, it actually executed because the guard thought that the connection was still present, as it had been re-opened. The timer then tried to reference the deleted state and caused an assertion to fail.

The fix is that all timers for inflight messages must be canceled when closing the connection. We also added a new safety property that would have caught this problem in the first place. This bug was exercised by the receipt of a Reset message, but we believe that it would have also been triggered by a message timeout causing a connection error. I tracked down this bug primarily using my visualizer.
MACETRANSPORT 3 This is the writeup for the liveness/performance bug (number 3) that we discussed today. When syn packets were delivered out-of-order, the receiver may expect different sequence numbers than the sender, and no data will be sent or delivered on the connection. This situation will be corrected by the retransmission timer closing the connection with an error, which is why we are classifying it as a performance bug. It was classified in the modelchecker as a liveness bug because the retransmission timer never closed connections during a random walk due to biased weights.
The fix is to have each sender send a monotonically increasing identifier (implemented as a fine-grain timestamp) in the syn packet. The receiver maintains the most recent identifier from each sender and ignores syn packets with an identifier that is old.
The modelchecker also discovered two errors in the initial attempted fixes. The first was that I omitted a necessary check in an if-else clause, which caused an out-of-order syn identifier to still be processed. This was fixed by adding the check and clause. The second was that the identifier was initially stored in the incoming connection state, which was lost when the incoming connection was closed or reset. The fix for this was to store the identifier in a separate state variable map that persists across connections.
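Putting the fix and both lessons together, a sketch (keying by source address, the lastSynId map, and the timestamp type are illustrative assumptions, not the actual MACETRANSPORT code):

    #include <cstdint>
    #include <map>

    using NodeAddr = uint32_t;
    using SynId = uint64_t;  // e.g., a fine-grain timestamp from the sender

    // Kept outside any per-connection state so it survives connection
    // close/reset (the second lesson from the initial fixes).
    std::map<NodeAddr, SynId> lastSynId;

    // Returns true if the syn should be accepted; stale (out-of-order) syn
    // identifiers are ignored rather than resetting the connection.
    bool acceptSyn(NodeAddr src, SynId id) {
        auto it = lastSynId.find(src);
        if (it != lastSynId.end() && id <= it->second) {
            return false;  // old syn delivered out-of-order: ignore it
        }
        lastSynId[src] = id;
        return true;
    }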
Finally, I will point out that the modelchecker is extremely valuable for finding bugs related to out-of-order packet delivery, as it is virtually impossible to test out-of-order packets on a LAN.
MACETRANSPORT 4 Out-of-order packet delivery to the transport caused an internal assertion to fire that prevented duplicate message delivery to the application. The fix is to increment a counter when delivering buffered messages that arrived out-of-order.

MACETRANSPORT 5 Out-of-order packet delivery caused the transport to send an acknowledgment for a future sequence number, which should never happen, instead of the most recently received packet. The fix was correctly setting the ack sequence number when packets arrive out of order.

MACETRANSPORT 6 Out-of-order packet delivery caused the transport to continue to retransmit packets that had already been acknowledged. On receipt of an acknowledgment with a sequence number not found in the inflight message map (i.e., the sequence number is greater than any inflight message), the transport would previously do nothing. The fix is to check if the sequence number is greater than all inflight messages, and, if so, have the transport mark all inflight messages as acknowledged.

MACETRANSPORT 7 Out-of-order packet delivery caused a receiver to delete buffered data upon receipt of an ack when the data will need to be retransmitted. A somewhat complicated series of events needs to occur to demonstrate this performance/bounded-liveness bug (note that the retransmission timer will fix this condition). Essentially, things get messed up because when a reset is received, both incoming and outgoing state get erased. This is problematic when resets are received out-of-order, as the incoming connection state is lost even though the sender does not realize this. See the event graph for full details. The fix is to be able to close the connection in only a single direction: on a reset, close the outgoing connection; on a syn, close the incoming connection; on a timeout, close both directions.

MACETRANSPORT 8 Out-of-order plus duplicate packet delivery caused the transport to attempt to deliver a fragmented message before all the fragments had been received. The fix is to check if an out-of-order received message is already buffered, and, if so, not process it again.

MACETRANSPORT 9 In another performance bug, the transport was sending acknowledgments for the most recently received packet, rather than the cumulative last in-order packet. The fix is to check for out-of-order buffered messages before sending the acknowledgment. The modelchecker did not directly find this bug (there were no assertions for this property), but rather it was discovered while inspecting the event graphs for other runs.

MACETRANSPORT 10 The Clear-To-Send timer was not being canceled when the connection signaled an error, causing too many CTS callbacks to be made to the application. This was caught by a safety property. For some reason, this bug manifested itself when running with three nodes, but not with only two.

7.6 SimulatorTCP

SIMULATORTCP is the MACE simulator of a TCP transport service. It simulates sockets with in-order delivery of messages between peers, and application buffering by capping the number of messages it is willing to buffer. With the exception that it does not use an external thread, this closely mimics the actual behavior of the real TCP service in MACE. It can also simulate node failure and socket failure by injecting READ errors at the appropriate places. However, this very tricky additional simulation functionality took a long time to reason about, and so its implementation was feared to contain bugs. On initial testing and running of RandTree with failures, the simulator ran into some assertions placed to notice if things were running amok, which were trapped by the simulator. It was then possible to explore the cause of the assertion, since the specific path which caused it was available. After catching this, we generalized by coding the error model into a safety property which is now checked by the modelchecker explicitly, and found additional bugs this way.

SIMULATORTCP1 This bug was caught by an internal assertion in the SimulatorTCP service before writing explicit safety properties. In this case, a node reset signal sent to a remote node was exercising both the node reset code and the bi-directional socket error code, due to a missing break in the switch statement.

SIMULATORTCP2 This bug was caught by the safety properties written for the SimulatorTCP service. It was exercised by a read error being received by a destination node. After the read error was retrieved from the source node, its local state continued to reflect that it had an outgoing error, because the code which should have cleared this flag was buggy. The index into the flag vector had been written as the boolean itself, rather than the node number.

SIMULATORTCP3 This bug was caught by the safety properties written for the SimulatorTCP service. It was exercised when a peer reset happened when both nodes were in the bi-directional pending error state. The case had simply been improperly reasoned about, and the remoteError flag was not being cleared as it should have been.

8 Related Work

Our work is related to several techniques for finding errors in software systems that fall under the broad umbrella of Model Checking.
Classical Model Checking. The idea of viewing a system as a graph of states and transitions between states, and thus of analyzing systems by graph search algorithms, goes back at least to the late seventies [3]. "Model Checking," i.e., checking that a system described as a graph (or a Kripke structure) is a model of a temporal logic formula, was independently invented in [11, 29]. Advances like Symmetry Reduction, Partial-Order Reduction, and Symbolic Model Checking have enabled the practical analysis of hardware circuits [4, 25], cache-coherence and cryptographic protocols [14], and distributed systems and communications protocols [20]. The idea of state-hashing that MACEMC uses was introduced in SPIN. However, the tools described above require the software system being analyzed to be specified in a somewhat tool-specific language, with the state graph of the system constructed either before or during the analysis. Thus, while they are excellent for quickly finding specification errors early in the design cycle, it is difficult to use them to verify the implementations of systems.
Model Checking by Random Walks. West [32] proposed the idea of using random walks to analyze networking protocols whose state spaces were too large for exhaustive search. Sivaraj and Gopalakrishnan [31] propose a method for iterating exhaustive search and random walks to find bugs in cache coherence protocols. Both of the above were applied to systems described in specialized languages yielding finite-state systems, and moreover, to check safety properties. In contrast, MACEMC uses random walks to find liveness bugs by classifying indeterminate states as dead or recovering, and further, to pinpoint the critical transition.
Model Checking by Systematic Execution. Two model checkers that directly analyze implementations written in C and C++ are VERISOFT [15] and CMC [27]. VERISOFT, from which MACEMC takes the idea of bounded iterative DFS, views the entire system as composed of several processes that communicate by means of message queues, semaphores, and shared variables that are visible to VERISOFT. VERISOFT acts as a scheduler for these processes and traps calls made to access shared resources. By choosing which process will execute at each such trap point, the scheduler is able to exhaustively explore all possible interleavings of the processes' executions. In addition, it performs stateless search and partial-order reduction, owing to which it has found critical errors in a variety of complex programs. Unfortunately, VERISOFT is unable to exploit the atomicity of MACE's transitions, and this, combined with the stateless search, meant that it was unable to exhaustively search out to the depths required to find the bugs MACEMC found. A more recent approach, CMC [27], also directly executes the code and explores different executions by sitting at the scheduler level. However, instead of using partial-order reductions, it stores states by hashing to avoid re-exploring states. CMC has found errors in implementations of network protocols [26] and file systems [34]. JAVA PATHFINDER [18] takes an approach similar to CMC for JAVA programs. Unlike VERISOFT, CMC, and JAVA PATHFINDER, MACEMC addresses some of the challenges of finding liveness violations in systems code and simplifying the task of isolating the cause of a violation.
Model Checking by Abstraction. A different approach to model checking software implementations is to first abstract them to obtain a finite-state model of the program, which is then explored exhaustively [3, 5, 7, 8, 12, 16, 21] or up to a bounded depth using a SAT solver [10, 33]. This model can be obtained either by restricting the types of variables to finite values, as is done in FEAVER [21], BANDERA [12], and MOPS [8], or by using predicate abstraction [3, 16], as is done in SLAM [6] and BLAST [19]. A related technique is bounded model checking, implemented in CBMC and SATURN, where all executions up to a fixed depth are encoded as a propositional formula, which is then solved using fast modern SAT solvers to find bugs. Of the above, only FEAVER and BANDERA can be used for liveness checking of concurrent programs, and they require a user to manually specify how to abstract the program into a finite-state model.
Isolating Causes from Violations. Naik et al. [28] and Groce [17] propose ways to isolate the cause of a safety violation by computing the difference between a violating run and the closest non-violating one. MACEMC instead uses a combination of random walks and binary search to isolate the critical transition causing a liveness violation, and then uses a live path with a common prefix to help the programmer understand the root cause of the bug.

9 Conclusions


Correctness conditions for software systems typically fall into two categories, safety and liveness. Safety violations correspond to known bad operations. Liveness violations correspond to cases where the system is unable to ever achieve some desirable global system property after some event, a critical transition, takes place. Specifying and checking for both safety and liveness properties is important for building robust software systems. In this paper, we argue that existing model checkers of systems implementations cannot find liveness violations because doing so would require searching further into the state space than is feasible anytime in the foreseeable future. We

bridge this gap by developing probabilistic techniques to strategically sample the state space beyond what can be explored exhaustively. We present the design, implementation, and evaluation of a fully functional model checker, MACEMC. We detail our experience with running three complex, mature systems through MACEMC, finding a variety of both safety and liveness violations with no false positives. Finally, we present a series of tools, including an interactive system state debugger and event graph visualization, to ease the problem of finding and fixing reported liveness violations in systems code.

References

[1] FreePastry: an open-source implementation of Pastry intended for deployment in the Internet. http://freepastry.rice.edu, 2006.

[2] Mace: A toolkit and language for building distributed systems. http://mace.ucsd.edu, 2006.

[3] Agerwala, T., and Misra, J. Assertion graphs for verifying and synthesizing programs. Tech. Rep. 83, University of Texas, Austin, 1978.

[4] Alur, R., Henzinger, T., Mang, F., Qadeer, S., Rajamani, S., and Tasiran, S. MOCHA: modularity in model checking. In CAV 98: Computer-aided Verification, A. Hu and M. Vardi, Eds., LNCS 1427. Springer-Verlag, 1998, pp. 521–525.

[5] Ball, T., and Rajamani, S. The SLAM project: debugging system software via static analysis. In POPL 02: Principles of Programming Languages (2002), ACM, pp. 1–3.

[6] Ball, T., and Rajamani, S. K. Boolean programs: a model and process for software analysis. Tech. Rep. MSR-TR-2000-14, Microsoft Research, 2000.

[7] Cook, B., Podelski, A., and Rybalchenko, A. Termination proofs for systems code. In PLDI 06: Programming Language Design and Implementation (2006), ACM, p. to appear.

[8] Chen, H., and Wagner, D. MOPS: an infrastructure for examining security properties of software. In ACM CCS 02: Conference on Computer and Communications Security (2002), ACM, pp. 235–244.

[9] Clarke, E., Grumberg, O., and Peled, D. Model Checking. MIT Press, 1999.

[10] Clarke, E., Kroening, D., and Lerda, F. A tool for checking ANSI-C programs. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2004) (2004), K. Jensen and A. Podelski, Eds., vol. 2988 of Lecture Notes in Computer Science, Springer, pp. 168–176.

[11] Clarke, E. M., and Emerson, E. A. Synthesis of synchronization skeletons for branching time temporal logic. In Logic of Programs (1981), LNCS 131, Springer-Verlag, pp. 52–71.

[12] Corbett, J., Dwyer, M., Hatcliff, J., Pasareanu, C., Robby, Laubach, S., and Zheng, H. Bandera: Extracting finite-state models from Java source code. In ICSE 00: International Conference on Software Engineering (2000), pp. 439–448.

[13] Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., and Stoica, I. Towards a Common API for Structured Peer-to-Peer Overlays. In 2nd International Workshop on Peer-to-peer Systems (IPTPS '03) (Feb. 2003).

[14] Dill, D., Drexler, A., Hu, A., and Yang, C. H. Protocol verification as a hardware design aid. In ICCD 92: International Conference on Computer Design (1992), IEEE Computer Society Press, pp. 522–525.

[15] Godefroid, P. Model checking for programming languages using VeriSoft. In POPL 97: Principles of Programming Languages (1997), ACM, pp. 174–186.

[16] Graf, S., and Saïdi, H. Construction of abstract state graphs with PVS. In CAV 97: Computer-aided Verification, LNCS 1254. Springer-Verlag, 1997, pp. 72–83.

[17] Groce, A., and Visser, W. What went wrong: Explaining counterexamples. In SPIN 03: Spin Model Checking and Software Verification (2003), pp. 121–135.

[18] Havelund, K., and Pressburger, T. Model checking Java programs using Java PathFinder. Software Tools for Technology Transfer (STTT) 2(4) (2000), 72–84.

[19] Henzinger, T., Jhala, R., Majumdar, R., and Sutre, G. Lazy abstraction. In POPL 02: Principles of Programming Languages (2002), ACM, pp. 58–70.

[20] Holzmann, G. The Spin model checker. IEEE Transactions on Software Engineering 23, 5 (May 1997), 279–295.

[21] Holzmann, G. Logic verification of ANSI-C code with SPIN. In SPIN 00: Spin Model Checking and Software Verification (2000), LNCS 1885, Springer, pp. 131–147.

[22] Jannotti, J., Gifford, D. K., Johnson, K. L., Kaashoek, M. F., and O'Toole, J. W. Overcast: Reliable Multicasting with an Overlay Network. In OSDI 00: USENIX Symposium on Operating Systems Design and Implementation (October 2000).

[23] Kostić, D., Rodriguez, A., Albrecht, J., Bhirud, A., and Vahdat, A. Using Random Subsets to Build Scalable Network Services. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (March 2003).

[24] Kostić, D., Rodriguez, A., Albrecht, J., and Vahdat, A. Bullet: High bandwidth data dissemination using an overlay mesh. In SOSP 03: ACM Symposium on Operating Systems Principles (Oct 2003).

[25] McMillan, K. L. A methodology for hardware verification using compositional model checking. Science of Computer Programming 37, (1–3) (2000), 279–309.

[26] Musuvathi, M., and Engler, D. R. Model checking large network protocol implementations. In NSDI 04: USENIX/ACM Symposium on Networked Systems Design and Implementation (2004), pp. 155–168.

[27] Musuvathi, M., Park, D., Chou, A., Engler, D., and Dill, D. CMC: A pragmatic approach to model checking real code. In OSDI 02: Operating Systems Design and Implementation (2002), ACM.

[28] Naik, M., Ball, T., and Rajamani, S. From symptom to cause: Localizing errors in counterexample traces. In POPL 03: Principles of Programming Languages (2003), ACM, pp. 97–105.

[29] Queille, J., and Sifakis, J. Specification and verification of concurrent systems in CESAR. In Fifth International Symposium on Programming, M. Dezani-Ciancaglini and U. Montanari, Eds., LNCS 137. Springer-Verlag, 1981, pp. 337–351.

[30] Rowstron, A., and Druschel, P. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-Peer Systems. In Middleware 2001 (Nov 2001).

[31] Sivaraj, H., and Gopalakrishnan, G. Random walk based heuristic algorithms for distributed memory model checking. Electr. Notes Theor. Comput. Sci. 89, 1 (2003).

[32] West, C. H. Protocol validation by random state exploration. In Proceedings of the 6th IFIP WG 6.1 International Workshop on Protocol Specification, Testing, and Verification (1986), North-Holland, pp. 233–242.

[33] Xie, Y., and Aiken, A. Scalable error detection using boolean satisfiability. In POPL 05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (New York, NY, USA, January 2005), vol. 40, ACM Press, pp. 351–363.

[34] Yang, J., Twohey, P., Engler, D. R., and Musuvathi, M. Using model checking to find serious file system errors. In OSDI 04: Operating Systems Design and Implementation (2004), pp. 273–288.
