Replay: Recreate Every Single Bug

Do you ever spend lots of time trying to understand and recreate a bug scenario? Is there a bullet-proof way to reproduce every single bug? By logging the right information, it should be possible.

Reproducing bugs in complex systems is often hard. Even if the use case that caused the bug is explained in detail, this may not help much. Other factors, such as database state, configuration, timers, network activity and randomness, affect code execution. Without exact knowledge of these factors, you cannot determine which path was taken through your code. Often, debugging information is written to a trace log file to mitigate the problem. However, this is intrusive and clutters the code. Furthermore, by the time a bug is found in a live system, it is too late to add more tracing to the code. We need something else.

As noted above, to reproduce a bug we need to know exactly which path was taken through our code. To know that, we need to know which decision was taken at every branch (if, for, switch, etc.). In certain circumstances, this might actually be possible.

Recording the Input

First, let’s assume we have a module without concurrent data access (either single-threaded, by explicit synchronization or by design). Second, we identify incoming events to the system, such as calls on a public API, GUI user interaction or network packets. They are the entry points to your code. Third, we identify all other external sources of data that might affect the execution. For example, calling a readFromDatabase function will return some data. The use of this data will affect the code execution. Thus, we treat all reads from the database as input to our module. The same goes for configuration, random values and all the other factors mentioned earlier. Combined, we think of the incoming events and the data from external functions as the input to our module.

Fourth, after identifying all module input, we introduce a mechanism to eavesdrop on the incoming data. For each event (e.g. callback userClickedButton or call to a public API function), we store the function arguments to a file. Let’s call this file the interaction log. Similarly, for each external function call (such as a database read), we store the return value or exception thrown.
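To make the recording step concrete, here is a minimal sketch of what such an interaction log could look like. The class name InteractionLog, its method names and the tab-separated line format are all my own assumptions for illustration. A real log would also need a proper serialization format for arguments and return values.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;

// Hypothetical append-only interaction log: one line per recorded entry.
// Pass a FileWriter to log to a file, or a StringWriter for testing.
class InteractionLog {
    private final Writer out;

    InteractionLog(Writer out) {
        this.out = out;
    }

    // Record an incoming event (an entry point into the module) and its arguments.
    void recordEvent(String name, Object... args) {
        writeLine("EVENT\t" + name + "\t" + Arrays.toString(args));
    }

    // Record the value an external function returned to the module.
    void recordReturn(String function, Object value) {
        writeLine("RETURN\t" + function + "\t" + value);
    }

    // Record the exception an external function threw instead of returning.
    void recordThrow(String function, Throwable error) {
        writeLine("THROW\t" + function + "\t" + error.getClass().getName());
    }

    private void writeLine(String line) {
        try {
            out.write(line + "\n");
            out.flush(); // flush eagerly so the log survives a crash
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A callback such as userClickedButton would call recordEvent before doing its normal work, and every external read would be followed by recordReturn or recordThrow.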

Replaying From File

In order to reproduce the scenario of the bug, we replay the events from the file by injecting them as function calls into the module. The module will execute its code until it reaches an external function call such as readFromDatabase. Instead of calling the function, we retrieve the return value (or exception) from the file and return (or throw) that instead.

Now, there are a couple of challenges when implementing this approach. First, we’ve restricted ourselves to code that never accesses data concurrently. There’s probably nothing we can do about this. In my experience, though, it is better to organize the software to avoid these concurrency problems anyway, since the alternative is just too painful (see post on concurrency).

Second, we want the calls from our module to an external function (e.g. readFromDatabase) to either call a real function (e.g. in the database implementation) or to replay from file. Obviously, we don’t want our module to be aware of whether we are replaying or not. In object-oriented code, we can achieve this by implementing an interface. The module talks to the interface, and behind the interface, there’s either a real entity (the database) or something that replays from file. Thus, all calls from our module to external functions must go through an interface, and never to a concrete implementation directly.
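As a rough illustration of the interface idea, consider the sketch below. Everything except the readFromDatabase name is made up for the example: the Database interface, the two implementations and the GreetingModule are hypothetical, and the replay side reads from an in-memory list where a real one would read from the interaction log file.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical interface for the external dependency; the module only sees this type.
interface Database {
    String readFromDatabase(String key);
}

// Stand-in for the real entity behind the interface.
class RealDatabase implements Database {
    @Override
    public String readFromDatabase(String key) {
        return "live value for " + key;
    }
}

// Replays previously recorded return values in the order they were logged.
class ReplayDatabase implements Database {
    private final Deque<String> recorded;

    ReplayDatabase(List<String> recordedReturnValues) {
        this.recorded = new ArrayDeque<>(recordedReturnValues);
    }

    @Override
    public String readFromDatabase(String key) {
        // Throws NoSuchElementException if the module asks for more than was logged.
        return recorded.removeFirst();
    }
}

// The module under investigation; it cannot tell which implementation it was given.
class GreetingModule {
    private final Database database;

    GreetingModule(Database database) {
        this.database = database;
    }

    String greet(String userId) {
        return "Hello, " + database.readFromDatabase(userId);
    }
}
```

The module is constructed with either new RealDatabase() or new ReplayDatabase(...), and its own code is identical in both cases.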

Third, external function calls (such as a database read) have arguments. What if the external function decides to manipulate one of the arguments? For example, it could call a setter method or change a public member. We would have a real problem. The replayed execution would not call the setter method, and the system state would not be equivalent to the bug scenario. Now, to me, manipulating the arguments of a function is poor style. Doing it in an API (for example, in a database system) is even worse. So hopefully, situations like these are rare. But when they do arise, we will have to restructure our code slightly if we want to be able to replay.

Implementing the Replay Functionality

Java has a very handy mechanism to support the implementation of the replay functionality: reflection. We might have a large number of interfaces through which we call external functions. Nevertheless, using reflection, we can create a single wrapper class that can handle all interfaces. Let’s call it LoggingWrapper. We would create an instance of LoggingWrapper and supply it with an instance of the real class (e.g. the database implementation). We would give the LoggingWrapper object to our module, and our module would think it is talking to the real entity (the database). When our module calls an external function (e.g. readFromDatabase), the LoggingWrapper would forward the function call to the real entity (database) and then log the return value (or exception thrown) to file. If we don’t want to log to file, we simply don’t create a LoggingWrapper and hand the real object to the module directly. Thus, we would not suffer any performance penalty.
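One way to build such a wrapper is with dynamic proxies (java.lang.reflect.Proxy), which is the part of Java’s reflection API that lets a single handler stand in for any interface. The sketch below is only that, a sketch: the wrap factory method, the log line format and the in-memory list that stands in for the log file are my assumptions, and the Database interface is the same hypothetical one as before.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;

// Hypothetical external interface used by the module.
interface Database {
    String readFromDatabase(String key);
}

// One wrapper for any interface: forwards each call to the real object and logs the outcome.
class LoggingWrapper implements InvocationHandler {
    private final Object real;
    private final List<String> log; // in-memory stand-in for the interaction log file

    private LoggingWrapper(Object real, List<String> log) {
        this.real = real;
        this.log = log;
    }

    // Creates a proxy that implements 'type' and records every call made through it.
    static <T> T wrap(Class<T> type, T real, List<String> log) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, new LoggingWrapper(real, log));
        return type.cast(proxy);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        try {
            Object result = method.invoke(real, args);
            log.add("RETURN " + method.getName() + " " + result);
            return result;
        } catch (InvocationTargetException e) {
            // Unwrap so the module sees the exception the real object threw.
            Throwable cause = e.getCause();
            log.add("THROW " + method.getName() + " " + cause.getClass().getName());
            throw cause;
        }
    }
}
```

With this in place, wrapping is a one-liner such as LoggingWrapper.wrap(Database.class, realDatabase, log), and the same class works unchanged for any other interface the module depends on.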

When we want to replay from file, we create a ReplayWrapper and give that to our module. From some other class (e.g. EventReplayer), we would read an event and its arguments (e.g. “the user clicked button X”) from the file and call the corresponding function on the module. When an external function is called (e.g. readFromDatabase), the ReplayWrapper would read a return value (or exception) from file and return it (or throw the exception). As an extension, we could also verify the integrity of the interaction log while replaying. The downside is that this requires some extra information to be recorded in the log (such as the full name and argument values of each function call). An integrity check would be able to detect a number of things, for example that the arguments to an external function call differ from those recorded when the log was written.
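The replay side can be sketched with the same dynamic proxy technique. Again, this is an illustration under assumptions rather than a finished design: RecordedCall and the replay factory method are hypothetical names, the recorded calls are passed in as a list where a real implementation would parse them from the log file, and arguments are compared by their string form, which is a crude stand-in for a real integrity check.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical external interface used by the module.
interface Database {
    String readFromDatabase(String key);
}

// One logged external call: what was called, with which arguments, and how it ended.
// If 'error' is non-null the call threw; otherwise it returned 'result'.
class RecordedCall {
    final String method;
    final String arguments;
    final Object result;
    final Throwable error;

    RecordedCall(String method, String arguments, Object result, Throwable error) {
        this.method = method;
        this.arguments = arguments;
        this.result = result;
        this.error = error;
    }
}

// Answers every call from the log instead of contacting the real entity.
class ReplayWrapper implements InvocationHandler {
    private final Deque<RecordedCall> recorded;

    private ReplayWrapper(List<RecordedCall> recorded) {
        this.recorded = new ArrayDeque<>(recorded);
    }

    static <T> T replay(Class<T> type, List<RecordedCall> recorded) {
        Object proxy = Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, new ReplayWrapper(recorded));
        return type.cast(proxy);
    }

    // Note: Object methods such as toString also arrive here; a fuller
    // version would handle them separately instead of consuming log entries.
    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        RecordedCall next = recorded.pollFirst();
        String actualArguments = Arrays.toString(args);
        // Integrity check: the replayed call must match the logged call.
        if (next == null
                || !next.method.equals(method.getName())
                || !next.arguments.equals(actualArguments)) {
            throw new IllegalStateException(
                    "Replay diverged from log at " + method.getName() + actualArguments);
        }
        if (next.error != null) {
            throw next.error;
        }
        return next.result;
    }
}
```

An EventReplayer would then read the logged events in order and call the matching entry points on a module that was constructed with ReplayWrapper.replay(Database.class, recordedCalls).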

You could imagine implementing the replay functionality separately for each module. But a replay implementation is complex enough that doing it over and over would be wasteful. A general-purpose replay framework would be useful. For fun, I have started sketching one. Time will tell if/when it will be in good enough shape to be released to the public. All design/code/idea contributions are welcome. :)

The Myths of Innovation

I just finished reading The Myths of Innovation by Scott Berkun. It’s a decent book, and probably required reading nowadays when innovation (whatever it means) is the coolest buzzword of all.

The moral of the book is: there’s no shortcut to creating the next Facebook or Google. You will need an idea, some skills, but most of all, lots and lots of hard work. Newton didn’t invent physics when he got hit by that apple (assuming it really happened); he worked on the problem for years. Unfortunately, after putting in all that hard work, you will still need a lot of luck in order to succeed (adoption, timing, press coverage, competition and so forth). Tough.