Measure Your Code Un-coverage

Maximizing code coverage is not the way to maximize the benefits of unit testing. Instead, (1) identify the most important user scenarios and (2) measure and analyze the code “un-coverage”.

Why do we unit test? There’s a multitude of reasons. We want to feel confident our code works, and we want fast feedback. Other benefits are related more to the structure of the code: good tests imply good testability, which in turn implies proper decoupling, user-friendly APIs and so on. But what do we unit test?

Code coverage is a metric often tied to unit testing. Code coverage can mean many things. The most commonly used metric is “line coverage”, which compares the code lines exercised by all your tests with the total number of lines (normally expressed in ELoCs, effective lines of code = code lines excluding blank lines, comments and curly braces). All coverage metrics work like this, comparing something (e.g. ELoCs) with the total number. Other coverage metrics are “function coverage”, “branch coverage” and “path coverage”.

Having a 60% line coverage means your test execution has touched 60% of your code base. It means close to nothing, however, in terms of quality assurance. You could hit 60% of the code lines and still miss the majority of your most important user scenarios. On the flip side, having a 60% line coverage also means 40% of the code lines have never been executed. Now this is useful information. If a line of code has never been executed, there is a chance the code cannot even be executed without crashing. Or formatting the hard drive, you never know.
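To make that concrete, here is a minimal, hypothetical sketch (the Discount class and its test are invented for this post): a single happy-path test gives respectable line coverage, yet the business rule that matters most to the user is never executed.

```java
// Discount.java -- hypothetical production code
public class Discount {
    public int apply(int price, int percent) {
        int discounted = price - (price * percent) / 100;
        if (percent > 50) {             // the rule the business really cares about...
            discounted = price / 2;     // ...is never executed by the test below
        }
        return discounted;
    }
}

// DiscountTest.java -- one happy-path test: decent line coverage, poor assurance
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class DiscountTest {
    @Test
    public void tenPercentOffOneHundred() {
        assertEquals(90, new Discount().apply(100, 10));
    }
}
```

A line-coverage report happily counts most of apply() as covered, while the un-coverage report points straight at the capped-discount branch that no scenario exercises.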

So instead of focusing on covering your code with tests, follow this procedure:

  1. Identify the most important user scenarios and alternate paths and implement them as unit tests (or tests on other levels if more appropriate, e.g. component, integration or system level).
  2. Measure your code “un-coverage” and observe which parts of the code are never touched by your test cases. Come up with user scenarios that will exercise the un-covered code. Write tests.

You might still have uncovered code after this procedure. Obtaining 0% code un-coverage (100% code coverage) is expensive. All testing is a trade-off between the risk of releasing something broken and the cost of testing it. When you have covered what is important to your end-user and analyzed the remaining un-covered code, functionality-wise your tests are in great shape even if your coverage is nowhere near 100%.

As a side note: in the case of test-driven development, we will end up with close to 0% un-covered code. But with TDD, step 1 is more important than ever. It is easy to get carried away covering each line of production code with tests and lose focus on what’s important: testing the right thing, i.e. the things that matter most to your end-users. Happy testing.

How Do We Know Our Tests Work Tomorrow?

When writing a test case, good practices suggest that we verify the test can fail. But how can we know the test code doesn’t break later?

How do we know our test code works? It helps to apply a test-first mentality to the largest extent possible. This means writing a failing unit test, followed by writing production code to pass the test. As a bonus, “failing first” verifies that the test does what it is meant to do: verify the behavior of your code. We make sure the test can actually fail when the code is broken. As with any code, test code is expected to have bugs in it, so the correctness of your test case needs to be verified. All good.
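As a sketch of what that looks like in practice (the BoundedStack below is a made-up example, not from any particular project): the test is written and run first, fails against a deliberately dumb implementation, and only then is the production code filled in.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class BoundedStackTest {
    @Test
    public void pushedItemIsOnTop() {
        BoundedStack stack = new BoundedStack(10);
        stack.push(42);
        assertEquals(42, stack.top());   // red first: top() still returns a dummy value
    }
}

// Just enough production code to compile; the wrong return value lets us
// see the test fail before we make it pass.
class BoundedStack {
    private final int[] items;
    private int size;

    BoundedStack(int capacity) { items = new int[capacity]; }
    void push(int value) { items[size++] = value; }
    int top() { return -1; }             // replaced by "return items[size - 1];" once red
}
```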

Robert Martin compares unit tests to dual-entry bookkeeping, which is a nice analogy: “Accountants who don’t hold to the GAAP [Generally Accepted Accounting Principles] tend to wind up in another profession, or behind bars. Dual Entry Bookkeeping is the simple practice of entering every transaction twice; once on the debit side, and once on the credit side.” The tests (debit) make sure the production code (credit) is correct, while the production code makes sure the test is correct. At least this holds when the test code is first written using the test-first methodology above. But does it apply later on?

I have said the following on several occasions: “How can this not work? We have a test case covering exactly this!” Digging deep, it turns out that the test code broke at some point, and I dare not look in the code repository to see how long ago. Maybe the code was refactored, and an assert was somehow invalidated or removed. (Examples in funny languages: I had a bash script testing an app. I accidentally removed an “exit 1” statement, causing the script to never fail. Found out by accident when the app was malfunctioning! On another occasion, I wrote this in JavaScript: “test.notEquals(200, response.statusCode)”. I later refactored the test to use “response.status_code”. The problem was that in the production code I had renamed the field to “status” (doh!), so “response.status_code” evaluated to “undefined”, and the assert could never fail. Found out by accident when the app was malfunctioning! I also have numerous examples in Java…) We have test code to make sure our production code works. The test code is meta code, looking at the real code. But what mechanisms are there to guarantee that the test code can still actually fail? What about the meta-meta level?
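For a Java cousin of those stories, here is a hypothetical example of the most common way I have seen a test silently lose its teeth: an exception-expecting test where the fail() call goes missing during a refactoring, after which the test can never fail again.

```java
import org.junit.Test;
// import static org.junit.Assert.fail;   // the import that quietly disappeared too

public class AccountTest {

    static class Account {
        void withdraw(int amount) {
            if (amount < 0) throw new IllegalArgumentException("negative amount");
        }
    }

    @Test
    public void rejectsNegativeAmount() {
        try {
            new Account().withdraw(-1);
            // fail("expected IllegalArgumentException") belongs on this line.
            // Once it is lost, the test passes whether or not withdraw() checks
            // its argument -- the Java version of the missing "exit 1" above.
        } catch (IllegalArgumentException expected) {
            // expected
        }
    }
}
```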

Introducing the “test case test case” (meta level 2): I was thinking about using dependency injection to feed my test case (meta level 1) with a mock application (faking meta level 0). Perhaps we could feed the mock application with erroneous behavior in order to trigger the asserts in the test case (meta level 1). Fantastic! But this makes me concerned about the correctness of the “test case test case”. Hmm, better go into meta level 3. Ok ok, sort of kidding of course.
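Half-joking or not, a sketch of that dependency-injection idea could look something like the following (all names invented): the ordinary assertion at meta level 1 is extracted so that a “test case test case” at meta level 2 can feed it a deliberately broken fake and demand that it actually fires.

```java
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class PriceAssertTest {

    interface PriceService {
        int priceOf(String item);
    }

    // Meta level 1: the assertion the ordinary tests rely on, extracted for reuse.
    static void assertReasonablePrice(PriceService service) {
        if (service.priceOf("book") <= 0) {
            throw new AssertionError("price must be positive");
        }
    }

    // Meta level 2: a broken fake must make the level-1 assert blow up.
    @Test
    public void priceAssertFiresForBrokenImplementation() {
        PriceService broken = item -> 0;   // erroneous behavior on purpose
        boolean fired = false;
        try {
            assertReasonablePrice(broken);
        } catch (AssertionError expected) {
            fired = true;
        }
        assertTrue("the meta level 1 assert should have fired", fired);
    }
}
```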

So that was half a joke, but I’m serious about the problem. I see this over and over, and it bothers me that we have tests but cannot be sure they do what they should. Over time, I would not be surprised if more and more test cases break in most software projects. Sure, I sometimes go into my production code and “return zero instead of 200”, just to make sure something breaks in the test suite, but that doesn’t really scale… I want regression tests for my test cases! :) Any ideas?

Behavior Driven Development

Behavior Driven Development (BDD) is a flavor of Test Driven Development (TDD). In BDD, we have a specification focus instead of a test focus. What does that mean? And how can it help us write better software?

I just watched Dave Astels’ talk on “Beyond Test Driven Development: Behaviour Driven Development” (video). The first time I ran across Behavior Driven Development (BDD), it struck a chord. It is a branch of Test Driven Development (TDD) that suits my style. And as always, hearing someone else’s viewpoint makes you think about your own practices. I definitely learned a lot from the video, which describes the RSpec BDD framework (for Ruby). Let’s review some highlights of BDD.

As mentioned, BDD is related to TDD. In the video, Astels stresses that we should try to get away from the testing mindset and focus on writing specifications of the system behavior instead. In BDD, we have a specification focus instead of a test focus. The name of a BDD test case should read as if it specifies a small piece of system behavior. For example, let’s say you’re writing software for a web shop (as in the Wikipedia article on BDD). One test case could be named “refunded items should be returned to stock” (or refundedItemsShouldBeReturnedToStock). The test code is there to clarify the details of the specification.
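In a JUnit-flavored sketch (the Stock class and the numbers are made up; in RSpec the same thing would be written as a describe/it block), the web-shop example might read like this, with the test name carrying the specification:

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class RefundSpec {

    static class Stock {
        private int count;
        Stock(int count) { this.count = count; }
        void returnItems(int n) { count += n; }
        int count() { return count; }
    }

    @Test
    public void refundedItemsShouldBeReturnedToStock() {
        Stock stock = new Stock(10);
        stock.returnItems(1);                 // a refund puts the item back
        assertEquals(11, stock.count());      // the name above is the specification
    }
}
```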

From a philosophical standpoint, the word “test” makes me think of testing something existing, while “specification” makes me think of specifying the behavior of something not yet built. It might help your thought process when decoupling the test cases (= the specification) from the underlying implementation. This ties into one of the drawbacks we often see with traditional unit testing. A recurring pattern is e.g. a ShippingOrder class with a corresponding ShippingOrderTests test class. This tightly couples the test code with the production code. Refactoring becomes painful as test classes might become obsolete if you decide to split a production code class in two.

Instead, Dave Astels suggests that in BDD, you should organize your test classes not around production code classes, but around test fixtures. Each test class has a test fixture, and that fixture captures a specific system setup. (In short, a fixture creates objects from your production code classes in a setUp() function and cleans everything up in a tearDown() function. The setUp() function is executed before each test case in the test class, and tearDown() is executed after each test case.) Thus, all tests in a test class run against the same system setup. For example, if several test cases revolve around returning or refunding items in your shop, create a RefundItemTests test suite, as in the sketch below.
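A minimal sketch of such a fixture-centered test class (Shop and Order are invented stand-ins, and the JUnit annotations replace the setUp()/tearDown() naming convention):

```java
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

public class RefundItemTests {

    static class Order {
        final String item;
        boolean refunded;
        Order(String item) { this.item = item; }
    }

    static class Shop {
        private final java.util.Map<String, Integer> stock = new java.util.HashMap<>();
        Order placeOrder(String item) { return new Order(item); }
        void refund(Order order) {
            order.refunded = true;
            stock.merge(order.item, 1, Integer::sum);
        }
        int stockCount(String item) { return stock.getOrDefault(item, 0); }
        void close() { /* release resources, close connections, ... */ }
    }

    private Shop shop;
    private Order order;

    @Before  // the fixture: the same system setup for every test in this class
    public void setUp() {
        shop = new Shop();
        order = shop.placeOrder("black sweater");
    }

    @After
    public void tearDown() {
        shop.close();
    }

    @Test
    public void refundedItemsShouldBeReturnedToStock() {
        shop.refund(order);
        assertEquals(1, shop.stockCount("black sweater"));
    }

    @Test
    public void refundedOrderShouldBeMarkedAsRefunded() {
        shop.refund(order);
        assertTrue(order.refunded);
    }
}
```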

Since system behavior is not restricted to a single method or class, the notion of “unit” in “unit testing” needs to be widened. Any reasonable subset of the code could participate in the test fixture (even the full system, as described in my post on edge-to-edge unit testing). I like that, since it allows you to find and test on the stable boundaries within your system, thus making the tests less brittle. This should be contrasted with having a one-to-one correspondence between production and test classes, which will be very brittle as the code evolves.

If you have good or bad experiences with BDD, or just random thoughts on unit testing, please drop a line below.

Object-Oriented Programming Lecture at KTH: Slides etc.

Thank you to everyone who participated in my lecture at KTH on October 26, 2011. I had a lot of fun, and we had some good discussions. For those of you who were not there, it was about object-oriented programming and how to write good code. We used some example code to discuss OO principles and testing. Here’s the presentation (pdf)!


JUnit Max Takes Test-Driven Development to the Next Level

Automatic compilation as you type is useful, but can we take it further? JUnit Max automates the execution of your unit tests.

In the good old days, you would build your software by running make from the command line. The output would tell you where the errors were, and it would be up to you to find the file and line to fix each error. Nowadays, most IDEs are somewhat better. Normally, you press a key to build, and the IDE then helps you navigate the errors by taking you to the file and line. This still involves a context switch: you have to stop writing code in order to press a key, and subsequently to look at the error messages. Do this 50 times per day, and the disruption of the context switches becomes noticeable.

Modern IDEs offer another improvement: they run the compiler automatically in the background for you. For example, Eclipse will compile the code while you’re typing and show the errors underlined in red, or as a red icon on files that fail to compile. This removes much of the context switch mentioned above and, combined with the immediate feedback, leads to fewer interruptions. This is an improvement.

One role of your compiler is actually that of a test tool. In a sense, it tests your code for a set of very specific problems, such as type mismatches, undefined functions and so on. It might even warn you about some null pointer issues that would normally surface during execution. The compiler knows a lot about your code, and this is also why it makes sense to turn on “treat warnings as errors”. Without it, warnings will go unnoticed, and you will never have a clean build.
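As a small, made-up illustration of why that matters: compiled with “javac -Xlint:unchecked -Werror”, the sloppy use of raw types below stops the build at compile time instead of surfacing as a ClassCastException far from the offending line.

```java
import java.util.ArrayList;
import java.util.List;

public class WarningsAsErrors {
    public static void main(String[] args) {
        List raw = new ArrayList();        // raw type
        raw.add("not a number");           // unchecked call: a warning, or a build error with -Werror
        List<Integer> numbers = raw;       // unchecked conversion: another warning
        Integer first = numbers.get(0);    // compiles, then throws ClassCastException at runtime
        System.out.println(first);
    }
}
```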

After the compiler tests, the next level of tests is the unit tests. Most IDEs allow you to run your unit tests at the press of a button. Again, you get the context switch described above. Furthermore, the results are often not as well integrated into the IDE as the compiler output. But if the code compiles automatically while we write it, why can’t the unit tests run automatically too, nice and clean and fully integrated?

This is probably the question Kent Beck asked himself when he came up with the idea of JUnit Max. JUnit Max is an Eclipse plugin designed to support test-driven development. When you save a file, JUnit Max executes the unit tests for you. Results are shown as red icons on failed tests. Since big projects can have thousands of unit tests, JUnit Max contains some extra logic for ordering the execution of the tests. First, it runs the fastest of your unit tests first, which gives you fast feedback. Second, it runs the tests that failed recently, since they are more likely to fail than tests that last failed a long time ago. Using JUnit Max for test-driven development is awesome. Just save your files and it will provide you with instant feedback. I use it when I do “acceptance unit test-driven development” (see post) and it really speeds things up.
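JUnit Max’s actual scheduling is its own business, but the ordering idea itself is easy to sketch (everything below is hypothetical, including how the two criteria are combined): prefer tests that failed recently, and among those, the fast ones, so the most informative feedback arrives first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TestPrioritizer {

    static class TestRecord {
        final String name;
        final long lastDurationMillis;
        final long millisSinceLastFailure;   // Long.MAX_VALUE if it never failed
        TestRecord(String name, long lastDurationMillis, long millisSinceLastFailure) {
            this.name = name;
            this.lastDurationMillis = lastDurationMillis;
            this.millisSinceLastFailure = millisSinceLastFailure;
        }
    }

    // Recently failed tests first, fast tests as a tie-breaker.
    static List<TestRecord> prioritize(List<TestRecord> tests) {
        List<TestRecord> ordered = new ArrayList<>(tests);
        ordered.sort(Comparator
                .comparingLong((TestRecord t) -> t.millisSinceLastFailure)
                .thenComparingLong(t -> t.lastDurationMillis));
        return ordered;
    }
}
```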

So, what is the next level of tests after unit tests? What is the next level of tests that would be beneficial to execute when you press “save”? How about running some functional acceptance tests automatically? It would probably require a full and successful build. The same prioritization between tests as above would be helpful: run the fast and the shaky tests first. Running acceptance tests automatically could work, and could be useful.

What’s next? Load tests? Full integration tests of a system? And after that, deploy automatically? Imagine, you change a single line of code, save, and minutes later, the changes are deployed live. Now that is what I call continuous deployment! Ok, I agree, deploying without committing to a source repository is probably not what you want. :) (I guess for deployment, doing it 50 times per day is cool enough.)

Duplication-Driven Development?

What is the primary force that drives you when doing test-driven development? Surprisingly, it might just be code duplication.

In the book “Test-Driven Development by Example“, Kent Beck uses an implementation of money as an example of TDD. At one point, a test case says “assert(someFunc(5) == 5 * 2)” and to satisfy the test case, someFunc is implemented simply by returning a hard-coded 10. “Now that is dirty” you say (and that’s what I said :). According to the book, the reason for hard-coding is that we want the test case to pass as soon as possible. After that, the refactoring phase can take over and we can clean up the code.

Here comes the interesting part: hard-coding is a terrible habit, right? And we don’t want our bad habits to show up in the code. This sign of bad taste should be reason enough to fix it, right? However, there is a more fundamental reason for getting rid of the constant: code duplication. The constant 2 * 5 = 10 occurs both in the test case and in the code. To remove the duplication, change someFunc(param) to return “param * 2” (later, the book also gets rid of the 2 by replacing it with a member variable). To sum up, instead of acting on “taste”, we use the well-proven design principle of (avoiding) code duplication.
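A compressed sketch of those two steps, with invented names (someFunc stands in for the money example’s multiplication):

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class SomeFuncTest {

    // Step 1: make the bar green as fast as possible -- even with a constant.
    // The 10 here and the 5 * 2 in the test are the same knowledge written twice.
    static int someFuncHardCoded(int param) {
        return 10;
    }

    // Step 2: remove the duplication; the book later turns the 2 into a member variable.
    static int someFunc(int param) {
        return param * 2;
    }

    @Test
    public void fiveTimesTwoIsTen() {
        assertEquals(5 * 2, someFunc(5));
    }
}
```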

A fascinating insight from the book is that the design of the code, too, can grow naturally from removing code duplication. For example, let’s say we have a class that encapsulates some algorithm. When another class with similar behavior is required, the book suggests we can accept some code duplication (even extensive copy-paste) to pass the tests. After getting the tests to pass, we refactor. Let’s say the difference between the two classes is that they use different algorithms to carry out their work. Apart from that, they are identical. To avoid code duplication, we combine the two classes. Quite naturally, this results in us isolating the algorithms behind an interface. By following the path of least resistance for removing code duplication, we have implemented the strategy pattern. To summarize, this suggests that code duplication helps us drive the high-level design of our code. If true, it is comforting that refactoring and removing code duplication naturally lead us to good design.
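A hedged sketch of where that refactoring ends up, with invented names: the two near-identical classes collapse into one, and the only part that differed, the algorithm, lands behind an interface.

```java
import java.util.Arrays;

public class ReportGenerator {

    // The part that differed between the two duplicated classes.
    interface SortAlgorithm {
        void sort(int[] values);
    }

    private final SortAlgorithm algorithm;

    ReportGenerator(SortAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    // Everything that was copy-pasted now lives here, exactly once.
    int[] generate(int[] rawValues) {
        int[] values = rawValues.clone();
        algorithm.sort(values);
        return values;
    }

    public static void main(String[] args) {
        // Swapping the strategy is now a one-liner.
        ReportGenerator generator = new ReportGenerator(Arrays::sort);
        System.out.println(Arrays.toString(generator.generate(new int[] {3, 1, 2})));
    }
}
```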

All in all, “Test-Driven Development by Example” is well written, easy to read and quite a good book, albeit perhaps more suitable for the aspiring test-driven designer than for the seasoned one.

Test-Driven Development Done Right

A couple of years ago, I had at least two misconceptions about Test-Driven Development: (1) you should write all your tests up-front and (2) after verifying that a test case can fail, you should make it pass right away. To better understand TDD, I got a copy of the book “Growing Object-Oriented Software Guided by Tests” by Freeman and Pryce (now one of my absolute favorites). Although the book does a great job explaining the concepts, it took me ten chapters to admit I had been wrong. Let’s never do that again. :)

Let me walk you through my misconceptions so that you don’t have to repeat my mistakes:

Misconception 1: write all tests up-front. Thinking about potential test cases up-front is not a bad thing. It exercises your imagination, and with some luck, many of them will still be applicable after the code is written. But don’t waste energy trying to compile an exhaustive list of tests. At least for me, this approach didn’t work, since my imagination appears too limited to come anywhere near the final list. But above all, I would like to get going writing some test code!

You are better off writing a few happy-path test cases. Filling in the test code will get you started working on the user interface of your classes. When the test code starts acting as a “user” of your interface, it will be obvious to you whether the API is okay or awkward to work with. The tests will drive you to improve your user interface. Creating the user interface will invariably make you think about error cases and how the API can be abused or misunderstood. You will come up with more test cases, and implement these. With some effort, but surprisingly little so, you will grow your test suite.

Misconception 2: fail the test, then make it pass right away. When you have written your test code, filled in the production code to get it all to compile and seen the test fail, it is very tempting to just fix things. Make the changes to have the test pass. Actually, you can do that. But there are at least two reasons not to.

First, I strongly prefer an incremental approach: I fix only the problem reported by the test! If the test says “null pointer exception”, I fix exactly that. Run the test again, and you will get another failure to fix. This is the convenient/lazy approach: you just let the test drive you. It also results in minimal increments, which is very helpful if another test case breaks while you are changing the production code.

Second, fixing the failed test case right away throws away a lot of information in the process. When the test fails, it provides you with valuable information about what went wrong. If you cannot immediately understand what the problem is, maybe you should improve your test or production code? For example, if you get a “null pointer exception”, maybe error handling or an assert earlier in the production code could ensure your program never gets into that kind of corrupted state. Alternatively, you could improve your test code with all kinds of helpful diagnostics. The idea is: if it takes time for you to understand what went wrong today, imagine how much time will be wasted when the same test case fails in six months. “Growing Object-Oriented Software Guided by Tests” says you should “let the tests talk to you”. There you go, you are test-driven.
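As a tiny, hypothetical example of that kind of diagnostic investment: the message on the assert below costs a few seconds now and explains the failure by itself when the test breaks half a year from now.

```java
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class CheckoutTest {

    static int totalPrice(int unitPrice, int quantity) {
        return unitPrice * quantity;
    }

    @Test
    public void totalIsUnitPriceTimesQuantity() {
        assertEquals("total for 3 items at 25 each should be unit price times quantity",
                75, totalPrice(25, 3));
    }
}
```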