
a test bestiary, part 2

Written by Thomas Henderson
Published on 16 October 2017

In my last post on testing, I described testing in the small: the smoke test (checking if we’ve set up our testing framework correctly), and unit tests (testing components of our system). Now I’ll address testing in the large, starting with a correction to a misunderstanding I had about functional testing.

Functional tests: Testing from the outside, in

Last time, I mistook the function in “functional testing” for a mathematical function: a deterministic relation between inputs and outputs, i.e., given the same inputs you will always get the same output. I thought this meant that functional tests were for stateless methods. Instead, functional testing refers to the practice of writing tests that touch your system in the same way it will be used in production. The “function” in the name refers to how this style of test treats your system as a “black box” that takes inputs to outputs, without your tests being aware of the details of how that transformation takes place.

As I learned from writing Game of Life wrong in three different ways, it’s entirely possible to have tests that pass at the component level (unit tests), and still have a failing application. My Cell and World classes passed the life and death rules, but when I would test well-known Game of Life patterns like the Blinker or the Toad, they wouldn’t behave correctly as the system evolved through time.
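To make the contrast concrete, here is roughly what an outside-in check for that failure could look like, sketched in Python. The World constructor, its tick method, and its alive_cells set are hypothetical names rather than my actual implementation; what matters is that the test only observes the system from the outside.

    # A functional test for the Blinker, sketched in Python.
    # `World`, `tick`, and `alive_cells` are hypothetical names; the test
    # never peeks at Cell internals, only at the outer behavior.
    def test_blinker_oscillates():
        world = World(alive_cells={(1, 0), (1, 1), (1, 2)})   # vertical bar
        world.tick()
        assert world.alive_cells == {(0, 1), (1, 1), (2, 1)}  # horizontal bar
        world.tick()
        assert world.alive_cells == {(1, 0), (1, 1), (1, 2)}  # back again

A test like this would have caught my bug even while every Cell-level unit test stayed green.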

When I moved on to Tic-Tac-Toe, one of my mentors advised me to write “outside-in tests” instead. The idea is to write a test that calls an outermost message of your system’s interface, one that would pass if your application did the right thing. Since I was writing a game, my first test checked whether my system accepted a play message. Then I could test things that ought to be true about a started game: there should be a current player, that player should be player one, and the game’s board should be empty.
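In sketch form (the Python Game class and method names below are hypothetical stand-ins for my actual interface), those first tests might look something like this:

    # Outside-in tests for a just-started game, sketched in Python.
    # `Game`, `play`, `current_player`, and `board` are hypothetical names;
    # only the outermost interface is exercised.
    def test_game_accepts_a_play_message():
        game = Game()
        game.play(4)  # should be accepted without raising

    def test_new_game_has_player_one_up_and_an_empty_board():
        game = Game()
        assert game.current_player == "Player 1"
        assert all(square is None for square in game.board)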

Driving your code with functional tests is a bit like wish-driven development. You haven’t written any code yet, so these things won’t be true. But the behavior you fill in with your code is behavior that’s correct from the point of view of the entire system. This keeps you from zooming in too closely and over-defining your components with fiddly tests before you know enough about the system as a whole.

You can read more about my experience using functional testing and delaying the breakdown of Tic-Tac-Toe into classes here. As I develop as a crafter, I want to keep using functional testing and deferring design decisions until they’re necessary; but I also want to get better at test-driving the breakdown of an overly large, but functional, block of code into components, so that design comes after functionality.

“Typically people think about design, and they say design is making this big, involved plan that’s going to solve every issue that the system is supposed to address. But I think that’s not what you do when you’re trying to get a design that’s going to survive and live over time. I think, instead, what you want to do is break things apart in such a way that they can be put back together.” — Rich Hickey, “Design, Composition and Performance”

Acceptance tests: Agreeing on behavior

A great solution to the wrong problem is no solution at all. One testing strategy for preventing customers and developers from misunderstanding one another so badly that the wrong product gets delivered is a suite of acceptance tests. These are tests that act as a kind of contract between the customer and the development team. They are very similar to functional tests, in that they have the same outside-in, black-box shape. What distinguishes them from ordinary functional tests is the level of customer involvement. From extremeprogramming.org: “Customers are responsible for verifying the correctness of the acceptance tests and reviewing test scores to decide which failed tests are of highest priority.”

Last week, I worked with another apprentice on the bowling score kata. We had a lot of trouble getting our scoring algorithm to handle a perfect game. It wasn’t until an experienced bowler overheard us, noticed a subtle gap between our algorithm and how a bowling game is actually scored, and corrected us that we realized the problem wasn’t our programming ability but our understanding of bowling itself. Acceptance tests exist as a formal method to prevent this kind of domain misunderstanding.
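The acceptance test for that rule can be as small as a single assertion, so long as the expected number comes from the people who actually know bowling. Here is a sketch in Python; BowlingGame, roll, and score are hypothetical names from the kata, not a real library:

    # An acceptance-style test for the bowling kata, sketched in Python.
    # `BowlingGame`, `roll`, and `score` are hypothetical names; the value
    # 300 is the part the customer (here, an experienced bowler) signs off on.
    def test_a_perfect_game_scores_300():
        game = BowlingGame()
        for _ in range(12):  # one strike per frame, plus two bonus rolls in the tenth
            game.roll(10)
        assert game.score() == 300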

The Chicago apprentices recently attended a zagaku on ubiquitous language and domain-driven design, which has similar aims to the notion of acceptance testing. This is the discipline of having developers work closely with the users of the software in order to learn their language for the domain. The theory goes that if this common language is pursued with sufficient rigor, the problem domain and the software that grows to model that domain will approach isomorphism — mathspeak for “the same in every way that matters.” The developers can use additional abstractions for code re-use and performance, but these abstractions will serve the true needs of the users. I am really looking forward to learning more about this, and the degree to which the practice is harmonious with the more formal, more automatable acceptance tests.

Characterization tests: Pinning down current behavior

Just because software should have tests doesn’t mean that it will. There may be few or no tests. Tests may be numerous but low-value: highly coupled to the implementation, breaking as soon as someone tries to improve the design. A code base may be full of sinful hacks done under pressure; they may be clever hacks, but cleverness can be obscure, hard for a developer who’s just joining the project to understand, or even for the original author a few weeks or months later.

When code seems to work only by dark magic, when the prospect of changing it is terrifying, and the only comments present may as well read # This function is full of spiders, it might be time for characterization tests.

I learned about characterization tests from two excellent talks by Katrina Owen, and from sitting in on one of 8th Light’s Weyrich Institute sessions.

I first learned of characterization tests from Katrina Owen’s talk, 467 tests, 0 failures, 0 confidence. Owen considers the case of an open source project with abundant but low-value tests. These tests are freezing the design solid, coupling the code to a swarm of tiny tests. She deletes them and writes new outside-in tests in order to pin down the behavior of the project. Since they’re outside-in, they are a type of functional test. But the code already exists, and she seeks to pin down the behavior while freeing future maintainers to refactor and redesign more aggressively. It’s the testing-after that makes this a characterization test approach: she is characterizing the current behavior. It’s a good talk, and Owen frequently references Sandi Metz’s great Magic Tricks of Testing talk. I’m going to have to watch it again, because I felt it really drove home the why of Metz’s strategic recommendations.

I liked Therapeutic Refactoring even more. Owen tackles a big block of heavy, obscure code. She knows that it “works” in the sense that it is being used in production without functional complaints, but it is definitely full of spiders and it cannot easily be understood, much less changed. Her technique is to write a test that takes a reasonable, domain-appropriate input and expects the code to return the string "something". Of course, it doesn’t return "something", but the errors that her test suite returns guide her toward a genuine, functional, passing test. This puts her in the green, giving her the power to refactor furiously. She chops the overlong method into bite-sized pieces, improving the names as she learns how the different pieces contribute to the overall behavior of the function. When it makes sense, she can add unit tests for her broken-out components. In Weyrich, I learned that output you use as a definition of “correct” behavior is sometimes called a “golden master,” and I got a chance to try this technique of test-driven refactoring against known correct output.
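A rough sketch of that first step in Python might look like the following; generate_report and its arguments are hypothetical stand-ins for whatever spider-filled code you are trying to pin down:

    # A characterization test, sketched in Python. `generate_report` is a
    # hypothetical stand-in for the legacy code under test. The expectation
    # starts out deliberately wrong ("something"); the failure message reveals
    # what the code actually returns, and that real output (the golden master)
    # then replaces "something" so the test goes green.
    def test_characterizes_current_report_output():
        result = generate_report(customer_id=42, year=2016)
        assert result == "something"

Once the test is green against the real output, the furious refactoring can begin.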

I’m interested in this kind of test because I like the idea of capturing the behavior of unfamiliar code and refactoring it without fear. It sounds like a great technique for contributing to open source projects, raising the value of tests so that more contributors can make fearless changes. I also hear that, with some frequency, consultants like 8th Light Crafters don’t get called in because everything is going great. One crafter went so far as to advise us, “Assume that everything is on fire.” Functional testing techniques like acceptance tests (ensuring we’re solving the right problem) and characterization tests (rescuing troublesome but functional legacy code from bankruptcy and total redesign) sound like effective tools for software firefighting.