Test Data Setup: Staying clean, DRY, and sane

by Seth Petry-Johnson 24. July 2012 18:27

There are many good reasons to avoid hitting a database in your tests. I agree with all of them, and I try my best to avoid doing it.

However, some tests do need to hit the database. Even the most dependency-injected and mock-infested system should hit the database when testing the data access layer... after all, what good is a test suite that doesn't test any of your actual data access logic? And if you're smart and follow the testing pyramid then you'll have some integration and acceptance tests that need a database as well.

In "Rules for Effective Data Tests" I mentioned some strategies for setting up those data tests. This post expands on those ideas and shows how to keep your setup code clean, DRY and maintainable.

What's so difficult about setting up a data test?

First, a definition. When I say "data setup" I'm talking about anything you do in the body of a test [or a setup method] to create the database records needed for a given test to execute.

While similar to the setup of a "true" unit test, interacting with a Real Life Database™ makes things a little more interesting. Some of the challenges we have to overcome are:

  • Test residue: Unless we delete it, data created by each test remains in the database when the test exits. At best this just wastes space; at worst, it starts to interfere with other tests. (See here for a common solution to this problem) 
  • Database constraints: Foreign key constraints are a real pain. When setting up test data you need to create the entire data graph to satisfy the database constraints, regardless of if those relationships are actually relevant to the test.  
  • Verbosity: Because of the foreign key issues mentioned above, setting up data tests requires more code than setting up a unit test. This makes tests harder to write, harder to maintain, and harder to keep DRY. 
  • False negatives: The more complex the setup, the greater the change that tests will fail not because your application logic is wrong, but because you screwed up the setup. 
  • Painful to debug: Debugging a data test is more difficult and time consuming than a unit test. Not only does the test take longer to run, but debugging it often means poking around in both the application debugger and a database tool.
A daunting list to be sure, but it's manageable.

Characteristics of good setup code

The primary contributor to the quality and maintainability of your data tests is the setup code; the easier it is for someone to understand the specific scenario you are creating, the better equipped they are to maintain that test.

Conversely, the harder the scenario is to understand and maintain, the less value that test will provide over time. Tests that contain an unintelligible jumble of setup code have a very real risk of being deleted (rather than fixed) if they ever break due to new code changes.

So what is "good" setup code? It should be: 

  • Highly expressive (high signal-to-noise ratio). Readers should be able to very quickly understand the scenario(s) you are creating without mentally parsing code. 
  • Highly reusable through the use of default values. If I just need to create a Person, let me call "CreatePerson()" and fill in the details for me. 
  • Easily customizable to each test's needs. Since the customized data are usually very relevant to the test at hand, it should be easy for a reader to spot them.  
  • Maintainable; databases change, and its not uncommon to add a new required field. The fewer changes you need to make to existing test code to support these changes the better.
These characteristics aren't specific to data tests, of course. They apply equally well to setup code of any kind.
So what happens when we apply these principles? Read on for specific suggestions...

Data Helpers: the Object Mother pattern for DB entities

The Object Mother pattern describes a special kind of factory class that encapsulates the instantiation of an object (or group of objects) in a specific state, usually mirroring a common scenario in the underlying domain. For instance, you might have an Object Mother that creates an Order object, adds some Order Items and marks it as Shipped. The goal is to turn a complex initialization process into a one-liner so that it is easier to read and maintain.

We can use this same approach in a data test, except that instead of constructing an object in code we need to create one or more records in the database. I call these classes "Data Helpers" and they generally:
  • Are static classes: These classes have no need to ever be mocked out, and making them static makes them easier to invoke in your tests. Omitting the need to instantiate them increases the signal-to-noise ratio and keeps setup code lean.
  • Follow a naming convention: It's important that other developers can discover and use your helpers, so follow an obvious naming convention. I recommend:
    • Put all Data Helpers in the same namespace
    • Name according to the primary entity being created. OrderHelper, CustomerHelper, etc.
  • Create a single "primary" entity: I find that Data Helpers are best focused around a single primary entity, such as an Order. It's fine if they create child or related data for the primary entity, but they should avoid creating a large number of collaborating entities. See below for how to use "scenario" objects for more complicated setups.
  • Treat performance as an important, but secondary, concern: Data Helpers provide their primary value by reducing the cost to create and maintain data tests, so whenever "speed of execution" and "ease of use" are at odds with each other, favor ease of use. That doesn't mean you shouldn't care about performance, and in fact you should care very much. Just not so much that you erode the overarching goal. You can easily offload the performance hit to the CI server.  (You do have a CI server, right?)
The methods exposed by a Data Helper class should:
  • Use optional parameters for as much as possible: A primary benefit of Data Helpers is dramatically increasing the signal to noise ratio within setup logic. Callers should only have to specify the specific values that are significant to their test; all other properties should be created using reasonable defaults.
  • Are semantic: Don't be afraid to create highly specialized methods, such as CreateOrderWithBackorderedItems(), which usually just delegate to a more general method with a specific combination of arguments. This can dramatically improve maintainability; if you add a new field to the database, and you can easily infer the correct default value based on the semantics of the method call, then you can implement that new field in the helper method without touching any of the existing tests.
  • Return the created entity: The caller probably needs to know about the data that was created, so return the entity object that you just created. 

Data Scenarios: a bunch of Object Mothers working together

Data Helpers are great when you need to create test data, especially if you want to specify a few key properties and use defaults for the rest.

But what if you want to create multiple related entities, or you want to reuse a specific setup in multiple tests? For instance, you need to create a Customer, with completed Orders in the past, and an in progress Order that's ready for checkout. In these cases, I create a special type of Data Helper that I call a "Data Scenario". 

Scenario objects have these characteristics:

  • Create a large or complex set of data: Just like Data Helpers reduce individual object setup to a one-liner, Scenarios reduce multiple object setup to a one-liner.
  • Model real-world scenarios: The whole point of a Scenario is to encapsulate realistic data patterns that might exist in production.
  • Expose a smaller set of configurable defaults: Scenarios tend to expose fewer arguments than Data Helpers because they are better suited to creating general purpose groups of data rather than highly-specific records.
  • Are often used in fixture-level setup: A common pattern is for a group of tests to share a Scenario object that is created in the test fixture's setup routine, and then provide test-specific adjustments to the Scenario via inline Data Helper calls. 
  • Are instantiated, not static: Scenario objects are NOT static methods of a helper class. Instead, they are objects that get instantiated and perform their data manipulations in the constructor. This allows Scenarios to be created, manipulated and passed around as needed.
  • Expose pointers to the interesting data: A Scenario object should contain public properties containing references to the entities it creates (or at least their IDs). This allows test code to further manipulate the Scenario data or to make assertions against it. 

Common objections to these techniques

Some of the specific objections that I've heard are:

  • It takes a lot of time/code to write/maintain helpers: Yes, on a complex system you'll end up with a decent amount of non-production code implementing these helpers. And yes, it requires an investment of time to get started. But I've been using these patterns for two years on a large application and I'm absolutely convinced the effort is justified. Once you get a decent library of helpers set up it becomes really, really easy to write tests... sometimes even easier than setting up expectations in a true unit test!
  • The tests take a long time to run: Yes, they do. You should do your best to avoid hitting the database except when necessary, and you should lean on your CI server to run the whole suite for you. If you can find a way to test the data access code without hitting the database, I'll eat my hat.
  • Its hard to know what helpers exist: True, if you're not the author of the helpers then they are harder to use. That's why it's so important to follow good naming conventions. You can also, you know, talk to your teammates if you create a new helper or wonder if one exists.
  • I don't wanna: If you don't care about testing the data access code, or you don't care about writing good tests, then I got nothin'. Go play in traffic.
Let's face it: data tests suck, but they are a necessary evil. The goal is to maximize their value while minimizing their cost, and that's what these techniques do.

Closing thoughts

In my experience it works best to think of Scenarios as the broad context in which a test will execute; they create all of the background data that is necessary for a test to run, but isn't very significant by itself. Data Helpers are used to create specific data records that are significant to a specific test. Used together, they create a very rich language for setting up your test data in an easy to write, easy to read, and easy to maintain form.

I've been using these techniques on a multi-year, multi-developer, multi-hundreds-of-thousands-LOC project and I am convinced that they are directly responsible for allowing us to maintain high test coverage on a very data-intensive app. 

Happy testing!  

Rules for effective data-driven tests

by Seth Petry-Johnson 17. December 2011 21:36

Tests that hit the database are slow, fragile, and difficult to automate. All the cool kids are using mocking and stubbing and in-memory databases to keep their unit tests isolated and fast, and that's awesome. I do that too (though my "cool kid" status is debatable). 

However, there are times when talking to a real database is necessary. Maybe you're testing actual data access logic, or maybe you're writing some high end integration/acceptance tests, or maybe you're just working in an architecture that doesn't let you mock/stub/inject your way to isolated bliss. If any of that sounds familiar, then this post is for you!

Below is a list of strategies and suggestions for effective data testing that I've collected from years of experience testing large, "enterprisey", data-driven applications. Data tests will never be painless, but following these rules makes it suck less.

Rules for good data tests

  1. Tests should create their own scenario data; never assume it already exists. Magic row IDs kill kittens!
  2. Make liberal use of data helper and scenario setup classes.
  3. Don't use your data access layer to test your data access layer.
  4. Tests should make no permanent changes to the database - leave no data behind!

Rule 1: Tests should create their own data

One of the worst things you can do in a data test is to assume that some record (a customer, an order, etc) exists that fulfills your scenario requirements. This is a cardinal sin for many reasons:

  1. It's extremely fragile; databases change over time, and tests that rely on pre-existing data often break (causing false-negative test failures).
  2. It obscures the test's purpose. A test's setup communicates the data context in which our assertions are valid. If you omit that setup logic, you make it hard for other programmers to understand the scenario that you are testing.
  3. It's not maintainable; other programmers won't know what makes customer ID 5 appropriate for one test and customer ID 7 appropriate for another. Once a test like this breaks, it tends to stay broken or get deleted.
In other words: relying on pre-existing data means your tests will break often, are painful to maintain when they do break, and don't clearly justify why another program should spend time fixing them.
The solution is simple: each test should create each and every test record it will rely on. If that sounds like a lot of work, it can be.... but keep reading to see how to keep it manageable.

Rule 2: Liberal use of data helper and scenario setup classes

Setting up the supporting data for a test sucks. It's time consuming and generally results in a lot of duplication, which then reduces test maintainability and readability. Test code is real code and should be kept DRY like anything else!

I've found it useful to create two different types of helper classes to attack this problem:

  • Data helpers are utility classes that expose methods for quickly creating entities in the system. These classes:
    • Are generally static, for convenience.
    • Exist in the test project, not the main data access project.
    • Create a single object (or object graph), such as a Customer or an Order with its OrderItem children.  
    • Create data with meaningful defaults, but allow the important fields to be explicitly specified where needed. (Optional parameters in .NET 4 FTW!)
  • Scenario objects (aka "fixtures") represent specific data scenarios that might apply to multiple tests, such as the scenario in which a Customer has placed an Order and one of the items is backordered. These classes:
    • Exist in the test project.
    • Have public properties that identify key data in the scenario (e.g. the Customer ID, Order ID, and backordered Item ID).
    • Are instantiated by a test, at which time the scenario data is created.
In short, data helpers are low-level utilities for creating a specific data record in a specific state, while scenario classes represent larger contexts consisting of multiple entities. I have found that while the time needed to create these objects is not trivial, it quickly pays off as new tests are easier and easier to write. 

Rule 3: Don't use your DAL to test your DAL

Tests for DAL code generally set up some data in the database, invoke the DAL, and then verify that the database was properly modified. I've generally found it difficult to use the primary DAL to quickly and concisely verify those assertions.

In some cases, the primary DAL may not expose a suitable API for doing record-level verification. For example, when there are significant differences between the logical schema exposed through the domain layer and the physical schema of the database, it may be impossible (or at least difficult) to write low-level data assertions.

In other cases, especially early in development, using the DAL to test the DAL creates dependency issues. For instance, many tests involve a sequence of events like "get entity by ID, save changes to it, then verify it was changed". If both the GetById() and Save() methods are currently under development then your test will give you inconclusive results until both methods are implemented. 

In all of these cases I've found it valuable to verify data assertions using a LINQ to SQL data context. This provides a convenient, object-based representation of the data schema that is perfectly suited for verifying row-level operations were performed properly. This data context lives in the test project and is automatically regenerated (using SQLMetal.exe) whenever the schema changes, so it's a close-to-zero-effort solution.

You could also use a micro ORM like Massive, or anything else that makes it quick and easy to interact directly with the database.


Rule 4: Tests make no permanent changes to the database

Tests should be run often, and if you follow Rule #1 your tests create a lot of new data when they run. If you don't clean up after them, your test database will quickly grow in size. Also, if you point your test suite at the same database you use to run your app, you'll quickly get tired of seeing that test data accumulate in your views.

The easiest way to prevent this is to wrap each test in a database transaction, and then rollback that transaction at the end of the test. This performs the desired cleanup and also isolates tests running in parallel from interfering with each other's data.

There are a few different ways to approach this. Depending on your needs, check out this or this.



None of these techniques are particularly clever or game changing, but when used together they can significantly improve your data tests:

  • When tests create their own scenario data, you don't need to run them against a particular known state. This reduces maintenance costs significantly.
  • Investing in data helpers and scenario classes makes it easy to add new tests. The easier it is to write tests, the more likely that developers will actually do it.
  • "Close to the metal" abstractions like LINQ to SQL make it easy to write row- and field-level assertions against the database.
  • Adding some "auto rollback" behavior to your data tests keeps your database trim and tidy, no matter how many times you run your test suite.
Happy data testing!

Seth Petry-Johnson

I'm a software architect and consultant for Heuristic Solutions.

I value clean code, malleable designs, short feedback cycles, usable interfaces and balance in all things.

I am a Pisces.

Month List