Saturday, July 26, 2008

Tests For Your Data

These days automated tests for your code are standard practice in any professional IT shop. There are a variety of automated testing tools in use, from JUnit and TestNG for unit and integration tests, through Behavior Driven Development tools, FitNesse, and others.

I propose we should have tests for our data too.

Code makes a lot of assumptions about the data it works on. Many of these assumptions can be enforced using constraints in the database itself (a sketch follows this list):

  • A PRIMARY KEY constraint defines a unique key for the table.
  • A UNIQUE constraint identifies an alternative candidate key, which also must be unique.
  • A FOREIGN KEY constraint defines a relationship to a parent table, and is used to enforce referential integrity.
  • A CHECK constraint can be used to check arbitrary conditions on the values in a row.
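For example, a pair of tables might declare all four kinds of constraint like this (a hedged sketch in Oracle-style SQL; the customer and invoice tables and their columns are made up purely for illustration):

CREATE TABLE customer (
  customer_id   NUMBER        NOT NULL,
  email_address VARCHAR2(200) NOT NULL,
  -- PRIMARY KEY: the unique identifier for each row
  CONSTRAINT customer_pk PRIMARY KEY (customer_id),
  -- UNIQUE: an alternative candidate key, which must also be unique
  CONSTRAINT customer_email_uk UNIQUE (email_address)
);

CREATE TABLE invoice (
  invoice_id  NUMBER      NOT NULL,
  customer_id NUMBER      NOT NULL,
  amount      NUMBER(9,2) NOT NULL,
  CONSTRAINT invoice_pk PRIMARY KEY (invoice_id),
  -- FOREIGN KEY: referential integrity back to the parent table
  CONSTRAINT invoice_customer_fk FOREIGN KEY (customer_id)
    REFERENCES customer (customer_id),
  -- CHECK: an arbitrary condition on the values in a row
  CONSTRAINT invoice_amount_chk CHECK (amount >= 0)
);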

In addition to these constraints, you can also use triggers to check more complex conditions, perhaps involving multiple rows.

Despite all of these, there are many cases where constraints are too awkward or inefficient. Particularly when conditions span multiple rows, database constraints and triggers are not very good for enforcing them.

Here's an example. Suppose we have a table billing_period:

BILLING_PERIOD  ACTUAL_START  ACTUAL_END
200825          2008-06-04    2008-06-10
200826          2008-06-11    2008-06-17
200827          2008-06-18    2008-06-24
etc.


The billing_period table is supposed to contain weekly billing periods, along with the dates belonging to them. Each billing period is supposed to be exactly seven days long. There should be no overlaps or gaps, either. How would we enforce these conditions using constraints or triggers?
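The seven-day rule, at least, is a single-row condition, so it could in principle be declared right on the table (a sketch, assuming Oracle-style date arithmetic; the constraint name is invented):

ALTER TABLE billing_period
ADD CONSTRAINT billing_period_7_days_chk
CHECK (actual_end - actual_start = 6)

The overlap and gap rules, though, involve comparing pairs of rows, and that is exactly where declarative constraints and triggers become awkward.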

You have probably written hundreds of queries to test conditions like this about the database. How about making those queries into a test suite for your production data?

Start with a view like this:

CREATE VIEW chk_billing_period_7_days_long AS
SELECT *
FROM billing_period
WHERE actual_end - actual_start <> 6

This view returns a row for any billing period whose end is not exactly six days after its start (that is, any period that is not seven days long).

Here's another one, to check for overlaps:

CREATE VIEW chk_billing_period_overlaps AS
SELECT *
FROM billing_period bp1
WHERE EXISTS (
  -- a different billing period whose start or end falls inside bp1's range
  SELECT *
  FROM billing_period bp2
  WHERE bp2.billing_period <> bp1.billing_period
  AND (bp2.actual_start BETWEEN bp1.actual_start AND bp1.actual_end
    OR bp2.actual_end BETWEEN bp1.actual_start AND bp1.actual_end)
)

Finally, to check for gaps:

CREATE VIEW chk_billing_period_gaps AS
SELECT *
FROM billing_period bp1
-- some later billing period exists...
WHERE EXISTS (
  SELECT *
  FROM billing_period bp2
  WHERE bp2.actual_start > bp1.actual_end
)
-- ...but no billing period starts on the day after this one ends
AND NOT EXISTS (
  SELECT *
  FROM billing_period bp3
  WHERE bp3.actual_start = bp1.actual_end + 1
)

None of these views should ever return any results. If any of them does, we have a data integrity problem. The problem may cause our application's views or code to fail, because of violated assumptions.

Because we named all the views according to a convention (they all start with chk_) we can easily write a program that iterates over these views and tests them all. This program could be scheduled to run every day. It could email us results from any check view that returns data.
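Here is a minimal sketch of such a driver, as an anonymous PL/SQL block, assuming an Oracle-style data dictionary; a real version would send email rather than just print, and the exact mechanics will differ on other databases:

-- Find every view named chk_* and report those that return any rows
DECLARE
  v_count INTEGER;
BEGIN
  FOR v IN (SELECT view_name
            FROM user_views
            WHERE view_name LIKE 'CHK\_%' ESCAPE '\') LOOP
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || v.view_name INTO v_count;
    IF v_count > 0 THEN
      DBMS_OUTPUT.PUT_LINE(v.view_name || ': ' || v_count || ' row(s)');
    END IF;
  END LOOP;
END;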

If our database supports it, we can add descriptive comments to the views, such as:

COMMENT ON TABLE chk_billing_period_overlaps IS
'Overlap between two or more billing periods'

This comment would make a nice subject line for an email message.
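The driver can pick the comment up from the data dictionary at run time; in Oracle, for example, view comments show up in USER_TAB_COMMENTS (a sketch):

-- Fetch the descriptive comment for a given check view
SELECT comments
FROM user_tab_comments
WHERE table_name = 'CHK_BILLING_PERIOD_OVERLAPS'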

It's easy to add more check views: just define a view whose name begins with chk_.

I've gotten into the habit, when I'm designing application code or view logic, of thinking about the assumptions it makes. If an assumption can reasonably be enforced with a database constraint, I add a constraint. Otherwise, I write a check view for the assumption and add it to the database. And just as I write a unit test to expose any bug I find in my application code, I write a check view to expose any data bug I find in the database.

A scheduled job runs every check view every day, and emails data problems to the team. The views comprise a test suite for our data. The scheduled job gives us continuous integration of sorts. We are alerted to problems virtually as soon as they happen. (Well, the next day.)
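The scheduling can be done with an external cron job or inside the database; with Oracle's DBMS_SCHEDULER, for instance, it might look something like this (a sketch, assuming the driver above has been wrapped in a hypothetical stored procedure called run_check_views):

-- Run the (hypothetical) run_check_views procedure every morning at 6am
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'check_view_job',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'run_check_views',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY;BYHOUR=6',
    enabled         => TRUE);
END;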

We've been running this system at Red Energy for a couple of years now. On one application, featuring about 150 tables, we have a little over 100 check views. I think we should have a lot more. Even so, this system has allowed us to maintain a very high level of data integrity.

2 comments:

Nigel said...

Hi John

Interesting article. I've got a few questions to help me understand the circumstances in which you would recommend this approach.

Are there any types of data that you have found this approach to be particularly good or bad for?

Do you always have matching constraints in your application, so this is really being used as a double check? Or is this used in cases where there aren't application checks (e.g. manual load)?

Do you run this in all your environments? In which environment do you find the most failures? Do you get production failures for these tests?

Considering the Continuous Integration analogy, this approach is at the "nightly build" end of the CI spectrum. Are these views used in a more continual manner during development? These views could be useful as assertions in your integration and system tests, rapidly identifying any code that is violating the constraints.

Cheers
Nigel

John Hurst said...

Nigel,

Those are great questions. I wanted to address some of them in the post, but didn't want it to be too long. (It's already too long.)

Since I can't answer the questions effectively in a short comment, I've written a new post!

JH