Software Hero Worship: July 2008

My friend Nigel Charman commented on Tests For Your Data with some good questions.

First the short answers.

Where possible the constraints are matched in the application. So the "double check" idea holds true. Occasionally this turns up bugs in the application. But more often it turns up bugs in manual edits of data.

With referential constraints, despite what some people seem to think, you need to define them in the database, regardless of whether they are also enforced at the application level. Exactly the same here. Double checks are useful, and applications are not infallible.

In my experience I'm running these only in production, and yes I regularly get production failures for them. (That's why I do them. ;-)

I probably shouldn't have hijacked the Continuous Integration metaphor for this idea. Basically this is a data management practice, and doesn't have much to do with the development cycle. However, it is a practice I am very passionate about. Just as a good test suite keeps my code healthy and vigorous, I feel that these data checks help me keep my production data healthy and clean.

Now for the meaty question: For what types of data is this approach suited?

Data from external sources. For data entered interactively, it's usually best to reject invalid data immediately at point of entry. For external-sourced data that are loaded in batch, this is not always the best way. Sometimes data are correlated with data loaded via another batch stream, in a separate transaction. There isn't always a good place to validate and reject bad data. In such cases, check views help catch the bad data.

Data in 3rd-party systems. We have applications for which we are not the developer. So we have little or no control over the application logic or database constraints. But with check views, at least we can identify data problems and work to fix them.

Multi-row conditions. The classic case here is checking for gaps and overlaps between multiple rows containing date ranges. In my work at Red Energy, we have many tables with bitemporal data (two time dimenions). It's quite hard to visualize these data simply by looking at the tabular form, so it's useful to have check queries to verify that the "shape" of the data is valid.

Aggregations. We have cases where different tables aggregate the same basic information by completely different keys. After having the IT manager complain to me twice about two reports (driven from two tables) not balancing, I added a check query to verify that the aggregations balance. The next time the problem happened, I was the first to know.

Parent-to-child relations. Foreign keys can enforce that every child has a parent. Sometimes you want to enforce that every parent has at least one child.

"Future" conditions. Sometimes you have "static" data that cover a range of time, such as calendar data or pricing. You enter data for the next three years, and then start running your system. A carefully-written check view can remind you when it's time to update for the next three years.

Believe it or not, we also have some check views on the check views. There is one that warns if any view (or PL/SQL package) in the database contains errors. There is another that verifies that every check view has a comment.

These days automated tests for your code are standard practice in any professional IT shop. There are a variety of automated testing tools in use, from JUnit and TestNG that can run unit tests and integration tests, through Behavior Driven Design, FITness, and others.

I propose we should have tests for our data too.

Code makes a lot of assumptions about the data it works on. Many of these assumptions can be enforced using constraints in the database itself:

A PRIMARY KEY constraint defines a unique key for the table.
A UNIQUE constraint identifies an alternative candidate key, which also must be unique.
A FOREIGN KEY constraint defines a relationship to a parent table, and is used to enforce referential integrity.
A CHECK constraint can be used to check arbitrary conditions on the values in a row.

In addition to these constraints, you can also use triggers to check more complex conditions, perhaps involving multiple rows.

Despite all of these, there are many cases where constraints are too awkward or inefficient. Particularly when conditions span multiple rows, database constraints and triggers are not very good for enforcing them.

Here's an example. Suppose we have a table billing_period:

BILLING_PERIOD ACTUAL_START ACTUAL_END
200825 2008-06-04 2008-06-10
200826 2008-06-11 2008-06-17
200827 2008-06-18 2008-06-24
etc

The billing_period table is supposed to contain weekly billing periods, along with the dates belonging to them. Each billing period is supposed to be exactly seven days long. There should be no overlaps or gaps, either. How would we enforce these conditions using constraints or triggers?

You have probably written hundreds of queries to test conditions like this about the database. How about making those queries into a test suite for your production data?

Start with a view like this:

CREATE VIEW chk_billing_period_7_days_long AS
SELECT *
FROM billing_period
WHERE actual_start - actual_end <> 6

This view returns a row for any billing period which is not seven days (actually six days) from its start to its end.

Here's another one, to check for overlaps:

CREATE VIEW chk_billing_period_overlaps AS
SELECT *
FROM billing_period bp1
WHERE EXISTS (
SELECT *
FROM billing_period bp2
WHERE bp2.actual_start BETWEEN bp1.actual_start AND bp1.actual_end
OR bp2.actual_end BETWEEN bp1.actual_start AND bp1.actual_end
)

Finally, to check for gaps:

CREATE VIEW chk_billing_period_gaps AS
SELECT *
FROM billing_period bp1
WHERE EXISTS (
SELECT *
FROM billing_period bp2
WHERE bp2.actual_start > bp2.actual_end
)
AND NOT EXISTS (
SELECT *
FROM billing_period bp3
WHERE bp3.actual_start = bp2.actual_end + 1
)

None of these views should ever return any results. If any of them does, we have a data integrity problem. The problem may cause our application's views or code to fail, because of violated assumptions.

Because we named all the views according to a convention (they all start with chk_) we can easily write a program that iterates over these views and tests them all. This program could be scheduled to run every day. It could email us results from any check view that returns data.

If our database supports it, we can add descriptive comments to the views, such as:

COMMENT ON TABLE chk_billing_period_overlaps IS
'Overlap between two or more billing periods'

This comment would make a nice subject line for an email message.

It's easy to add more check views: just define a view beginning with
chk_.

I've gotten into the habit, when I'm designing application code or view logic, to think about the assumptions. If an assumption can be reasonably enforced with a database constraint, I will add a constraint. Otherwise, I write a check view for the assumption and add the check view to the database. Also, just as when I find a bug in my application code, I write a unit test to expose it, so I also write check views to expose data bugs I find in the database.

A scheduled job runs every check view every day, and emails data problems to the team. The views comprise a test suite for our data. The scheduled job gives us continuous integration of sorts. We are alerted to problems virtually as soon as they happen. (Well, the next day.)

We've been running this system at Red Energy for a couple of years now. On one application, featuring about 150 tables, we have a little over 100 check views. I think we should have a lot more. Even so, this system has allowed us to maintain a very high level of data integrity.

Software Hero Worship

Monday, July 28, 2008

Tests For Your Data 2: When to Use

Saturday, July 26, 2008

Tests For Your Data

About Me