Monday, July 28, 2008

Tests For Your Data 2: When to Use

My friend Nigel Charman commented on Tests For Your Data with some good questions.

First the short answers.

Where possible the constraints are matched in the application. So the "double check" idea holds true. Occasionally this turns up bugs in the application. But more often it turns up bugs in manual edits of data.

With referential constraints, despite what some people seem to think, you need to define them in the database, regardless of whether they are also enforced at the application level. Exactly the same here. Double checks are useful, and applications are not infallible.

In my experience I'm running these only in production, and yes I regularly get production failures for them. (That's why I do them. ;-)

I probably shouldn't have hijacked the Continuous Integration metaphor for this idea. Basically this is a data management practice, and doesn't have much to do with the development cycle. However, it is a practice I am very passionate about. Just as a good test suite keeps my code healthy and vigorous, I feel that these data checks help me keep my production data healthy and clean.

Now for the meaty question: For what types of data is this approach suited?

Data from external sources. For data entered interactively, it's usually best to reject invalid data immediately at point of entry. For external-sourced data that are loaded in batch, this is not always the best way. Sometimes data are correlated with data loaded via another batch stream, in a separate transaction. There isn't always a good place to validate and reject bad data. In such cases, check views help catch the bad data.

Data in 3rd-party systems. We have applications for which we are not the developer. So we have little or no control over the application logic or database constraints. But with check views, at least we can identify data problems and work to fix them.

Multi-row conditions. The classic case here is checking for gaps and overlaps between multiple rows containing date ranges. In my work at Red Energy, we have many tables with bitemporal data (two time dimenions). It's quite hard to visualize these data simply by looking at the tabular form, so it's useful to have check queries to verify that the "shape" of the data is valid.

Aggregations. We have cases where different tables aggregate the same basic information by completely different keys. After having the IT manager complain to me twice about two reports (driven from two tables) not balancing, I added a check query to verify that the aggregations balance. The next time the problem happened, I was the first to know.

Parent-to-child relations. Foreign keys can enforce that every child has a parent. Sometimes you want to enforce that every parent has at least one child.

"Future" conditions. Sometimes you have "static" data that cover a range of time, such as calendar data or pricing. You enter data for the next three years, and then start running your system. A carefully-written check view can remind you when it's time to update for the next three years.

Believe it or not, we also have some check views on the check views. There is one that warns if any view (or PL/SQL package) in the database contains errors. There is another that verifies that every check view has a comment.

No comments: