Tuesday, September 16, 2008

Linux on new Laptop

I bought a new laptop a couple of weeks ago, because Zena's old hand-me-down IBM R50p's screen bit the dust. So she gets my IBM/Lenovo Z61p, and I bought a Sony VGN-Z17GN. Wow, Sony have long model numbers.

The thing that sold me on the Sony was the high spec in the small (and elegant) package. It has 4GB RAM, a 320GB SATA hard drive, a dual-core processor and a 1600x900 screen. But the machine is tiny, and weighs only 1.5kg. In a small laptop bag and with the power supply, it's lighter than my previous laptop's backpack, empty.

Like most people, I don't care for Vista. So I actually applied the XP Pro upgrade option before I even got the machine home. But the OEM XP Pro disk did not include the machine's specific drivers. So I had to download a ZIP bundle from Sony which was supposed to include all of the drivers. It did include 20 of them, which meant clicking through an installer about 20 times, accepting whatever inane licence agreement, and rebooting. Sony didn't include the ethernet driver in their bundle, so I hunted that down on the Intel web site. That in turn required MSXML, so I had to find that and install it too. In the end, a fairly typical experience of installing Windows: mind-numbingly tedious.

After I finally got XP working, more or less, I popped in a freshly-burned Ubuntu disk and began the Linux install. Wow, what a difference. I had actually been a little worried about how hard it would be to get the video and ethernet working with Linux, since the machine is quite new on the market. (I have the first one sold in New Zealand.) Well, in a few minutes, with perhaps one or two reboots, I had Ubuntu installed, and everything just works. Everything I care about, anyway. I have no idea whether the fingerprint reader is supported in Linux, but I don't care. To be honest, the wireless network doesn't work yet, but apparently it is supported directly in the next Ubuntu version, due out next month, so I'll just wait for that.

Once I restored my home directory from a backup, all my desktop and configuration settings were ready to go on the new machine. No registry hacking, no special software for migrating settings. It's funny how with Windows, a lot of the "features" are workarounds for problems that don't exist in other operating systems.

Monday, July 28, 2008

Tests For Your Data 2: When to Use

My friend Nigel Charman commented on Tests For Your Data with some good questions.

First the short answers.

Where possible, the same constraints are also enforced in the application. So the "double check" idea holds true. Occasionally this turns up bugs in the application. But more often it turns up bugs in manual edits of data.

With referential constraints, despite what some people seem to think, you need to define them in the database, regardless of whether they are also enforced at the application level. Exactly the same here. Double checks are useful, and applications are not infallible.

In my experience I run these only in production, and yes, they regularly turn up failures there. (That's why I do them. ;-)

I probably shouldn't have hijacked the Continuous Integration metaphor for this idea. Basically this is a data management practice, and doesn't have much to do with the development cycle. However, it is a practice I am very passionate about. Just as a good test suite keeps my code healthy and vigorous, I feel that these data checks help me keep my production data healthy and clean.

Now for the meaty question: For what types of data is this approach suited?

Data from external sources. For data entered interactively, it's usually best to reject invalid data immediately at the point of entry. For externally-sourced data loaded in batch, that isn't always practical. Sometimes data are correlated with data loaded via another batch stream, in a separate transaction. There isn't always a good place to validate and reject bad data. In such cases, check views help catch the bad data.

Data in 3rd-party systems. We have applications for which we are not the developer. So we have little or no control over the application logic or database constraints. But with check views, at least we can identify data problems and work to fix them.

Multi-row conditions. The classic case here is checking for gaps and overlaps between multiple rows containing date ranges. In my work at Red Energy, we have many tables with bitemporal data (two time dimensions). It's quite hard to visualize these data simply by looking at the tabular form, so it's useful to have check queries to verify that the "shape" of the data is valid.
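
As a rough sketch of the pattern, here is what an overlap check might look like for a hypothetical effective-dated price table (the table and column names are invented for illustration, not from one of our real schemas):

CREATE VIEW chk_price_overlaps AS
SELECT *
FROM price p1
WHERE EXISTS (
  SELECT *
  FROM price p2
  WHERE p2.product_id = p1.product_id
  AND p2.price_id <> p1.price_id
  AND p2.valid_from <= p1.valid_to
  AND p2.valid_to >= p1.valid_from
)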

Aggregations. We have cases where different tables aggregate the same basic information by completely different keys. After having the IT manager complain to me twice about two reports (driven from two tables) not balancing, I added a check query to verify that the aggregations balance. The next time the problem happened, I was the first to know.
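
A balance check like that can be as simple as comparing two totals. A minimal sketch, with invented table and column names, of a view that returns a row only when the two aggregates disagree:

CREATE VIEW chk_sales_aggregates_balance AS
SELECT by_region.total AS total_by_region,
       by_product.total AS total_by_product
FROM (SELECT SUM(amount) AS total FROM sales_by_region) by_region,
     (SELECT SUM(amount) AS total FROM sales_by_product) by_product
WHERE by_region.total <> by_product.total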

Parent-to-child relations. Foreign keys can enforce that every child has a parent. Sometimes you want to enforce that every parent has at least one child.
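
For example (again with invented table names), a "parent without children" check might look like this:

CREATE VIEW chk_orders_without_lines AS
SELECT *
FROM orders o
WHERE NOT EXISTS (
  SELECT *
  FROM order_lines ol
  WHERE ol.order_id = o.order_id
)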

"Future" conditions. Sometimes you have "static" data that cover a range of time, such as calendar data or pricing. You enter data for the next three years, and then start running your system. A carefully-written check view can remind you when it's time to update for the next three years.

Believe it or not, we also have some check views on the check views. There is one that warns if any view (or PL/SQL package) in the database contains errors. There is another that verifies that every check view has a comment.
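
On Oracle (where those PL/SQL packages live), a sketch of these two meta-checks against the data dictionary might look like this:

CREATE VIEW chk_invalid_objects AS
SELECT object_name, object_type
FROM user_objects
WHERE status = 'INVALID'
AND object_type IN ('VIEW', 'PACKAGE', 'PACKAGE BODY');

CREATE VIEW chk_check_views_without_comments AS
SELECT view_name
FROM user_views v
WHERE view_name LIKE 'CHK\_%' ESCAPE '\'
AND NOT EXISTS (
  SELECT *
  FROM user_tab_comments c
  WHERE c.table_name = v.view_name
  AND c.comments IS NOT NULL
);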

Saturday, July 26, 2008

Tests For Your Data

These days automated tests for your code are standard practice in any professional IT shop. There are a variety of automated testing tools in use, from JUnit and TestNG, which can run unit tests and integration tests, through Behavior-Driven Development tools, FitNesse, and others.

I propose we should have tests for our data too.

Code makes a lot of assumptions about the data it works on. Many of these assumptions can be enforced using constraints in the database itself:

  • A PRIMARY KEY constraint defines a unique key for the table.
  • A UNIQUE constraint identifies an alternative candidate key, which also must be unique.
  • A FOREIGN KEY constraint defines a relationship to a parent table, and is used to enforce referential integrity.
  • A CHECK constraint can be used to check arbitrary conditions on the values in a row.

In addition to these constraints, you can also use triggers to check more complex conditions, perhaps involving multiple rows.
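
As a reminder of what those declarations look like, here is a hypothetical table (names and types invented, Oracle-flavoured syntax) using all four kinds of constraint:

CREATE TABLE customer_order (
  order_id    NUMBER PRIMARY KEY,                                 -- unique key for the table
  order_ref   VARCHAR2(20) UNIQUE NOT NULL,                       -- alternative candidate key
  customer_id NUMBER NOT NULL REFERENCES customer (customer_id),  -- foreign key to a parent table
  order_date  DATE NOT NULL,
  amount      NUMBER(12,2) CHECK (amount >= 0)                    -- arbitrary condition on the row
)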

Despite all of these, there are many cases where constraints are too awkward or inefficient. Particularly when conditions span multiple rows, database constraints and triggers are not very good for enforcing them.

Here's an example. Suppose we have a table billing_period:

BILLING_PERIOD  ACTUAL_START  ACTUAL_END
200825          2008-06-04    2008-06-10
200826          2008-06-11    2008-06-17
200827          2008-06-18    2008-06-24
etc.


The billing_period table is supposed to contain weekly billing periods, along with the dates belonging to them. Each billing period is supposed to be exactly seven days long. There should be no overlaps or gaps, either. How would we enforce these conditions using constraints or triggers?

You have probably written hundreds of queries to test conditions like this about the database. How about making those queries into a test suite for your production data?

Start with a view like this:

CREATE VIEW chk_billing_period_7_days_long AS
SELECT *
FROM billing_period
WHERE actual_end - actual_start <> 6

This view returns a row for any billing period whose end date is not exactly six days after its start date, in other words any period that is not exactly seven days long.

Here's another one, to check for overlaps:

CREATE VIEW chk_billing_period_overlaps AS
SELECT *
FROM billing_period bp1
WHERE EXISTS (
SELECT *
FROM billing_period bp2
WHERE bp2.billing_period <> bp1.billing_period
AND (bp2.actual_start BETWEEN bp1.actual_start AND bp1.actual_end
OR bp2.actual_end BETWEEN bp1.actual_start AND bp1.actual_end)
)

Finally, to check for gaps:

CREATE VIEW chk_billing_period_gaps AS
SELECT *
FROM billing_period bp1
WHERE EXISTS (
SELECT *
FROM billing_period bp2
WHERE bp2.actual_start > bp1.actual_end
)
AND NOT EXISTS (
SELECT *
FROM billing_period bp3
WHERE bp3.actual_start = bp1.actual_end + 1
)

None of these views should ever return any results. If any of them does, we have a data integrity problem. The problem may cause our application's views or code to fail, because of violated assumptions.

Because we named all the views according to a convention (they all start with chk_) we can easily write a program that iterates over these views and tests them all. This program could be scheduled to run every day. It could email us results from any check view that returns data.
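
A minimal sketch of such a driver, assuming Oracle and PL/SQL, with DBMS_OUTPUT standing in for the email step:

DECLARE
  l_count NUMBER;
BEGIN
  FOR v IN (SELECT view_name
            FROM user_views
            WHERE view_name LIKE 'CHK\_%' ESCAPE '\')
  LOOP
    -- Count the offending rows in each check view.
    EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM ' || v.view_name INTO l_count;
    IF l_count > 0 THEN
      -- A real job would send an email here instead of just printing.
      DBMS_OUTPUT.PUT_LINE(v.view_name || ': ' || l_count || ' row(s)');
    END IF;
  END LOOP;
END;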

If our database supports it, we can add descriptive comments to the views, such as:

COMMENT ON TABLE chk_billing_period_overlaps IS
'Overlap between two or more billing periods'

This comment would make a nice subject line for an email message.

It's easy to add more check views: just define a view beginning with chk_.

I've gotten into the habit, when I'm designing application code or view logic, of thinking about the assumptions. If an assumption can be reasonably enforced with a database constraint, I will add a constraint. Otherwise, I write a check view for the assumption and add it to the database. And just as I write a unit test to expose a bug I find in my application code, I write a check view to expose a data bug I find in the database.

A scheduled job runs every check view every day, and emails data problems to the team. The views comprise a test suite for our data. The scheduled job gives us continuous integration of sorts. We are alerted to problems virtually as soon as they happen. (Well, the next day.)
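
On Oracle, the scheduling part can be as simple as a DBMS_SCHEDULER job. A sketch, assuming the loop shown earlier is wrapped in a stored procedure called run_check_views (a name invented here):

BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'RUN_CHECK_VIEWS_DAILY',
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'RUN_CHECK_VIEWS',
    repeat_interval => 'FREQ=DAILY;BYHOUR=6',
    enabled         => TRUE);
END;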

We've been running this system at Red Energy for a couple of years now. On one application, featuring about 150 tables, we have a little over 100 check views. I think we should have a lot more. Even so, this system has allowed us to maintain a very high level of data integrity.

Tuesday, June 3, 2008

JAOO Sydney 2008

The message from JAOO 2008 is: Software development is in crisis. To get over it, we're all going to be programming in functional languages and deploying in the Cloud. Maybe fundamentalist functional languages.

Dave Thomas is thought-provoking. But he can't make up his mind whether shiny things are the new one-true-path or the spawn of the devil. He's a great devil's advocate. He'll tell you that he'd rather maintain a legacy COBOL application than a legacy Java app. Then he'll tell you that the "new SQL" is to be found in the query features in LINQ. Then he'll say OO is the best technology for product development, but a terrible thing for enterprise software. He thinks the computer scientists have been running the asylum.

Martin Fowler is an opinionated grump but I like him more and more. It's probably just because he agrees with many of the things I believe. XSLT was a bad technology for writing user interfaces (or whole applications). OO is still a good way to write complex domain logic. You cannot write serious applications with Doodleware. It's better to evolve a framework out of a working application than to start with the framework before writing the application.

There was a video clip of him claiming that it isn't "significantly" harder to design a DSL than to design an API. I have tended to disagree with this, but I am probably confusing DSLs with full-blown languages. If he's talking about the special-purpose "DSL"s such as Grails' GORM DSL, or Ruby-on-Rails, then of course he's right. These are DSLs. They're also glorified APIs in a way. Don't get me wrong: they're great. I'm looking forward to his new book on the subject, though I'm sad to hear that PEAA2 is on the back-burner.

Fowler presented Kent Beck's four features of well-designed software:
  1. Passes the tests
  2. Code shows the intent
  3. No duplication
  4. "Less stuff"
Fair enough. I am interested in whether people feel that tests should be factored the same as production code. These days I am inclined to "show everything" in the test, to make each test easy to understand in isolation. Erik Dörnenburg seems to agree with this. Robert Martin says to factor the tests just the same as production code.

Gregor Hohpe is always fun to watch. He really ought to have a TV show or something. He told us about the "new" ACID properties, for distributed systems:
  • Asynchronous
  • Concurrent
  • Internet-Scale
  • Distributed
And he hoped he got them 50% right...

SOA seems to be "out", according to many on the Enterprise Systems Panel. I guess that means it isn't shiny any more and the only people pushing it now are the out-of-date vendors who haven't caught on yet. I hope they catch on soon. Like most shiny IT technologies, SOA has probably been very useful to some organisations, been a complete failure for others, and is not particularly significant for the rest of us.

I am very skeptical about the "new SQL" of LINQ. It is very cool and neato and so forth, and a small part of me is kind of envious of the .NET camp for this nice stuff coming out of Microsoft. But application data access technologies come and go nearly every year, and SQL is still with us. None of the speakers or other "leading edge" developers at the conference seems to think that ad-hoc queries are important, for either users or developers. I make dozens of them every day. I wonder where these people work.

I talked to a rather drunk ThoughtWorker (at the party) who suggested my enterprise app of 100-odd relational tables would be better factored as two or three smaller apps using hash maps for persistence. I think he was serious. He also claimed that the performance of the Cloud was so great that it could substitute one kind of index for another without you knowing or caring. Maybe so. But why bother with an index at all then? You only use indexes when you care about performance, and even then you need to understand the performance characteristics of the index you are using, if you want to get any benefit from it.

The second day we had Robert Martin. He's also very entertaining. I asked him if his (new) book was as much fun as his talk and he said yes, so I think I will buy it. I didn't expect a talk to change my thinking about functions and clean code, but he may have. For example, he thinks functions should be between one and five lines long. I have a lot that are a bit longer than that, and now I think maybe they are not as good as I thought. He showed that by having really small functions, named very carefully for what they do, the code is really self-documenting. Much more so than if you have, say, 20-30 line functions.

He has these rules for Good Functions:
  • Small
  • Do one thing
  • Use descriptive names
  • No more than three arguments
  • No side-effects

I met Rod Johnson (again), and expressed commiseration that he has to explain the licensing for SpringSource Application Platform so much, over and over again. Even many of the Spring people don't understand it. I have been told by two different Spring guys that companies won't be able to run SSAP without paying for it. I guess I'd better go figure out whom I'm supposed to be paying for this copy of Linux on my laptop then, since it's the same license (GPL). (Of course I understand Spring has a commercial license as well as GPL. It's much the same as MySQL in that respect.)