In this post I’d like to talk about a specific issue in SQL and the various ways you could approach it. Specifically, I’d like to talk about dealing with a parent entity’s child data. As an example, let’s use a very simple document database that stores documents and their versions. Obviously, a document will have many versions, but each version will belong to only one document, as in this diagram:
This is not a complete database, clearly, but it indicates that the DocumentVersion table is a child of the Document table.
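For concreteness, here’s a minimal sketch of what that schema might look like in SQL Server. The post doesn’t show the actual DDL, so everything beyond the key columns (like Name) is an assumption for illustration:

```sql
-- Hypothetical minimal schema matching the diagram; real tables
-- would have more columns (content, dates, etc.).
create table Document (
    Id int identity(1,1) primary key,
    Name nvarchar(200) not null  -- assumed column, illustration only
);

create table DocumentVersion (
    Id int identity(1,1) primary key,               -- identity gives us insert order
    DocumentId int not null references Document(Id) -- many versions per document
);
```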
Now we get to the part where this gets at least partly interesting. Let’s say we’re going to write a search that returns documents. In the results we want to display information about the current version of each document, as well as the total number of versions for each document.
This is a surprisingly non-trivial query…
select
    d.*,
    v.Id as currentVersionId,
    ( select count(Id) from DocumentVersion where DocumentId = d.Id ) as numVersions
from Document d
inner join DocumentVersion v on d.Id = v.DocumentId
where v.Id = ( select max(Id) from DocumentVersion where DocumentId = d.Id )
Now, there are a bunch of ways to write this, but this is a perfectly good example. Notice we have an inner select in the select clause to get the number of versions, and another select in the where clause to get the “latest” version. Here I’m depending on SQL Server’s Identity Specification to give me the latest row, because it simplifies the query. If we didn’t want to rely on that, I’d have to either “select top 1” while ordering by the inserted date (which isn’t on the table in our example) or use a row number function and take the row where the row number = 1, again ordered by the inserted date. Both of these subqueries are correlated, meaning they’re run once for each document in our results.
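For illustration, here’s roughly what the row-number variant would look like. Note the InsertedDate column is hypothetical; as mentioned above, it isn’t on the example table:

```sql
select *
from (
    select
        d.Id,
        v.Id as currentVersionId,
        ( select count(Id) from DocumentVersion where DocumentId = d.Id ) as numVersions,
        row_number() over (
            partition by v.DocumentId
            order by v.InsertedDate desc  -- hypothetical column
        ) as rn
    from Document d
    inner join DocumentVersion v on d.Id = v.DocumentId
) x
where x.rn = 1
```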
This query is ugly, but it works. We could optimize it and tweak the way it’s written to try to get the best possible performance out of it. But is this really the best way to do this? If we think about it, we’re going to be looking at all the versions of every document returned in our search. The more documents we return, the worse this gets. Worse still, we’re going to do WAY more reads than updates in this case. New versions simply are not going to be added that often. So it seems silly to keep looking up information about the versions over and over and over again when we know it’s unlikely to have changed since the last time we looked.
Wouldn’t it be better to cache this information on the Document table so we don’t have to keep calculating it repeatedly, thereby simplifying the query and improving its performance?
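As a sketch of the payoff: with the count and current-version id cached on Document (the column names here are just placeholders), the search query collapses to a plain single-table read:

```sql
-- No correlated subqueries, no join against DocumentVersion at all
select
    d.Id,
    d.CurrentDocumentVersionId,
    d.NumVersions
from Document d
```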
To do this, we simply add “NumVersions” and “CurrentDocumentVersionId” columns to the Document table. But now we have to keep these columns up to date. There are a few ways to do this:
1. Trigger on DocumentVersion updates Document’s cached columns on Insert/Delete
   - Pros:
     - Ensures the columns will always be up to date, no matter how the version records are changed
     - Code will be in just one place and we won’t have to worry about it again
   - Cons:
     - Slow (like, REALLY slow. Batch operations like inserting or deleting many versions will slow to a crawl)
     - Potential for subtle and hard-to-track-down deadlock problems
     - Increased code complexity, because the trigger must be written to handle ALL cases, even if you only use some (ex: inserting many versions at once)
2. Code that does inserts or deletes to DocumentVersion must update the cached columns
   - Pros:
     - Simplest possible code
     - Deadlock issues are easier to see and handle
   - Cons:
     - Same code may end up in many places (ex: Insert and Delete stored procedures, if using sps)
     - Potential for error if someone inserts/deletes versions from a new location and forgets to update the cached columns
3. Cached columns are calculated columns that use a function to look up values
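To make option #1 concrete, here’s a rough sketch of such a trigger in SQL Server syntax. The trigger name is mine, and it assumes the NumVersions and CurrentDocumentVersionId columns already exist on Document; note it recalculates per affected document, so multi-row inserts and deletes are handled:

```sql
create trigger trg_DocumentVersion_SyncCache
on DocumentVersion
after insert, delete
as
begin
    set nocount on;

    -- Recalculate for every document touched by this statement,
    -- so batch inserts/deletes are handled correctly.
    update d
    set NumVersions = ( select count(*) from DocumentVersion dv
                        where dv.DocumentId = d.Id ),
        CurrentDocumentVersionId = ( select max(dv.Id) from DocumentVersion dv
                                     where dv.DocumentId = d.Id )
    from Document d
    where d.Id in ( select DocumentId from inserted
                    union
                    select DocumentId from deleted );
end
```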
So, between #1, #2, and #3, which is the right option?
I used to use triggers, out of fear that someone would forget to update the columns if I went with #2. But the performance and deadlocking issues with triggers have since pushed me to the "API layer" approach of #2.
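Here’s a sketch of what that #2 approach can look like as a stored procedure. The procedure name and the transaction handling are mine, not from any real system, and a real version table would take more parameters than this:

```sql
create procedure InsertDocumentVersion
    @DocumentId int
as
begin
    set nocount on;
    begin transaction;

    insert into DocumentVersion ( DocumentId )
    values ( @DocumentId );

    -- Same transaction, so the cached columns can't drift
    -- out of sync with the version rows.
    update Document
    set NumVersions = NumVersions + 1,
        CurrentDocumentVersionId = scope_identity()
    where Id = @DocumentId;

    commit transaction;
end
```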
I think the answer, as always, is: it depends. If the tables you’re using will almost certainly always be accessed through an API layer, then you should go with #2. But if many people will be manipulating those tables from many different areas and there is no central API layer, you’re pretty much forced to go with #1.
And the question remains: is it really worth caching the data this way, or should you just keep the lookups in the queries? Once again, my favorite theme for this blog: it depends. The big question is really performance, and that depends on how the queries will be used. Are you going to be returning thousands of results, or just hundreds? Are you going to be running this query often?
In SQL there is no one-size-fits-all rule. And worse, SQL is so complex that it has to be treated as a black box, meaning you really can’t reason about it from first principles. Therefore, your only hope is to test and test and test. You pretty much have to write the query every way you can imagine and then performance test each one… and that takes a lot of time.
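On SQL Server, the usual way to run that comparison is to turn on the I/O and timing statistics and compare the logical reads and CPU time each candidate query reports:

```sql
set statistics io on;
set statistics time on;

-- run candidate query #1 here, note logical reads / CPU time
-- run candidate query #2 here, and compare

set statistics io off;
set statistics time off;
```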
As Scott Hanselman would say, "Dear Reader," what do you think? Have you been faced with this issue? What did you do?