Tuesday, June 30, 2009

SQL Deadlocks: More with child data

In last week's post, SQL Performance: Child data, I wandered through an issue involving caching data about a parent table's child data. In that post I talked about writing SQL that would save, on the parent table, a foreign key to the most recent child record. This is very simple and would look like this:

begin try
begin tran TxExample

declare @NewChildId int

insert into Child ( blah, blah, blah ) values ( @blah, @blah, @blah )

select @NewChildId = SCOPE_IDENTITY()

update Parent set CurrentChildId = @NewChildId where ParentId = @ParentId

commit tran TxExample
end try
begin catch
if @@TRANCOUNT > 0 rollback tran
end catch

This is relatively straightforward. It inserts the child, then updates the parent's cached data. Those two operations are wrapped in a transaction and a try/catch to ensure that if anything should fail for any reason, both statements will be rolled back. This ensures data integrity.

And now it's time to talk about deadlocks. This code is susceptible to deadlocks. As a relatively contrived but nonetheless possible example, suppose the following SQL could also be run:

begin tran TxExample2

update Parent set blah = @blah where ParentId = @ParentId

select * from Child where ChildParentId = @ParentId

commit tran
If these two queries were to run at the same time, operating on the same @ParentId, and SQL Server were to context switch them at just the right moment, they would deadlock. Specifically, if the first query completed its insert into the Child table and then SQL switched to the second query, we would deadlock.

This is because when the second query tries to select from the Child table, it will wait: the first query has inserted a new row, and SQL Server's default isolation level is read committed, which means dirty (uncommitted) data will not be read. Instead of reading the uncommitted row, the select will wait for the data to be committed. So it's going to sit there, waiting for the first query to commit.

This isn't a deadlock yet. The deadlock happens when SQL switches back to the first query and attempts to execute the update on the parent. To do that, it will try to obtain an exclusive lock on that parent row, but it won't be able to, because the second query already holds an exclusive lock from its update. So it will wait for the second query to commit.

The first query is now waiting for the second query which is waiting for the first query and you have yourself a deadlock.
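The interleaving described above can be sketched as a timeline (session 1 is query #1, session 2 is query #2; this is an illustration, not runnable code):

```sql
-- Session 1 (query #1)                     | Session 2 (query #2)
-- t1: insert into Child (...)              |
--     holds X lock on the new Child row    |
--                                          | t2: update Parent ...
--                                          |     holds X lock on the Parent row
--                                          | t3: select * from Child ...
--                                          |     blocks on session 1's X lock
-- t4: update Parent set CurrentChildId ... |
--     blocks on session 2's X lock         |
-- t5: deadlock: SQL Server picks one session as the victim and rolls it back
```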

Before we fix it, we should ask ourselves, "is this a big deal?" The answer is, it depends, but in general, yes. If all your SQL is small, all your transactions complete quickly, and you don't have very many users banging on the system concurrently, then you probably won't see any deadlocks. But unless you can guarantee that all those conditions will remain the same, you have to be at least a little worried. And if those conditions don't apply to you, you definitely have to be worried.

So how do we fix it? The first thing we could do is commit the transaction in the second query before executing the select. If this is possible, it's a good idea. You want your transaction to commit as quickly as possible, and you want to touch as few objects as you can while in the transaction. That said, there are plenty of reasons why you might not be able to commit the transaction after the update. For example, maybe you're reading the child data because you need it to perform another update, and those two updates have to be in the same transaction. In that case, there is nothing you can do to fix query #2.

But even if you could fix query #2, someone could someday come along and write query #3, which would introduce the same problem again. So what we really need to do is fix query #1. The way we do that is by having query #1 obtain a shared lock on all the resources we know it will need to touch, immediately at the top of the query.

Add this code after the BEGIN TRY:
set transaction isolation level repeatable read
select ParentId from Parent where ParentId = @ParentId
set transaction isolation level read committed

With this code in place, query #2 will not be able to execute its update until query #1 completes, thus preventing the deadlock and saving the day!
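Putting the lock-first fix into query #1 from the top of the post, the whole thing would look something like this (still a sketch with placeholder column names):

```sql
begin try
    begin tran TxExample

    -- Take a shared lock on the parent row up front, and hold it until commit.
    set transaction isolation level repeatable read
    select ParentId from Parent where ParentId = @ParentId
    set transaction isolation level read committed

    declare @NewChildId int

    insert into Child ( blah, blah, blah ) values ( @blah, @blah, @blah )

    select @NewChildId = SCOPE_IDENTITY()

    -- Query #2 can no longer sneak its update in between our insert and this update.
    update Parent set CurrentChildId = @NewChildId where ParentId = @ParentId

    commit tran TxExample
end try
begin catch
    if @@TRANCOUNT > 0 rollback tran
end catch
```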

This example was simple, but the deadlock was still subtle and hard to see. This problem just gets more complicated the more complicated your SQL gets. And your SQL will get more complicated in direct relation to how complicated your data schema is. So you really have to be on the lookout for this issue.

Before I wrap this up, I should mention that if you need to lock more than just one row in one table at the top of your query (like we did in query #1), life can get interesting. If the tables you are locking are all related, you can lock them by inner joining to them. But if they are unrelated, you can't join from one to the next, so you need to execute separate select statements. And if two queries need to lock the same records in two unrelated tables, but they lock them in different orders (A, B vs. B, A), you can end up with a deadlock! For these cases you have to resort to what you learned in your operating systems class: always lock all your resources in the same order. Good luck with that.
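For the unrelated-tables case, "the same order" just means every query agrees on a fixed sequence, for example alphabetically by table name (TableA and TableB here are placeholders, not tables from the example schema):

```sql
-- Sketch: every transaction that touches both tables locks TableA first, TableB second.
set transaction isolation level repeatable read
select AId from TableA where AId = @AId   -- always lock A first
select BId from TableB where BId = @BId   -- always lock B second
set transaction isolation level read committed

-- ...the actual updates to TableA and TableB go here, inside the same transaction...
```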

I'll leave you with some rules of thumb, which apply to most cases but, of course, not all:
  1. Keep your transactions as small as possible by touching as few objects as possible
  2. Keep your transactions as fast as possible: if you have a query that can execute on n records in a single transaction where n is unbounded you are likely to find yourself in a world of hurt
  3. Obtain shared locks on everything your transaction will eventually require exclusive locks on before you acquire any other locks
  4. If you need to do any reads that don't need to be repeatable, do them before you obtain any shared or exclusive locks (this is really just in keeping with #2)
  5. If you set the transaction isolation level to repeatable read, make sure you're setting it back to read committed (even if it's the last line of your query; this will make sure triggers don't execute in repeatable read)
Now, all of this has been learned by trial and error, experimentation, and a lot of reading. If you know of other ways around these issues, or if you have a different take, I definitely want to hear about it.

Monday, June 29, 2009

SQL Performance: Child data

In this post I’d like to talk about a specific issue in SQL and all the various ways you could approach it. Specifically, I’d like to talk about dealing with a parent entity’s child data. As an example, let’s use a very simple document database that stores documents and their versions. Obviously, a document will have many versions but each version will have only one document, as in this diagram:

Document Db Diagram

This is not a complete database, clearly, but it indicates that the DocumentVersion table is a child of the Document table.

Now we get to the part where this gets at least partly interesting. Let’s say we’re going to write a search that returns documents. In the results we want to display information about the current version of each document as well as the total number of versions for each document.

This is a surprisingly non-trivial query…

select d.Id, d.Name, v.Id, v.FileName, v.Extension, v.FileSize,
( select count(Id) from DocumentVersion where DocumentId = d.Id ) as numVersions
from Document d
inner join DocumentVersion v on d.Id = v.DocumentId
where v.Id = ( select max(Id) from DocumentVersion where DocumentId = d.Id )

Now, there are a bunch of ways to write this, but this is a perfectly good example. Notice we have an inner select in the select clause to get the number of versions, and we have another select in the where clause to get the “latest” version. Here I’m depending on SQL Server’s Identity Specification to give me the latest row because it simplifies the query. If we didn’t want to do that, I’d have to either “select top 1” while ordering by the inserted date (which isn’t on the table in our example) or use a row number function and get the row where the row number = 1, again ordered by the inserted date. Both of these queries are correlated, meaning they're run for each document in our results.
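For comparison, the row number variant mentioned above might look something like this (it assumes an InsertedDate column that isn't actually on our example table):

```sql
select d.Id, d.Name, v.Id, v.FileName, v.Extension, v.FileSize, v.NumVersions
from Document d
inner join (
    select *,
        -- Number each document's versions, newest first, and count them in one pass.
        row_number() over ( partition by DocumentId order by InsertedDate desc ) as rn,
        count(*) over ( partition by DocumentId ) as NumVersions
    from DocumentVersion
) v on d.Id = v.DocumentId
where v.rn = 1
```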

This query is ugly, but it works. We could optimize it and tweak the way it’s written to try to get the best possible performance out of it. But is this really the best way to do this? If we think about it, we’re going to be looking at all the versions for every document returned in our search. The more documents we return, the worse this is. But worse, we’re going to do WAY more reads than updates in this case. New versions simply are not going to be added that often. So it seems silly to be constantly looking up information about the versions over and over and over and over again when we know it’s unlikely to have changed from the last time we looked at it.

Wouldn’t it be better to cache this information on the Document table so we don’t have to keep calculating it repeatedly, thereby simplifying the query and improving its performance?

To do this, we simply add “NumVersions” and “CurrentDocumentVersionId” columns to the Document table. But now we have to keep these columns up to date. There are a few ways to do this:

  1. Trigger on DocumentVersion updates Document’s cached columns on Insert/Delete
  2. Code that does inserts or deletes to DocumentVersion must update cached columns
  3. Cached columns are calculated columns that use a function to lookup values
We'll take these in turn. #1, using triggers, has these benefits:
  • Ensures the columns will always be up to date, no matter how the version records are changed
  • Code will be in just one place and we won't have to worry about it again
However, triggers have these drawbacks:
  • Slow (like, REALLY slow. Batch operations like inserting or deleting many versions will slow to a crawl)
  • Potential for subtle and hard to track down deadlock problems
  • Increased code complexity because the trigger must be written to handle ALL cases, even if you only use some (ex: inserting many versions at once)
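To make that last drawback concrete, a set-based sketch of option #1's insert trigger might look like this (the delete trigger would be similar; names follow the example schema but the body is my own illustration):

```sql
create trigger trg_DocumentVersion_Insert on DocumentVersion
after insert
as
begin
    set nocount on

    -- Must be set-based: 'inserted' can hold many rows when
    -- versions are inserted in a batch.
    update d
    set NumVersions = ( select count(*) from DocumentVersion
                        where DocumentId = d.Id ),
        CurrentDocumentVersionId = ( select max(Id) from DocumentVersion
                                     where DocumentId = d.Id )
    from Document d
    where d.Id in ( select DocumentId from inserted )
end
```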
On #2, updating cached columns when updating Versions, we have these benefits:
  • Simplest possible code
  • Performant
  • Deadlock issues are easier to see and handle
But it comes with its own downsides:
  • Same code may end up in many places (ex: Insert and Delete stored procedures, if using sps)
  • Potential for error if someone inserts/deletes versions from a new location and forgets to update the cached columns
#3, using calculated columns, is the same as putting the lookup logic in the query (since you can't persist the value) but has the overhead of a function.
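A sketch of option #2, with the cached-column maintenance inlined in an insert stored procedure (the procedure name and parameters are made up for illustration):

```sql
create procedure InsertDocumentVersion
    @DocumentId int, @FileName varchar(255), @Extension varchar(10), @FileSize int
as
begin try
    begin tran

    insert into DocumentVersion ( DocumentId, FileName, Extension, FileSize )
    values ( @DocumentId, @FileName, @Extension, @FileSize )

    -- Keep the cached columns in step with the insert, in the same transaction,
    -- so readers never see them out of date.
    update Document
    set NumVersions = NumVersions + 1,
        CurrentDocumentVersionId = SCOPE_IDENTITY()
    where Id = @DocumentId

    commit tran
end try
begin catch
    if @@TRANCOUNT > 0 rollback tran
end catch
```

The cost is exactly the drawback listed above: the delete stored procedure has to repeat this maintenance, and anyone writing to DocumentVersion outside these procedures will silently break the cache.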

So, between #1, #2, and #3, which is the right option?

I used to use triggers, out of fear that someone would forget to update the columns if I went with #2. But the performance and deadlocking issues with triggers have now caused me to go with the "API layer" approach of #2.

I think the answer, as always, is it depends. If the tables you're using are likely to be always accessed through an API layer, then you should go with #2. But if many people will be manipulating those tables from many different areas and there is no central API layer, you're pretty much forced to go with #1.

And the question remains, is it really worth caching the data this way, or should you just keep the lookups in the queries. Once again, my favorite theme for this blog: it depends. The big question is really performance, and that depends on how the queries will be used. Are you going to be returning thousands of results, or just hundreds? Are you going to be running this query often?

In SQL there is no one size fits all rule. And worse, SQL is so complex it has to be treated as a black box, meaning you really can't reason about it. Therefore, your only hope is to test and test and test. You pretty much have to write the query every way you can imagine and then performance test each one... And that takes a lot of time.

As Scott Hanselman would say, "Dear Reader," what do you think? Have you been faced with this issue? What did you do?

Monday, June 22, 2009

The Paradox of Quality

In college I had a professor named Dr. Lang. He was a great guy, and he was full of great quotes. Some of them were just inane, but others were serious and insightful. Though, usually even those were fairly silly. One of those was, "All programmers are control freaks."

There is a lot of truth to that, but I think there is a deeper reason that causes that control freak behavior. The control freak behavior is just a side effect of a deeper motivation, namely that developers are quality freaks.

The book "First, Break All The Rules" is a really great "management" book that focuses on the concept of "talents." They use the word talent in a different way than we typically use it in conversation, and they break it down into different kinds of talents. Of particular interest is what they call Striving talents,
"Striving talents explain the why of a person. They explain why he gets out of bed every day, why he is motivated to push and push just that little bit harder. Is he driven by his desire to stand out, or is good enough good enough for him? Is he intensely competitive or intensely altruistic or both? Does he define himself by his technical competence, or does he just want to be liked?"

Given this definition, I think we can make the case that developers have the Quality Striving Talent. This is what drives us to want our code to be perfect and our UIs to be perfect and our architecture to be perfect and our designs to be perfect and on and on. Without this talent, I think a developer would go completely insane!

But, quality is not a very straightforward concept, which brings us to:

Paradox #1: Programmers vs Management
Programmers, being so quality driven, want quality in everything they do. Like, the aforementioned, code, UI, architecture, design, etc.

On the other hand management wants a product that people will buy. Obviously, quality factors in here too: people will want quality in the product they buy, so management will want quality in the product they sell. But management doesn't need quality that goes as deep as the programmers are interested in. After all, very little of the details of programming are visible to the users or to management.

The paradox here is clear, and it explains an awful lot of the friction you can encounter between programmers and management.

Paradox #2: Quality is in the eye of the beholder
For example, one person may like a UI with bright colors, gradients, shadows, reflections, slidy animations, and some content mixed in where it fits. Other people might look at that UI and see distracting noise and bloat. They would prefer black-on-white, densely packed text emphasizing content.

To each his own I guess. But, which of these UIs is higher quality?

Which of these paintings is higher quality?

Picasso - Weeping Woman  da Vinci - Mona Lisa

Paradox #3: Defining Quality
This has been a lot of talk about quality, but what is quality? Here are some factors you might look at to define the quality of an application:

  1. Bug count
  2. Learnability
  3. User Friendliness
  4. Appearance
  5. Feature parity
But there's a problem: some of these are competing concepts. My favorite example, as always, is Vim and Notepad. If you sit a normal person down in front of Vim they won't even be able to get text into it. But watch a real Vim wizard work and you'll think it's magic. Vim is a powerfully usable editor, but it's very very difficult to learn. On the other hand, any moron can use Notepad, but the morons and the experts both use it the same way. It's learnable, but not terribly powerful, and therefore not nearly as usable.

What about factors for code quality:
  1. Bug count
  2. Simplicity
  3. Cohesion
  4. Readability
  5. Lines of code
  6. Style
  7. Broken Windows
Again, we find competing concepts! Take simplicity and cohesion for example. To achieve cohesion you may follow the SOLID design principles, all of which involve increasing abstraction, adding interfaces, and so forth. But all of these things detract from simplicity and increase the number of lines of code. Paradox!

Realizing that quality means different things to different people at different times is important to remaining sane. Especially when you're a quality driven developer. But as we've seen, quality is a slippery contradictory customer, and this really makes things non-trivial. One thing that might help is focusing on an intended audience to help you decide just what quality matters the most to that audience.  By focusing on an imaginary third party’s quality needs you can remove some of your own quality driven needs and find acceptable compromises.

Tuesday, June 9, 2009

An Anemic Community

I am a .NET developer. I've dabbled with Ruby and Python, learned Java and C++ in college, but there's no two ways about it, I'm a .NET dev. Occasionally I think it would be nice to work at one of those dev shops that hires out bodies to companies that need them so that I'd get to work with more languages. But most of the time I like being specialized, it means I have time to really learn my language and its tools. And it should mean that I turn out higher quality product because of it.

.NET is a great platform. The language is very much at the front of the curve as far as strongly typed languages go, and with C# 4.0 coming it will soon be incorporating a number of dynamic language concepts as well. Plus, as IronRuby and IronPython mature, it may become feasible to work with dynamic languages for some things and static for others, which is cool. Add to this picture LINQ, Windows Forms, WPF, WCF, ASP.NET and MVC and you're looking at a pretty compelling platform.

Now, the platform itself has its issues. The major one being that the tooling is targeted at the "least common denominator" developer, which can make things interesting for a dev working with non-trivial applications in a non-trivial environment. But, that's a topic for another blog post. For our purposes here, let's agree that .NET as a platform is pretty darn good.

Unfortunately, the .NET community kind of sucks. There has been some improvement here in recent years with things like ALT.NET, but for the most part, the community remains pretty anemic. Compare it with the Java community, for example. The vast majority of the non-Microsoft products in .NET are ports from Java: NHibernate and NUnit, for example. Or compare it with jQuery, where the number of quality plugins and code samples is just ridiculous. It's a bloody fruited plain!

Now take a look at the blog posts that turn up from Google searches for .NET stuff. Especially around WPF and WCF, all you find is amateur-quality stuff. Look at CodePlex: all you find are tons of discontinued projects people started, and what work remains there is terrible. And don't even get me started on the half solutions that riddle Code Project.

Why is this? There are a large number of .NET developers. There are all kinds of conferences and user groups. So why would it be that there is so little happening on the web with .NET? And why is it that what is happening is so bad?

I think it's because .NET is a corporate community. The people who are sharing their work or writing code sample blog posts are either amateurs, college kids, or "low ranking" people just getting started in their careers. The actual "professionals" who are doing good work are doing it for some company or other and are unable to share it on the web! The few who actually do get involved on the web stick with waxing philosophical on their blogs (kind of like what I do...) and don't really add much value. Worse, if there were projects available, many companies wouldn't use them simply because they weren't Microsoft's.

The corporate view point is that any ideas or tools or products that have been developed by their people represent a competitive advantage. Giving away your advantage in the interest of helping the community is just stupid. Who would do such a thing?

Because of this there is little organizing going on. People are not starting useful and interesting projects and finding contributors. They can't, for fear that their employer will find out and they'll get into some kind of trouble they don't understand.

This is a real problem for .NET. I think that's why Microsoft works so hard at planting .NET bloggers in the community, like some kind of covert operation run by the CIA. That's helpful, because those people are writing good useful stuff. But it all falls short of any kind of organizing. And it also tends to keep its sights quite low, not really trying to contribute anything impressive.

The fact that no one is sharing really lowers the average state of the art across the board. The ridiculous strides taken by Ruby on Rails and jQuery are examples of how much an open-oriented platform can do. It makes it easier for people to jump on board and do non-trivial things. It also makes it more likely that as those people gain experience they'll contribute back. If it goes well, you can get a kind of snowball effect where the people keep getting better and better and so does the technology.

I'm unclear on how the .NET world could begin to do this, since it's so corporate driven. All I know is the low quality material makes a .NET developer's life much, much harder. And at times, it's really frustrating.

Monday, June 8, 2009

Thinking like a TDDer

Welcome to my 100th post! I'm going to return to a topic that I've touched on many times in the past, namely, unit testing.

Unit testing is one of those things that is very attractive in principle but gets ugly real fast in practice. I've tried it countless times but never managed to follow through. But in principle, I'm a fan. Having tests to fall back on to help prevent regression and to make refactoring easier and less stressful is a big win. And following the TDD process is very rewarding. Every couple minutes you get another green bubble, encouraging you, telling you you're making progress and doing a good job. Plus, unit testing reinforces the benefits of following good design practices like loose coupling, DRY, and Do One Thing (which are all the same...).

For me personally, unit testing still has 2 primary stumbling blocks:
  1. Mocking
  2. "Maintenance" development
I've talked about many other "pitfalls" including Mocking in an earlier post called Unit Testing Pitfalls. I still think those are true, but I'm coming to terms with the ways around some of them. However, mocking still drives me crazy. I hate having to specify every method and/or property that my object under test is likely to call on the dependency. On the other hand, I hate writing Stub objects because it takes time and effort and you frequently make mistakes that cause tests to fail. Maybe I just need to learn all the features of RhinoMocks (if only the documentation was better) or learn Moq or something. If I find a way to resolve this, I'll let you know.

The second issue is "Maintenance development," by which I mean, going back to code you've written and changing it. When I'm first working on some code I find TDD very easy, but when I have to go back to it to fix bugs or add features I always go straight for the code and cause my Unit Tests to be out of date. And therefore, worthless.

A Continuous Integration server which ran the Unit Tests would force me to keep the tests up to date, but I don't like to be forced to do things. In fact, I'd probably just start commenting out the failing tests so I could get my bug fix checked in. Yeah, I know! Hideous behavior right? Should be punishable by death. But I'm lazy, and so are you.

My problem is that I'm still not thinking like a TDDer. If a TDDer needed to fix a bug, they would start by adding a failing test, then they would work until the test passed, THEN they'd try to repro the bug and make sure it was gone. That's the process they would follow, but what's really interesting is the mindset that process requires.

You've probably heard people say that tests are as good as if not better than documentation. I've even heard people says tests can replace specifications! These people are completely insane! Lock them up in a padded room, throw away the key, and please please please don't let them write any more blog posts!

Ok, I know, they aren't REALLY crazy, they're just using the words to mean something different than what the words REALLY mean. Welcome to software development, we love overrides so much we can't resist overriding words in the English language!

So if you fight through the crazy and think about the documentation argument for a moment, that unit tests are documentation, you can see where this might make sense. First, assume we're talking about API documentation and not end user documentation, obviously. I'd like to see an end user figure out which button they need to click on the screen by reading a bunch of unit tests. Don't worry, I'm sure they'll re-up your contract next year!

Anyway, you have a bunch of tests like, Apply_is_enabled_when_required_fields_are_filled() and so forth. A smart enough and patient enough person could read through tests like this and figure out what the object under test does and how to code against it. In fact, because of the volume of code in the tests, they'd be able to understand and extract much more meaning than they could from some MSDN style documentation.

But can you imagine if the only documentation Microsoft published for the .NET framework was its unit tests? The MSDN style docs are hard enough most of the time! Unit tests ARE NOT documentation! But they are incredibly documentative (made up words are fun!), and they are very very useful for someone who intends to work on the object under test. So, it makes a lot of sense to think about them this way. That is, it's a useful mindset.

What about the argument that they are specifications? This one is even more off base. A specification is what you write up front to describe what the software should do and how it should work and basically what it should look like. Yeah, Agile people jump up and down and say you don't need specs. "Just write a story card!" they squeak. "Pin it to your wall!" Sorry little agile buddy, your story card is not a spec. It's not even the replacement of a spec. At some point you're still sitting down and figuring out what the software needs to do, and deciding how it should work. Maybe you're meeting with your "customer representative" and you show him your little story card and he starts rambling on about the details of that story while you furiously scribble notes on a legal pad. Then you sit down with your programming buddy (read: "pair") and you start writing code. Well, the scribbles on your note pad are the spec.

Despite all the condescending language in that paragraph, this is actually fine by me. As long as you manage to capture enough detail in that meeting and are able to think through the consequences of your implementation decisions as well as the issues of integrating this part with the other parts of your software, this is a perfectly acceptable way of producing specifications. You might want to be a bit more diligent if you are writing the software that flies the space shuttle, but for most business and banking apps, I think you'll be just fine.

But, back to my point, unit tests are not Specifications. However, they do specify how the object under test should behave. In fact, they *should* specify absolutely everything about how that object should behave, if they're going to be truly effective. And this again indicates something about the mindset of the TDDer.

If unit tests are really going to work, they have to be more than regression tests. And they have to be more than a crutch you lean on during "initial development." They have to be both documentation and specification. This mind set leads you to all kinds of realizations.

For example, to be effective documentation and specification, they have to focus on behavior and they have to have good names. So, "Add_returns_false_null_person" is not a good test name. Maybe you should have gone with "Add_fails_when_person_is_missing."

If you're thinking about your tests this way and someone finds a bug, what's the best way to approach it? Not diving into the code and looking at stack traces trying to find the line of code that's in error, no sir! If there's a bug, you must have missed something in your specification of the problem. So what you're going to do is go look through your spec, find the missing piece that explains the bug, then update your spec.

What this means is that your goal when you're writing your tests shouldn't really be code coverage. And it shouldn't be to test every input/output combination. It should be to fully describe and specify all the required and expected behavior of the object you are testing. These should more or less turn out to be the same thing, but because the mindset is so completely different the details will be different. The details will be better. And you might actually stand a chance at keeping those tests up to date. Heck, you might even enjoy keeping them up to date instead of feeling like it's a chore.

Well, I don't know, that's asking a lot.

Is changing the way you approach your tests going to keep the real life details from getting ugly? Nope. You're still going to end up with test code that is ridiculously repetitive because you have to test the same method with these inputs, and those inputs, and those other inputs. And you'll still have to update all this repetitive code when you decide to change the type of one of the method parameters. And you'll still struggle with different ways to refactor your tests to cut down on the repetitiveness, which adds more abstraction and occasionally doesn't work out so well. And don't forget, you'll still be mocking out the dependencies. You know how I feel about mocking.

But now that your tests are more valuable, at least in your mind, this might all be worth it. I guess we'll just have to try it and see.