Today we were down for at least 25 minutes (according to Nagios; the actual outage was probably longer).
There were three contributing factors to our downtime:
- Connecting to the production DB from development
- Using Rails' dependency reloading in a persistent process
- Using Rails' multiple-connection feature with a class under development
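Rails does not clean up the ActiveRecord connections held by child classes when those classes are automatically reloaded. Instead, the old connections just 'leak' and a new connection is created. In a persistent process, the leaked connections keep piling up for as long as the process runs.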
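To make the leak easier to picture, here is a toy model in plain Ruby. It is not Rails' actual implementation: `ConnectionRegistry`, `fresh_model_class`, and `leaked_connections_after` are hypothetical names invented for this sketch, and `Object.new` stands in for a real database connection. The only assumption it illustrates is that connections are tracked per class object, so a reload (which produces a new class object) gets a new connection while the old one stays open.

```ruby
# Toy stand-in for a per-class connection table (NOT Rails internals).
class ConnectionRegistry
  def initialize
    @connections = {}
  end

  # Return the connection for this exact class object, opening one if needed.
  def connection_for(klass)
    @connections[klass] ||= Object.new # placeholder for a real DB connection
  end

  # Number of connections currently held open.
  def open_count
    @connections.size
  end
end

# Simulate a development-mode reload: every reload yields a brand-new class object.
def fresh_model_class
  Class.new
end

# Reload the "model" the given number of times and report how many
# connections are still held afterwards.
def leaked_connections_after(reloads)
  registry = ConnectionRegistry.new
  reloads.times { registry.connection_for(fresh_model_class) }
  registry.open_count
end
```

In this model a class that is never reloaded keeps reusing a single connection, while every reload adds one more that nothing ever closes.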
Because we were connecting to the production DB, we ended up using more connections than the DB could provide. (We actually need to trim this limit: it wants to use 1.4G with 250 connections, which is too much and too many.)
The DB ran out of something (the logs were not verbose enough to tell us what), and when we tried to kill it, it wedged itself trying to shut down. A forced restart brought the DB back up, and restarting Apache brought us back online.
Fortunately, we will no longer be attempting to connect to the production DB from development machines (this was a not-so-subtle hint that this is a Bad Thing™), eliminating all of the above factors as possible future failure points.
The price to pay will simply be one of developer ease-of-use, which we'll have to live with. (I've been 100% comfortable working this way all along.) It also means that I won't ever need an internet connection to do useful work on the site.
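The rule of never connecting to the production DB from development could also be enforced in code rather than by convention. The following is a minimal sketch, not something we have deployed: the host name `db.production.example`, the `PRODUCTION_DB_HOSTS` list, `UnsafeDatabaseError`, and `assert_safe_database!` are all hypothetical names made up for illustration.

```ruby
# Hypothetical production DB host; the real host name is not given in this post.
PRODUCTION_DB_HOSTS = ["db.production.example"].freeze

# Raised when a non-production environment points at a production DB host.
class UnsafeDatabaseError < StandardError; end

# env:       environment name, e.g. "development" or "production"
# db_config: hash of connection settings, e.g. { host: "localhost" }
# Returns true when the combination is allowed, raises otherwise.
def assert_safe_database!(env, db_config)
  host = db_config.fetch(:host, "localhost")
  if env != "production" && PRODUCTION_DB_HOSTS.include?(host)
    raise UnsafeDatabaseError, "Refusing to connect to #{host} from #{env}"
  end
  true
end
```

A check like this could run once at boot, before any connection is established, so that a misconfigured development machine fails loudly instead of quietly eating production connections.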