I am unsure how to title this post, but it feels like a good idea to share what happened so that you have some steps to follow if you are in a bit of a panic.

First, here is what happened. Earlier today, I deployed a change that adds a single column to a table: no backfill on the column, no new indexes, nothing. It’s a blank column on a relatively small (20k rows) table. I even held back the code that would use the new column for a second deployment.
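
For context, the change boils down to a single ALTER TABLE. The app, table, and column names below are placeholders, but this is a sketch of the kind of SQL the migration amounts to, as you could run it by hand through pg:psql:

```
# Roughly the SQL the migration amounts to. "my-app", "widgets", and
# "archived_at" are placeholder names for illustration only.
heroku pg:psql -a my-app -c \
  "ALTER TABLE widgets ADD COLUMN archived_at timestamptz;"
```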

Our Heroku deploys have been taking quite a while recently (15 minutes or so). I haven’t had time to dig into why yet, so I wasn’t overly concerned about the duration initially. Then I got the first ping that something was offline (the Sidekiq error queue). That one felt unrelated; our monitoring notifies us anytime the error queue has more than 100 items. Shortly after, I received notifications that other services were unavailable (timing out). A quick check on the recent deployment showed that the release phase had been running for 20 minutes.
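
If you want to do the same check, the releases list shows whether the most recent release has actually finished (the app name here is a placeholder):

```
# List recent releases; a release whose release phase is still running
# shows as pending rather than deployed.
heroku releases -a my-app
```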

At this point, I decided it was time to kill the deployment. I Command+D’d the terminal and checked the current processes on Heroku (heroku ps). I could see the release command was still running. The next thing to check was the database. I could console into the database, and other apps that use this same database were all functioning as expected. In addition, as far as I could tell, all the background jobs for this app were still running as expected (we use Sidekiq+Redis, but ultimately all work is done against the PG database).
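
For reference, this is roughly the pair of checks, again with a placeholder app name:

```
# List running dynos; a stuck release phase shows up as a release.N dyno.
heroku ps -a my-app

# Open a psql console against the attached database to confirm it is
# reachable and responsive.
heroku pg:psql -a my-app
```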

To be safe, I ran pg:diagnose and could see long-running queries against the table I was attempting to migrate.
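
pg:diagnose includes a long-running queries check, and pg:ps lists the queries currently in progress, which is a quick way to eyeball what is sitting on a table (app name is a placeholder):

```
# Run Heroku's built-in database health checks, including the
# long-running queries check.
heroku pg:diagnose -a my-app

# List queries currently in progress, along with how long they have
# been running.
heroku pg:ps -a my-app
```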

Next, I focused on killing the release phase process. In nearly 12 years of using Heroku, I have never had to kill a running process. I found references to ps:stop and ps:kill. Both reported they worked, but running heroku ps, I could see the process was still running. It turns out that you need to include the full dyno name, process type and number: heroku ps:kill release.2343. Better output here would have been helpful.
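
In other words, the command that finally worked takes the exact dyno name as printed by heroku ps (the release number below is just the one from my deploy):

```
# The dyno name comes straight from the "heroku ps" output; the type
# alone ("release") was not enough in my case.
heroku ps:kill release.2343 -a my-app
```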

While this killed the process, the app’s state did not improve. I restarted the processes, which again did not fix the problem. Finally, I figured something was off with the app state, so I rolled back to a previous release (note: the new release never finished, so the new code was never actually live). This appeared to fix things for a few seconds, but everything on the main app again began to time out.
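
For completeness, these were the two recovery attempts. The app name and release number are placeholders; heroku releases shows the real numbers:

```
# Restart every dyno for the app.
heroku ps:restart -a my-app

# Roll back to a previous release (v122 is a placeholder version).
heroku rollback v122 -a my-app
```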

I checked heroku pg:diagnose again and could see the same long-running queries were still there. There were about 40 or so of them, but I couldn’t get the output into a state where I could quickly grab the PIDs to kill each query individually, so I went ahead and ran heroku pg:killall. After this, I restarted the main app (and related apps), and everything appears to be working well.
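
If you do want to kill queries one at a time instead of reaching for the big hammer, the PIDs are in pg_stat_activity and heroku pg:kill takes a single PID. A rough sketch (app name and PID are placeholders):

```
# List the PID, run time, and text of queries that have been running
# for more than a minute.
heroku pg:psql -a my-app -c \
  "SELECT pid, now() - query_start AS runtime, query
     FROM pg_stat_activity
    WHERE state <> 'idle' AND now() - query_start > interval '1 minute'
    ORDER BY runtime DESC;"

# Cancel a single query by its PID (add --force to terminate the whole
# connection instead)...
heroku pg:kill 12345 -a my-app

# ...or, as a last resort, terminate every connection to the database.
heroku pg:killall -a my-app
```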

So the takeaways:

  1. Never deploy before coffee. The mind is not ready for this kind of stress.
  2. My best guess is that the connection pool for the main web app somehow got into a bad state, and killing all the connections allowed it to reset (see the sketch after this list).
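
I can’t prove the connection pool theory after the fact, but if this happens again, the first thing I’ll look at is the connection count before and after the reset. A rough sketch (app name is a placeholder):

```
# pg:info includes the current connection count for the database.
heroku pg:info -a my-app

# Or count connections per state directly.
heroku pg:psql -a my-app -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"
```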

I still have to deploy again, but I assume this was a freak occurrence.