Monday, October 6, 2014

pg_repack pitfalls

pg_repack is a terrific tool for allowing you to reorganize a table without needing to hold long running strong locks on the table. That means that that your normal inserts, updates and deletes can continue to run against the table while the reorganization is proceeding.

I have had clients who have run into problems with it, however. In particular, it is possible to get it wedged so that the table is inaccessible and nothing can proceed, unless you either kill the repack operation or kill what is blocking it. Here is a simple example of how to cause problems.

In session 1, do:
pg_repack -e -t foo dbnameset

and in session 2 in psql do:
select pg_sleep(10); lock table foo; rollback;
The sleep gets us past the time when pg_repack is setting up, and happens while it is is doing its CREATE TABLE ... AS SELECT .... When that CREATE TABLE statement finishes, both sessions will be wedged.  Session 2 will be hung because it is unable to lock the table, since pg_repack's other session will hold a weak lock on the table. And nothing, including pg_repack, will be able to do anything with the table.

The solution is to make sure that nothing holds or even tries to obtain any strong long running locks on the table.

One useful thing is to use the check_postgres.pl monitor script to look for things like long running transactions and processes waiting for locks.

Or you can create a more customized test to look for this exact situation.

Most importantly, you need to be aware that problems can occur, and to protect against them happening in the first place.





5 comments:

  1. As an alternative you can use https://github.com/grayhemp/pgtoolkit, it's more like autovacuum for bloat (load adjustable, lock friendly, assumed to be used in cron monthly or weekly) when pg_repack is more like vacuum full.

    ReplyDelete
    Replies
    1. I'll take a look at it. But first I will have to understand how it works.

      Delete
  2. Andrew: what version of pg_repack are you using, and are you actually using pg_repack as you say in the title, or the old pg_reorg as your example command suggests?

    I tried your test on pg_repack head (the 1.2 branch of *pg_repack*, not necessarily the old pg_reorg project, should have the improved locking behavior). I verified that the 'LOCK TABLE' command in session 2 does hang around in 'waiting' state as you say. This is because pg_repack is intentionally holding an ACCESS SHARE lock on the original table while it is building its fresh version, so the ACCESS EXCLUSIVE-mode lock you attempt to acquire with 'LOCK TABLE foo;' must wait its turn. pg_repack insists on holding this ACCESS SHARE lock so that no miscreant may attempt DDL changes on the original table while it is working and possibly cause corruption when pg_repack goes to perform its filenode swap. However, neither session should be "wedged" -- session 1 is simply waiting to acquire its ACCESS EXCLUSIVE LOCK, which should happen gracefully when pg_repack finishes.

    ReplyDelete
  3. I'm using pg_repack, not pg_reorg. pg_repack won't in fact finish. It blocks on this statement, after copying the actual data:

    SELECT indexrelid, repack.repack_indexdef(indexrelid, indrelid, $2, FALSE) FROM pg_index WHERE indrelid = $1 AND indisvalid;

    which in turn appears to block on some internal function, I'm not sure yet which.

    So then we have session 2 waiting for pg_repack to finish, and session 1 running pg_repack with one of its sub-sessions waiting for session 2. That's pretty much deadlock.

    ReplyDelete
  4. For the benefit of any late visitors to this post, Andrew was kind enough to put together a fix for this issue which is in the latest versions of pg_repack (1.3.0+).

    https://github.com/reorg/pg_repack/commit/db4ec04cf25eb7cf77138fb256f318834963d0d0

    Thanks Andrew!

    ReplyDelete