My ugly mug

Hi, I'm Thomas Cannon!

I work on Freckle with Amy Hoy and Thomas Fuchs, and I teach people how to add an extra layer of polish to their work. Previously I worked for BMW and Clemson University, and have a bunch of smaller apps I tinker with on the side. Follow me on Twitter!

How 2 SQL statements saved us over $150/mo in hourly backups

Every once in a while you get to pull a rabbit out of the hat. It’s always nice to celebrate those wins, and help others do the same. :)

At Freckle, we were running into a problem: our backups were getting too large. We’d been getting along fine, but the writing was on the wall: Get this fixed ASAP or you’ll be paying for it very soon.

We brainstormed the problem a bit, I volunteered to get it fixed, and got to working.

Our initial solution was totally wrong (which tends to happen when optimizing something); but after I did the real work of researching, I ended up with a 2-line SQL script that:

Honestly, I’m pretty damn proud with how it all turned out. It’s not often you get all 3 when doing major optimization.

I’m not so proud of trying to be being too clever. I’ve learned the hard way that you waste a lot of time doing that, but apparently I needed to relearn that lesson. And I’m telling you all this because I want to help you avoid the same mistake.

Always do the research

When we first recognized the problem, we came up with the “logical” solution:

Okay, let’s get rid of any leftover data from old deleted accounts. That should cut down the size pretty significantly.

Did this make “sense”? Sure. We’re a 7+ year old app, there’s bound to be artifacts from deleted accounts, and that might have piled up over time.

So, I spent 2 days writing, testing, and fretting about how to find and delete those records. If I screwed up, paying customers might lose their data, and that’s terrifying.

But, I persevered! I wrote the script so I could test the hell out of it, then refactored like mad to improve the performance.

You can probably guess where this story’s going… when it was all said and done, I regained a whopping, monumental 2 GB, which would only shave a few megabytes off the backup.

In hyper technical terms: I spent 2 days writing a script that did diddly squat to solve the problem.

The solution

I stepped away from the problem for a few days, then dove back in.

The first thing I did was figure out exactly what was the database bloat. The PostgreSQL wiki has a fantastic article of figuring out disk usage, which has the following script to find the 20 biggest relations in your database:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;

Which netted some very interesting results

relation size
pg_toast.pg_toast_1234567 11 GB
3137 MB
1702 MB
406 MB
294 MB
294 MB
210 MB

So, the pg_toast table that’s over 3 times larger than any other table in the database is a TOAST table, a part of Postgres’s internals. There’s a whole section of the PostgreSQL manual about TOAST tables, but the long and short of it is that TOAST tables are paging tables for storing long-form fields in a particular table.

With a query from the following Stack Overflow question, I could figure out which table was the culprit:

select t1.oid, t1.relname, t1.relkind, t2.relkind, t2.relpages, t2.reltuples
from pg_class t1
inner join pg_class t2
on t1.reltoastrelid = t2.oid
where t1.relkind = 'r'
  and t2.relkind = 't'
  and t1.oid = 1234567;
oid relname
1234567 the_spanish_inquisition

What?! No one expected The Spanish Inquisition!

It turns out there was a tiny bit of code that meant The Spanish Inquisition was recording all its actions. And I mean all of them. Useful for maybe 20% of the things they do, but it was clearly taking up way too much space (as inquisitions tend to do).

And why did no one expect The Spanish Inquisition? Because almost all of its records were hidden away! (I’m surprised at how well this bit’s playing out)

The junk records, a whopping 80% of the data, was orphaned. It wasn’t tied to anything else in the app. We never noticed it because nothing was missing. Like an iceberg, the bulk of the problem was hidden.

Now we had a straightforward fix

-- delete all the orphaned records
DELETE FROM the_spanish_inquisition WHERE accused_id IS NULL AND accuser_id IS NULL
-- do a non-blocking cleanup of the table
VACUUM ANALYZE the_spanish_inquisition

Boom. I saved the company $150/mo in storage costs and reduced the backup size by 70% with a zero-downtime fix.

The takeaway

While it’s awesome to come up with fixes like this, I hate that I had to relearn some really important lessons: