How 2 SQL statements saved us over $150/mo in hourly backups
Every once in a while you get to pull a rabbit out of the hat. It’s always nice to celebrate those wins, and help others do the same. :)
At Noko, we were running into a problem: our backups were getting too large. We’d been getting along fine, but the writing was on the wall: Get this fixed ASAP or you’ll be paying for it very soon.
We brainstormed the problem a bit, I volunteered to get it fixed, and got to working.
Our initial solution was totally wrong (which tends to happen when optimizing something); but after I did the real work of researching, I ended up with a 2-line SQL script that:
- isn’t a pile of nightmarish hand-written SQL
- reduced the size of our backups by 70%
- was non-blocking, meaning users didn’t notice a thing
Honestly, I’m pretty damn proud of how it all turned out. It’s not often you get all 3 when doing major optimization.
I’m not so proud of trying to be too clever. I’ve learned the hard way that you waste a lot of time doing that, but apparently I needed to relearn that lesson. And I’m telling you all this because I want to help you avoid the same mistake.
Always do the research
When we first recognized the problem, we came up with the “logical” solution:
> Okay, let’s get rid of any leftover data from old deleted accounts. That should cut down the size pretty significantly.
Did this make “sense”? Sure. We’re a 7+ year old app; there are bound to be artifacts from deleted accounts, and those might have piled up over time.
So, I spent 2 days writing, testing, and fretting about how to find and delete those records. If I screwed up, paying customers might lose their data, and that’s terrifying.
But, I persevered! I wrote the script so I could test the hell out of it, then refactored like mad to improve the performance.
You can probably guess where this story’s going… when all was said and done, I regained a whopping, monumental 2 GB, which only shaved a few megabytes off the backup.
In hyper technical terms: I spent 2 days writing a script that did diddly squat to solve the problem.
The solution
I stepped away from the problem for a few days, then dove back in.
The first thing I did was figure out exactly where the database bloat was coming from. The PostgreSQL wiki has a fantastic article on figuring out disk usage, which includes the following script to find the 20 biggest relations in your database:
```sql
SELECT nspname || '.' || relname AS "relation",
  pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;
```
This netted some very interesting results:
| relation | size |
| --- | --- |
| pg_toast.pg_toast_1234567 | 11 GB |
| … | 3137 MB |
| … | 1702 MB |
| … | 406 MB |
| … | 294 MB |
| … | 294 MB |
| … | 210 MB |
So, the table that’s over 3 times larger than any other table in the database is a TOAST table, a part of Postgres’s internals. There’s a whole section of the PostgreSQL manual about TOAST tables, but the long and short of it is that a TOAST table is overflow storage: when a row’s long-form field values are too big to fit in the main table, Postgres moves them out-of-line into that table’s associated TOAST table.
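You can see how much of any given table’s footprint is TOAST data by sizing its main relation and its TOAST relation side by side. A minimal sketch (`my_table` is a placeholder; substitute your own table name):

```sql
-- Compare a table's main heap size to its TOAST size.
-- pg_class.reltoastrelid points at the table's associated TOAST table
-- (0 if the table has never needed one).
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS main_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'my_table';
```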
With a query adapted from a Stack Overflow question, I could figure out which table was the culprit:
```sql
SELECT t1.oid, t1.relname, t1.relkind, t2.relkind, t2.relpages, t2.reltuples
FROM pg_class t1
INNER JOIN pg_class t2 ON t1.reltoastrelid = t2.oid
WHERE t1.relkind = 'r'
  AND t2.relkind = 't'
  AND t1.oid = 1234567;
```
| oid | relname | … |
| --- | --- | --- |
| 1234567 | the_spanish_inquisition | … |
What?! No one expected The Spanish Inquisition!
It turns out there was a tiny bit of code that meant The Spanish Inquisition was recording all its actions. And I mean all of them. Useful for maybe 20% of the things they do, but it was clearly taking up way too much space (as inquisitions tend to do).
And why did no one expect The Spanish Inquisition? Because almost all of its records were hidden away! (I’m surprised at how well this bit’s playing out)
The junk records, a whopping 80% of the data, were orphaned: they weren’t tied to anything else in the app. We never noticed because nothing was missing. Like an iceberg, the bulk of the problem was hidden.
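Before running a destructive DELETE on a table like this, it’s worth sanity-checking the scale first. A count using the same columns as the cleanup query confirms (non-destructively) how many rows are actually orphaned:

```sql
-- Sanity check: how many rows would the cleanup delete?
SELECT count(*) AS orphaned_rows
FROM the_spanish_inquisition
WHERE accused_id IS NULL AND accuser_id IS NULL;
```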
Now we had a straightforward fix:
```sql
-- delete all the orphaned records
DELETE FROM the_spanish_inquisition
WHERE accused_id IS NULL AND accuser_id IS NULL;

-- do a non-blocking cleanup of the table
VACUUM ANALYZE the_spanish_inquisition;
```
Boom. I saved the company $150/mo in storage costs and reduced the backup size by 70% with a zero-downtime fix.
The takeaway
While it’s awesome to come up with fixes like this, I hate that I had to relearn some really important lessons:
- Do your research before you even think about writing a line of code. I wasted 2 days because I didn’t do 10 minutes of research. No matter how busy you are, you should always understand the problem before you try to fix it.
- Don’t just take someone’s word for an optimization; back it up with data. While our initial idea would have saved some space, we were clearly barking up the wrong tree.
- There aren’t any bonus points for picking the hard or risky route and diving in. Make sure you know you’re doing the safest, easiest thing that will have the highest impact.
- Be willing to throw away your work. Once I realized the script was the wrong approach, I scrapped it so I could do the job right.