
Hi, I'm Thomas Cannon!

I work on Freckle with Amy Hoy and Thomas Fuchs, and I teach people how to add an extra layer of polish to their work. Previously I worked for BMW and Clemson University, and have a bunch of smaller apps I tinker with on the side. Follow me on Twitter!

How 2 SQL statements saved us over $150/mo in hourly backups

Every once in a while you get to pull a rabbit out of the hat. It’s always nice to celebrate those wins, and help others do the same. :)

At Freckle, we were running into a problem: our backups were getting too large. We’d been getting along fine, but the writing was on the wall: Get this fixed ASAP or you’ll be paying for it very soon.

We brainstormed the problem a bit, I volunteered to get it fixed, and got to working.

Our initial solution was totally wrong (which tends to happen when you’re optimizing something), but after I did the real work of researching the problem, I ended up with a 2-line SQL script that:

  • isn’t a pile of nightmarish hand-written SQL
  • reduced the size of our backups by 70%
  • was non-blocking, meaning users didn’t notice a thing

Honestly, I’m pretty damn proud of how it all turned out. It’s not often you get all 3 when doing a major optimization.

I’m not so proud of trying to be too clever. I’ve learned the hard way that you waste a lot of time doing that, but apparently I needed to relearn that lesson. And I’m telling you all this because I want to help you avoid the same mistake.

Always do the research

When we first recognized the problem, we came up with the “logical” solution:

Okay, let’s get rid of any leftover data from old deleted accounts. That should cut down the size pretty significantly.

Did this make “sense”? Sure. We’re a 7+ year old app; there are bound to be artifacts from deleted accounts, and those might have piled up over time.

So, I spent 2 days writing, testing, and fretting about how to find and delete those records. If I screwed up, paying customers might lose their data, and that’s terrifying.

But, I persevered! I wrote the script so I could test the hell out of it, then refactored like mad to improve the performance.

You can probably guess where this story’s going… when it was all said and done, I regained a whopping, monumental 2 GB, which would only shave a few megabytes off the backup.

In hyper technical terms: I spent 2 days writing a script that did diddly squat to solve the problem.

The solution

I stepped away from the problem for a few days, then dove back in.

The first thing I did was figure out exactly what was causing the database bloat. The PostgreSQL wiki has a fantastic article on figuring out disk usage, which includes the following script to find the 20 biggest relations in your database:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_relation_size(C.oid)) AS "size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  ORDER BY pg_relation_size(C.oid) DESC
  LIMIT 20;

Which netted some very interesting results:

relation                      size
pg_toast.pg_toast_1234567     11 GB
                              3137 MB
                              1702 MB
                              406 MB
                              294 MB
                              294 MB
                              210 MB

So the relation that’s over 3 times larger than anything else in the database is a TOAST table, part of Postgres’s internals. There’s a whole section of the PostgreSQL manual about TOAST tables, but the long and short of it is that a TOAST table is an overflow table where Postgres stores the oversized field values for a particular table.
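
(If you already suspect a particular table, you can also go the other direction and look up its TOAST relation directly. This is just a sketch; the table name is a stand-in, not one of ours.)

-- Hypothetical: find the TOAST table (and its size) for a table you suspect
SELECT reltoastrelid::regclass AS toast_table,
       pg_size_pretty(pg_relation_size(reltoastrelid)) AS toast_size
  FROM pg_class
 WHERE relname = 'some_big_table';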

With a query adapted from a Stack Overflow question, I could figure out which table was the culprit:

select t1.oid, t1.relname, t1.relkind, t2.relkind, t2.relpages, t2.reltuples
from pg_class t1
inner join pg_class t2
on t1.reltoastrelid = t2.oid
where t1.relkind = 'r'
  and t2.relkind = 't'
  and t1.oid = 1234567;

oid       relname
1234567   the_spanish_inquisition

What?! No one expected The Spanish Inquisition!

It turns out there was a tiny bit of code that meant The Spanish Inquisition was recording all its actions. And I mean all of them. Useful for maybe 20% of the things they do, but it was clearly taking up way too much space (as inquisitions tend to do).

And why did no one expect The Spanish Inquisition? Because almost all of its records were hidden away! (I’m surprised at how well this bit’s playing out)

The junk records, a whopping 80% of the data, were orphaned. They weren’t tied to anything else in the app. We never noticed them because nothing was missing. Like an iceberg, the bulk of the problem was hidden.
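
To back up a number like that 80%, a quick count does the trick; here’s roughly the kind of sanity check you’d run before deleting anything (using the same stand-in table and column names as the fix below):

-- How many rows are orphaned?
SELECT count(*) FROM the_spanish_inquisition
 WHERE accused_id IS NULL AND accuser_id IS NULL;

-- ...compared to the total
SELECT count(*) FROM the_spanish_inquisition;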

Now we had a straightforward fix:

-- delete all the orphaned records
DELETE FROM the_spanish_inquisition WHERE accused_id IS NULL AND accuser_id IS NULL;
-- do a non-blocking cleanup of the table
VACUUM ANALYZE the_spanish_inquisition;

Boom. I saved the company $150/mo in storage costs and reduced the backup size by 70% with a zero-downtime fix.

The takeaway

While it’s awesome to come up with fixes like this, I hate that I had to relearn some really important lessons:

  • Do your research before you even think about writing a line of code. I wasted 2 days because I didn’t do 10 minutes of research. No matter how busy you are, you should always understand the problem before you try to fix it.
  • Don’t just take someone’s word for an optimization; back it up with data. While our initial idea would have saved some space, we were clearly barking up the wrong tree.
  • There aren’t any bonus points for picking the hard or risky route and diving in. Make sure you’re doing the safest, easiest thing that will have the highest impact.
  • Be willing to throw work away. Once I realized the script was more hassle than it was worth, I scrapped it so I could do the job right.

Pagination Perils: The Case of The Inconsistent Ordering

A few weeks ago we got a really interesting bug report from one of our API clients.

Hey, we’ve been playing around with the Invoice API and we’re running into this bug where the results vary depending on the number of items we request per page. Sometimes the order gets switched around, and it looks like an item is missing?

This was definitely weird, and I jumped on it right away. Missing records and non-deterministic orders are no joke, especially with invoicing.

Digging into the code, I found how the app orders the API results:

ORDER BY invoice_date DESC

Which makes sense: the invoices are sorted by the date they were issued. But here’s the tricky thing about invoices: you can send out multiple invoices in a single day. In fact, that’s what’s generally considered “a damn good day”.

What happens when multiple items fit into the same sorting block?

Think about how an automatic coin sorter works. It sends the coin down a slide, trying to guide the coin into progressively bigger slots. Once it hits the matching slot, it falls into the coin sleeve.

This works because each coin is a different size, so what would happen if two different coins were the same size? You’d end up with a sleeve full of mixed coins, which isn’t very helpful.

Sorting algorithms work in a similar way: they take the data and try to fit it into the right slot, and at the end you get an ordered list.

What happened here is that I accidentally let different coins be the same size. I’d considered the ordering of invoices over time, but I’d forgotten about invoices for the same date!

Ensuring deterministic sorting on overlapping records

The best way to fix this bug is to add another sort condition that acts as a tiebreaker and makes the ordering unique. In this case:

ORDER BY invoice_date DESC, id DESC

This guarantees the newest invoices will be returned first, even if there are multiple invoices issued on the same date.
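
To make that concrete, here’s roughly what a paginated API query looks like with the tiebreaker in place (the table name and page size are just for illustration):

SELECT *
  FROM invoices
 ORDER BY invoice_date DESC, id DESC
 LIMIT 30 OFFSET 30; -- page 2, at 30 invoices per page

Since (invoice_date, id) is unique for every row, every run of this query slices the list at exactly the same points.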

What about the missing record?

The part of the bug report that tripped me up the most was the fact that a record was missing. How could that be possible? It’s not like the query changed conditions between requests!

Because of the non-deterministic ordering, the overlapping records came back in a slightly different order each time the query ran. The pagination logic grabs a slice of an ever-changing pie, so when the page break falls on a date with multiple invoices, a record can slip through the cracks. Say invoices A, B, and C share a date and the page size is 2: one run returns A, B | C and the next returns A, C | B. Fetch page 1 during the first run and page 2 during the second, and you get A, B, then B again; C never shows up. Voila, missing record!

How did you miss this?

Since it’s a bit of an edge case for a page break to land on a date with multiple invoices, I’d never run into it in my manual testing.

Running into this bug taught me a few lessons:

  1. Always think about the edge cases, even on something as “simple” as chronological ordering. Think about what your data looks like in the real world.
  2. It’s crucial to understand the underlying technology you’re using. Knowing SQL and how SQL ordering works helped me diagnose this problem and fix it quickly, and now I know how to prevent it from happening in the future.
  3. Be prepared for rough patches and bugs. API v2 is in beta because there’s stuff we overlooked and need to fix. Likewise, you gotta be prepared to diagnose and fix bugs as quickly as possible.

The takeaway

Whenever you’re sorting records by a column that could have an overlap between records (same name, email, date, dollar amount, etc.), always have a secondary sort to make the ordering deterministic.

Storing API credentials for a Rails app

Hey there!

If you’re just starting out with an API, you’ve probably tossed your API credentials somewhere as a class constant, like this:

class FreckleClient
  PERSONAL_ACCESS_TOKEN = "abcd12345"
  URL = "https://api.letsfreckle.com/v2"
  include HTTParty

  def get_entries
    self.class.get("#{URL}/entries", :headers => {
      "X-FreckleToken" => PERSONAL_ACCESS_TOKEN
    })
  end
end

This is completely valid for quickly hacking and trying things out. But once you’ve got a handle on APIs, you’ll want to start securely storing API keys for a few reasons:

  • Security: If your keys get checked into source control accidentally, they’re exposed and no longer a secret. Even if it’s a private repo, you’ll want to keep your keys closely guarded, since they’re the keys to the kingdom.
  • Flexibility: Hard-coded values in a codebase are a pain to maintain. Let’s say that you need to regenerate your API keys. You’d have to dig through the code for every mention of the API key. It would be so much easier if your credentials were decoupled from your app and could be updated independently of the codebase.
  • Isolation: It’s standard practice to keep your production API keys away from development. You might use different accounts, the service might give you different keys, or you might have registered the same app for each environment (e.g: “MyApp (Development)”). In any case, the API keys always need to be separate and only load for the right environment.
  • Automated Deployment: As automated deployment becomes even more popular, you need a solution that seamlessly fits into your deployment scripts without mucking up the app’s repository with unrelated code.

The sanest approach that solves all these problems is to store the credentials in environment-specific YAML files and load them into the app as environment variables. Thomas Fuchs introduced me to this pattern, and I’ve used it in all my apps since. The version below is for Rails 3+; let me know if you need a Rails 2.3 LTS version!

The code

# config/application.rb

module YourApp
  class Application < Rails::Application
    ...

    # Try loading a YAML file at `./config/env.[environment].yml`, if it exists
    # Kudos to Thomas Fuchs (http://mir.aculo.us) for the initial implementation
    def load_env_file(environment = nil)
      path = Rails.root.join("config", "env#{environment.nil? ? '' : '.'+environment}.yml")
      return unless File.exist? path
      config = YAML.load(ERB.new(File.new(path).read).result)
      config.each { |key, value| ENV[key.to_s] = value.to_s }
    end

    # Load environment variables. config/env.yml contains defaults which are
    # suitable for development. (This file is optional).
    load_env_file

    # Now look for custom environment variables, stored in env.[environment].yml
    # For development, this file is not checked into source control, so feel
    # free to tweak for your local development setup. Any values defined here
    # overwrite the defaults in `env.yml`
    load_env_file(Rails.env)
  end
end

Now to add the YAML files that store our credentials!

# config/env.yml

# Just a test value to show overwriting
a: "hello"
# config/env.development.yml

# Just a test value to show overwriting
a: "derp"

freckle_personal_access_token: "abcd12345"

Now when we load the app, we’ll see these values loaded as environment variables!

$ rails c
irb(main):002:0> ENV['a']
=> "derp"
irb(main):003:0> ENV['freckle_personal_access_token']
=> "abcd12345"

Next up is to make sure the environment-specific YAML files are never stored in source control:

# .gitignore

# Ignore the environment-specific YAML files
/config/env.*.yml

Finally, we change our FreckleClient class to use the new environment variable and we’re good to go!

class FreckleClient
  URL = "https://api.letsfreckle.com/v2"
  include HTTParty

  def get_entries
    self.class.get("#{URL}/entries", :headers => {
      "X-FreckleToken" => ENV['freckle_personal_access_token']
    })
  end
end

Benefits

This approach has a number of benefits for development, security, and automated deployment:

  • Each developer can set their own environment variables for the specific features they work on.
  • A default set of environment variables exists with env.yml, which can even be used as an example configuration file with dummy values.
  • Environment variables can be set as needed, and new environments can be painlessly set up. For example, env.travis.yml could store all your environment variables for running tests on Travis (there’s a quick sketch after this list).
  • Automated deployment scripts can generate their own env.*.yml files for the target environment. The app will seamlessly use these new changes and the app’s repo is not mucked up with automated deployment code.
  • Since production environment variables don’t need to be stored in the app’s repo, there’s less chance of an accidental commit.
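
For example, a hypothetical env.travis.yml might look like this (the keys and values are made up; use whatever your test suite actually needs):

# config/env.travis.yml

# Test-only values for CI: same keys as development, different values
a: "travis"
freckle_personal_access_token: "test-only-token"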

Other Notes

  1. Each developer creates their own development API keys whenever possible. This stops devs from stepping on each other’s toes while working on a new feature or bugfixing.
  2. Production API keys are a closely guarded secret. A strict “need-to-know” basis is the best approach here.
  3. Keep env.yml up-to-date as a template to make onboarding new devs easier. Seeing all the environment variables in one place makes it easier to get started.