Making NeetoCal feel fast

Notes from a year of chasing timeouts on a calendar product that handles millions of records.

One recurring Google Calendar event was scheduled out to December 23rd, 2699.

That was one of the stranger things I found while spending a year making NeetoCal faster.

NeetoCal is a Calendly-style scheduling product. People share a public link and let others pick a time on their calendar. When a visitor clicks "show me available slots," they expect that page to load quickly. That moment is also where the visitor decides whether to bother with you at all.

For a stretch in 2025, that page would sometimes just hang for thirty seconds and then die. Our error tracker would log a Rack::Timeout::RequestTimeoutError, which is the app's way of saying "this request took longer than thirty seconds, I'm killing it before it starts hurting other users." For the customer on the other end, the page was simply broken.

This is the story of how I went after that, the patterns I kept relying on, and a 28.6 million row data leak I tripped over along the way. I don't expect you to know PostgreSQL or Rails before reading. I'll explain things as I go.

Where slowness hides

Before fixing anything, you have to know what's slow. We had three sources of truth.

Honeybadger catches errors. Every time the app threw a 30-second timeout, we got a stack trace pointing at the line that was running when the time ran out.

Datadog is an application performance monitor. It watches a request from start to finish and shows you a tree: the controller called this service, the service ran fourteen SQL queries, one of them took nine seconds, the next took six. You see at a glance where the time went.

EXPLAIN ANALYZE is PostgreSQL's own confession. You hand it your query and it runs the query and tells you exactly how it found the rows. Did it scan the whole table? Did it use an index? How many rows did it touch? How many of those did it actually need?

The flow was always the same. Find a timed-out request in Honeybadger. Pull up the same trace in Datadog. Spot the slow query. Run EXPLAIN ANALYZE against a copy of production data and figure out why.
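
If you've never driven that last step, here's roughly what it looks like from a Rails console. The model name is illustrative; the point is that ActiveRecord hands you the SQL and Postgres hands you the plan.

# Take the relation you suspect, pull out its SQL, and ask Postgres to run it
# and report exactly how it found the rows.
relation = GoogleCalendarEvent.where(calendar_id: 42)

plan = ActiveRecord::Base.connection.execute("EXPLAIN ANALYZE #{relation.to_sql}")
plan.each { |row| puts row["QUERY PLAN"] }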

The biggest single win

Here's the part of the product where the worst slowness was happening.

NeetoCal asks Google Calendar (and Outlook, iCloud, and a couple of others) to send us your events. We store those events in our database. When someone tries to book a slot with you, we have to ask: "Does this person already have an event between 2pm and 3pm on Tuesday?" That's a conflict check.

A conflict check is just a query. Find me all events on this calendar that overlap with this time range. Simplified, ours looked like this:

SELECT * FROM google_calendar_events
WHERE calendar_id = 42
AND COALESCE(start_time, original_start_time) < '2025-08-22 15:00'
AND COALESCE(end_time,   original_end_time)   > '2025-08-22 14:00'

COALESCE is "use the first column, but fall back to the second one if the first is null." We had it because some events had nullable start and end times, left over from earlier sync code.

At the point where I was fixing this path, our google_calendar_events table had about four million rows. Each conflict check was costing us 281,139 cost units (Postgres's own estimate of work) and around 14 milliseconds with parallel scanning turned on. For a single check that sounds fine. Multiply it by every available slot on every booking page and the bill gets ugly fast.

Why it was slow

If you're a beginner, the one thing to know about a database index is this. An index is a separate, sorted copy of one or more columns from your table. When the database needs to find rows by start_time = X, it has two options. Either scan all four million rows and check each one. Or look in the index, which is sorted, and find the value the same way you'd find a word in a dictionary.

The catch is that a normal index covers column values, not expressions you write at query time. The moment you write WHERE COALESCE(start_time, original_start_time) < ..., the database has to compute COALESCE(...) before it can decide whether a row matches. Unless you build a matching expression index, you're back to scanning the entire table.

That's why every conflict check was reading the whole table.
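
For completeness: the other way out would have been a matching expression index, an index built on the computed value itself. A rough sketch of that migration, which we did not ship (the next section explains what we did instead), and which assumes a Rails version that accepts a SQL string in add_index:

class AddCoalescedStartIndexToGoogleCalendarEvents < ActiveRecord::Migration[7.0]
  def change
    # Index the same expression the WHERE clause uses, so Postgres can match them up.
    add_index :google_calendar_events,
              "COALESCE(start_time, original_start_time)",
              name: :index_gcal_events_on_coalesced_start
  end
end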

The fix

Two parts.

First, get rid of the reason COALESCE was there. We had a small population of rows with nil start and end times left over from older sync logic. I cleaned them up, then made those fields non-null going forward. Now the query could compare start_time and end_time directly.

Second, switch to a Postgres feature called a range. Since these timestamps were stored without timezone, the specific type was tsrange. A range is a value that represents an interval like "2pm to 3pm." Postgres has a built-in operator for "do these two ranges overlap?" written as &&. The query became:

SELECT * FROM google_calendar_events
WHERE calendar_id = 42
AND tsrange(start_time, end_time) && tsrange('2025-08-22 14:00', '2025-08-22 15:00')

That reads almost like English. "Find events on calendar 42 whose interval overlaps the interval 2pm to 3pm."

Then I added an index designed for exactly this. Conceptually, it covered the calendar and the time range. GIST is a kind of index Postgres has specifically for ranges and other "does this overlap with that?" data. It's built to answer overlap queries directly. I also made it a partial index, one that only covers rows matching a condition: only events that were confirmed and actually counted as busy made it in. About a third of the rows were just clutter from the index's perspective, so they didn't need to be there.

The result, on the same query against the same data:

  • Query cost: 281,139 → 131
  • Execution time: 13.9 ms → 1.5 ms

A 99.95% reduction in cost. The index lookup was direct: it knew which rows could possibly overlap and went straight to them.

A small but important detail for anyone doing this in production. Building this index on a 3 GB table while live traffic was hitting it was scary. I used CREATE INDEX CONCURRENTLY, which builds the index without locking writes, and bumped two Postgres settings (maintenance_work_mem and max_parallel_maintenance_workers) inside the session so the build wouldn't take all night. The run took the better part of an hour. I picked a Sunday.
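
Put together, the migration looked roughly like this. The index name, the status and busy columns, and the exact setting values are illustrative; the parts that matter are the GIST index over the range expression, the partial WHERE, and the concurrent build outside a transaction. (Putting a plain column like calendar_id into a GIST index alongside a range also needs the btree_gist extension.)

class AddConflictCheckIndexToGoogleCalendarEvents < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!   # CREATE INDEX CONCURRENTLY can't run inside a transaction

  def up
    enable_extension "btree_gist"   # lets GIST index scalar columns next to the range

    # Session-level knobs so the build gets more memory and more parallel workers.
    execute "SET maintenance_work_mem = '2GB'"
    execute "SET max_parallel_maintenance_workers = 4"

    execute <<~SQL
      CREATE INDEX CONCURRENTLY index_gcal_events_on_calendar_and_busy_range
      ON google_calendar_events
      USING GIST (calendar_id, tsrange(start_time, end_time))
      WHERE status = 'confirmed' AND busy = true
    SQL
  end

  def down
    execute "DROP INDEX CONCURRENTLY IF EXISTS index_gcal_events_on_calendar_and_busy_range"
  end
end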

Pattern: don't make Ruby do work the database can do

The next pile of fixes had a shape in common. We were loading lots of rows into Ruby, looping over them, and computing things that the database could have computed at query time.

The clearest example was the slot-generation logic that powered the public booking page. For every candidate slot on the page (one customer's page had 291 of them across a week), the validator was doing a linear scan over 1,419 in-range bookings, comparing TimeWithZone objects.

TimeWithZone is Rails's timezone-aware time class (ActiveSupport::TimeWithZone). Comparing two of them over and over is not free. Each comparison carries object overhead, timezone normalization, and date math before it can say "yes" or "no." For a customer with 35,000 bookings, this turned into 412,929 timezone comparisons per request. The whole request took 13.7 seconds. The browser hung. The user gave up.

The fix was twofold. First, precompute. Convert all booking start and end times into plain integers (seconds since 1970, in UTC) once, when we load them. Then comparing two times is just comparing two integers. Second, binary search. The array is sorted by start time, so binary search lets us jump to the part of the list where overlaps could exist instead of starting at the top every time. From there, the remaining checks are plain integer comparisons: starts before the slot ends, ends after the slot starts. On the larger benchmark, this meant about thirteen integer comparisons per slot instead of scanning the whole list.
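
Here's the shape of that, as a sketch rather than the production code. The names are made up, and bounding the scan with the longest booking's duration is one way to do it, not necessarily the exact one we shipped.

# Convert once per request: each booking becomes a pair of epoch-second integers,
# sorted by start time.
def build_intervals(bookings)
  bookings.map { |b| [b.start_time.to_i, b.end_time.to_i] }
          .sort_by(&:first)
end

def overlaps_any?(intervals, max_duration, slot_start, slot_end)
  # Binary search for the first booking that could still overlap: anything
  # starting earlier than (slot_start - max_duration) has already ended.
  from = intervals.bsearch_index { |(start, _)| start > slot_start - max_duration } || intervals.size

  intervals[from..].each do |start, finish|
    break if start >= slot_end            # sorted by start, so nothing later can overlap
    return true if finish > slot_start    # starts before the slot ends, ends after it starts
  end
  false
end

# Per request: build once, reuse for every candidate slot.
#   intervals    = build_intervals(in_range_bookings)
#   max_duration = intervals.map { |s, e| e - s }.max.to_i
#   busy         = overlaps_any?(intervals, max_duration, slot.start_time.to_i, slot.end_time.to_i)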

After:

  • Comparisons per request: 412,929 → 2,104
  • Validation time on the larger benchmark: 12,171 ms → 5 ms
  • End-to-end page time: 13.7 s → sub-second

The lesson generalises. A slow Ruby loop is rarely a "make Ruby faster" problem. It's a "do less work, in better data structures" problem. Often that means moving work into the database.

A few smaller versions of the same pattern showed up everywhere.

The total-counts box on the bookings list page used to load all the bookings into memory and call .count per status. The fix was a single SQL query using COUNT(*) FILTER (WHERE status = ...) syntax. Postgres returned all the counts in one pass. Sub-600 ms for the customer with 250,000 bookings.
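
In ActiveRecord terms, that looks something like the sketch below. The model, the status values, and the scoping are illustrative; the shape is what matters: one query, one pass, no booking objects built in Ruby.

sql = Booking.where(organization_id: org.id).select(<<~SQL.squish).to_sql
  COUNT(*) FILTER (WHERE status = 'confirmed') AS confirmed_count,
  COUNT(*) FILTER (WHERE status = 'cancelled') AS cancelled_count,
  COUNT(*)                                     AS total_count
SQL

counts = ActiveRecord::Base.connection.select_one(sql)
counts["confirmed_count"]   # plain integers back from Postgres, nothing loaded into memory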

A polling endpoint that ran every two seconds was eager-loading eleven nested associations on each call, just to read a single Redis cache key at the end. The endpoint didn't actually need any of those associations. Strip them, and a chunk of database load disappears.

There was even a sneaky variant. We had a place where preloaded data was being filtered through a Rails scope, which silently reran the database query because Rails scopes don't operate on already-loaded arrays. Replacing the scope with a plain Ruby filter (the data was already in memory, after all) was a one-line change with a measurable speedup. Easy to miss in code review.
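
The before-and-after, with illustrative names. The association is already loaded, but chaining a scope onto it still goes back to the database; an Enumerable call over the loaded records does not.

user = User.includes(:bookings).find(user_id)   # bookings preloaded here

user.bookings.confirmed              # scope: fires a fresh SELECT, ignoring the preload
user.bookings.select(&:confirmed?)   # plain Ruby over the records already in memory: no query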

The 28.6 million row data leak

This one was the most fun.

Later, during the data-retention side of the same problem, our google_calendar_events table had grown to 27 GB. The team noticed because pganalyze (a database monitoring service) flagged it. I opened the investigation issue with one line: "Like I want to try and understand how it's right now at 27GB. That's a massive amount of data."

I pulled the data composition. Out of 28.6 million rows in that table, the shape looked like this. Some of these buckets overlapped, so the numbers don't add up neatly, but they told us where the waste was:

  • 2,548 rows (0.01%) were actual NeetoCal bookings.
  • 18.5 million rows were events for calendars where the user had explicitly turned off "check for meeting conflicts" in their settings. We were syncing them anyway. They served zero purpose.
  • 17.2 million rows were events scheduled in the future. The furthest single event was scheduled for December 23rd, 2699.

Why the year 2699? Recurring events. If you create a Google Calendar event with no end date ("Daily standup, every weekday, forever"), Google will happily generate instances of that recurrence as far into the future as you ask. We were asking for far too many. A single recurrence rule could spawn thirty thousand instances. A few unlucky orgs had recurrence rules that, between them, came to over a million events. One org alone had 1,154,485.

The fix wasn't a single change. It was a chain of them spread over about a month.

Stop the inflow. Add guards so we don't sync events for calendars the user has disabled. Cap how far into the future we'll generate recurring instances. If we're not going to use those events, don't fetch them.

Clean up what's already there. A pruning job that walks the table in chunks and removes obvious garbage: events on disabled calendars, events past a sane future cutoff, ghost events left behind after a Google Calendar move.

One-time global pass. A one-shot job that runs the pruning logic across every org so the table size drops in one motion.

The cleanup itself was fiddly. The first version was a single big job, but the table was too large to walk in one pass without holding locks too long. I split it into a fan-out of smaller jobs running in parallel. Then we hit deadlocks because the parallel jobs were racing for the same rows. I pulled it back to serial. Production engineering rarely flatters your first design.
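
The serial version that finally stuck had roughly this shape. The model, associations, column names, and cutoff are illustrative, and the real job handles more cases than these two.

class PruneGoogleCalendarEventsJob < ApplicationJob
  FUTURE_CUTOFF = 2.years   # how far ahead we still care about synced events

  def perform(organization)
    scope = organization.google_calendar_events

    # Events past the sane future cutoff, deleted in small chunks so no single
    # DELETE statement holds locks for long.
    scope.where("start_time > ?", FUTURE_CUTOFF.from_now)
         .in_batches(of: 10_000)
         .delete_all

    # Events for calendars where the user turned conflict checking off.
    scope.where(calendar_id: organization.calendars.where(check_conflicts: false).select(:id))
         .in_batches(of: 10_000)
         .delete_all
  end
end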

The end state: the table became a small fraction of its previous size, the inflow was capped so it can't run away again, and one customer's worst-affected user went from seeing 18 phantom events on their day down to 11 real ones. That last bit was the part that mattered. The 27 GB figure is a vanity number. "I see meetings in my calendar that don't exist anymore" is the part the user actually feels.

What I keep reaching for

A handful of patterns came up over and over.

  1. The slow query is rarely the slow query. A request that times out at 30 seconds usually has one or two queries doing most of the work. Find them in your application monitor. Run EXPLAIN ANALYZE on a production-shaped copy. Read the plan.
  2. Don't wrap indexed columns in functions. WHERE COALESCE(start_time, ...) defeats your index. So does WHERE LOWER(email) = ... unless you have a matching expression index. Plain column references are what indexes love.
  3. Range overlap is its own problem. Postgres ranges and GIST indexes are the right tools when "things that overlap" is what you're looking for. Calendars, bookings, sessions, anything with a start and an end. Once you know about it, you start spotting it everywhere.
  4. Aggregations belong in SQL. If your data is shaped like "group these rows by something and count," that's a SQL query, not a Ruby loop. COUNT(*) FILTER (WHERE ...) is your friend.
  5. exists? is not present?. present? loads everything to ask "is this empty?"; exists? runs SELECT 1 ... LIMIT 1. They look the same to you. They don't to the database. (There's a sketch after this list.)
  6. Polling endpoints are special. If something is called every two seconds, every association you eager-load is paid thirty times a minute per user. Strip ruthlessly.
  7. Look at your data composition. Sometimes the bug isn't in the query. It's that 99.99% of the rows shouldn't be there. A bloated table makes every query worse.
  8. Build production indexes concurrently. On a Sunday.
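
The exists?/present? difference from point 5, spelled out with an illustrative association:

user.bookings.present?   # loads every booking row into memory, then asks the array
user.bookings.exists?    # SELECT 1 AS one FROM "bookings" WHERE ... LIMIT 1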

Final thoughts

None of this is glamorous work. There's no new-feature ribbon-cutting at the end of a perf sprint. The win is that, on one of NeetoCal's largest customer accounts, the public booking page stopped timing out and now loads more than 90% faster. That's the difference between a visitor giving up and a booking actually happening.

But the muscle you build doing it is hard to fake. You learn to read query plans. You learn to feel the difference between "this query is slow" and "this whole code path is shaped wrong." You learn to be patient with the part where the table has 28.6 million rows, a lot of which shouldn't be there, and someone has to figure out why.

If you're a beginner and any of this felt like it was reaching past you, the one thing to take away is this. When something is slow, the first move is measurement. Find out what's actually slow before you guess. Application traces from your monitor of choice, EXPLAIN ANALYZE for queries, real production-shaped data to test against.

Once you know what's slow, the fixes get a lot more obvious.