TL;DR: We had a major outage. We wanted to increase stability. The obvious way is to increase process & testing, but a better way is to ship more frequently.
The debate: To write tests or not to write tests?
At Remotion, whether or not to write tests is left up to the individual engineer’s discretion. Basically: “Will tests help you ship this feature faster?” There are some obvious places where tests help (e.g. server functions) and obvious places where they’re hard (e.g. integrations). But there’s a wide in-between space, and the team frequently debates the right level of testing.
This resurfaced recently when we had a major user-facing outage. Our unexpected conclusion: instead of increasing stability by writing more tests, increase stability by shipping more frequently.
Our mistake: Shipping multiple changes to a complex system, all at once
We recently prepared for a product launch with large and risky changes. In our rush to release, commits were coming in fast and the main branch was never quite stable enough to deploy. We ended up deploying all the changes at once, under time pressure near the end of our release window.
During testing in our staging environment, we’d noticed some issues: certain app interactions felt slow, and our Slack integration fired some duplicate notifications. After investigation, the issues seemed unrelated to our changes. We chalked them up to temporary Google Cloud Platform (GCP) or Slack server issues, and deployed.
Narrator: But everything was not fine
The next morning, we got user reports of “5+ repeated slack messages” and app slowdowns, the same issues we saw but discounted when testing. Our first response was to mitigate the most critical user-facing issue: repeated Slack notifications. We went for simple and just disabled the feature. To our surprise, disabling Slack notifications also fixed our servers’ slow response times!
Root cause: Adding a retry
Yes, now that we’ve drawn this diagram the problem is obvious. 🤦
Turns out adding a retry to a commonly called function was the root cause. When a user joins a “room” in Remotion, we need to both access and mutate rooms in our database. This often happens in bursts, such as when many people simultaneously join a room for standup. The retry was an attempt to work around the resulting contention issues.
However, this code also calls a Slack API mid-transaction. When we retried, Slack quickly started rate limiting us with exponential backoff, which resulted in failed, slow transactions, which resulted in even more retries. And loop.
Ultimately, the issue was a combination of unexpected behaviors from multiple systems interacting with each other. The underlying architecture was flawed, but it took a small, seemingly unrelated change to break it.
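To make the failure mode concrete, here is a minimal sketch of one common way to structure this kind of code so that a retry cannot multiply external calls: retry only the database work, and call the external API once, after the transaction has succeeded. This is not our actual code or our actual fix. Every name here (`ContentionError`, `join_room_with_retry`, `update_room`, `notify_slack`) is hypothetical, and the database and Slack calls are stand-in functions.

```python
import random
import time
from typing import Callable, List


class ContentionError(Exception):
    """Hypothetical error for a room update that lost a database race."""


def join_room_with_retry(
    update_room: Callable[[], str],
    notify_slack: Callable[[str], None],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> str:
    """Retry only the database work, then notify Slack exactly once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            # Only the (stand-in) database transaction is inside the retry loop.
            message = update_room()
            break
        except ContentionError:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter, so a burst of joins
            # does not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))

    # The external call happens after the transaction succeeds and outside
    # the retry loop, so retries cannot produce duplicate notifications.
    notify_slack(message)
    return message


# Usage: a stand-in update that hits contention twice before succeeding.
attempts = {"count": 0}
sent: List[str] = []


def flaky_update() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ContentionError("room row is locked")
    return "Ada joined Standup"


result = join_room_with_retry(flaky_update, sent.append)
```

In this sketch the update is attempted three times, but `sent` ends up with a single message. With the notification inside the retried block, each failed attempt could have fired its own Slack call, which is the shape of the loop described above.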
Preventing repeat issues without slowing down
In our retro, we discussed what changes we needed to make to prevent this from happening again. The obvious reaction was to write new tests and add more rigor to processes like code review. For this specific issue, though, the tests we’d need would be complex mocks of external systems, which are expensive and difficult to build accurately.
More generally, we weren’t excited about increased testing and code review requirements. Startups win by moving fast, and these options push us away from “Speed” in the classic engineering triangle.
Instead, we aligned on a radical alternative approach: improve reliability by speeding up shipping to users.
Our solution: Ship more to unbreak quickly
1. Make it easy to deploy quickly with automation
Deployments can easily involve painful manual steps, especially if you ship native code on Apple platforms, as we do. We’ve found investing in automation and simplification to be well worth it. The easier it is to ship, the more it happens.
2. Make it easy to deploy quickly by creating a culture of trust and followup
In the face of mistakes, process builds up like scar tissue. Most of that process is unnecessary—in fact it’s probably demotivating to your strongest performers. Instead of adding process, celebrate mistakes. Make it an opportunity to reinforce the level of trust across the team. And build a culture of following up on releases rapidly in response to metrics or feedback.
3. Ship small pieces instead of large blocks
Shipping a giant project all at once is harder than shipping smaller milestones. We all know it, but projects frequently become monolithic releases despite our best efforts. It happens to us at Remotion all the time! We don’t have any silver bullets for this, but it’s useful to remind ourselves. Plus, frequent deploys make shipping milestones much more rewarding.
4. Write tests when they speed up development
I always tell my team: “Tests are not process. They are a developer tool.”
Thanks for reading
Although we write more tests than this post may lead you to believe, the recent outage was a great opportunity to reaffirm the culture of trust and ownership that we’re building at Remotion. Building it is both a learning process and a work in progress.
I’d love your thoughts and feedback. Just email me at charley at remotion dot com.