Slowness and intermittent errors in Buffer

Write-up

Incident Report: Core API Slowdown and Service Regressions

Date: May 18 to May 20, 2026

Duration: ~21 hours of degraded performance across two days

Severity: Major

Between May 18 and May 20, Buffer experienced widespread slowness and intermittent errors across the web app, our iOS and Android apps, the public API, and MCP. Pages loaded slowly, posts took longer to publish, and connecting social accounts sometimes required several attempts. The platform stayed usable throughout, just slower than it should have been. No customer data was lost.

This write-up explains what happened, how it affected you, how we resolved it, and what we are changing to prevent it from happening again.

Root Cause

The incident came from three separate changes that landed close together and amplified one another. None of them would have caused a serious outage on its own, but together they created a feedback loop that overloaded our Core API, the central service that nearly all requests pass through.

A backend authorization change added extra processing to many requests. As part of ongoing work to enforce stricter access checks across the product, a new code path added a small amount of in-process work to a large number of requests. The added work was cheap for any single request, but at our traffic volume it overwhelmed the single-threaded part of our Core API. As a result, almost every request slowed down, and the database and cache calls made by those requests appeared slow because they were waiting in a backlog, not because anything was wrong with the database or cache.
An automated code cleanup introduced a loop in our Analyze frontend. A large, mostly mechanical formatting update swapped one array operation for a similar one that behaved slightly differently. That difference caused the Analyze app to repeatedly request the same data in a tight loop, flooding the Core API with redundant traffic. The change was buried among thousands of routine edits, which made it hard to catch in review.
A frontend bug in our comments feature caused repeated data reloads. A separate change made the app reload comment counts far more often than necessary, adding still more load to the Core API.
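
The backlog effect described in the first item can be illustrated with a small simulation. This is a minimal sketch under assumed numbers, not a model of Buffer's actual traffic: the function name `simulate_single_thread`, the arrival rate of 1,000 requests per second, and the per-request costs of 0.8 ms and 1.1 ms are all hypothetical. It shows that once the work per request on a single thread exceeds the gap between arrivals, each request waits longer than the one before it, even though every request is still individually cheap.

```python
def simulate_single_thread(arrival_interval_us: int, service_us: int, n_requests: int) -> list[int]:
    """Latency (queue wait + service) per request for one worker thread.

    Requests arrive at a fixed interval and are handled one at a time, in order.
    All times are integer microseconds so the arithmetic is exact.
    """
    latencies = []
    thread_free_at = 0  # time at which the single thread finishes its current work
    for i in range(n_requests):
        arrived_at = i * arrival_interval_us
        started_at = max(arrived_at, thread_free_at)  # wait if the thread is busy
        thread_free_at = started_at + service_us
        latencies.append(thread_free_at - arrived_at)
    return latencies


# Hypothetical numbers: one request every 1,000 us (1,000 requests per second).
before = simulate_single_thread(1000, 800, 10_000)   # 0.8 ms of work per request
after = simulate_single_thread(1000, 1100, 10_000)   # 1.1 ms of work per request

print(max(before))  # 800: the thread keeps up, so nothing queues
print(after[-1])    # 1001000: the last request takes about one second
```

In this sketch an extra 0.3 ms of work per request is enough to turn a steady 0.8 ms latency into a backlog that grows by 0.1 ms with every request, which is why calls further down the stack can look slow while being healthy.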

A complicating factor ran through all three: even after we deployed fixes, many people's browsers kept running older cached versions of the code, so the extra traffic continued until we found a way to force those sessions to pick up the corrected code.

Customer Impact

Response times rose from around 50 ms to several seconds, with some individual requests timing out near 22 seconds.
Every customer surface was affected: Buffer Web, iOS, Android, the public API, and MCP.
People saw slow page loads, delayed post uploads, and occasional errors.
Connecting social accounts often required multiple retries.
Our internal support tooling was also degraded, which slowed some customer responses.
No customer data was lost at any point.

Impact began on May 18 and was fully resolved by May 20 at 15:42 UTC. We applied an overnight mitigation in between so that the platform stayed at normal speed while we kept working on the underlying fixes.

Steps to Resolution

Resolving this took roughly a day of investigation and mitigation across our backend, frontend, and infrastructure teams.

Diagnosis. We correlated our monitoring data to confirm the slowdown was caused by application-level processing on the Core API rather than a database or cache problem. That pointed us at the recent authorization changes.
Backend fixes. We shipped a series of changes to reduce the per-request work introduced by the authorization update, bringing latency back toward normal levels.
Overnight scaling. To keep the platform fast while we continued working, we temporarily added capacity to the Core API. This held performance steady through the night, and we scaled back down once the underlying fixes were confirmed.
Analyze loop fix. We identified and fixed the loop in the Analyze frontend. Because some browsers were still running the old version, we also adjusted the affected endpoint so that outdated clients could no longer generate runaway traffic.
Comments traffic mitigation. We reverted the comments change, paused the high-volume updates feeding the loop, and invalidated the sessions generating the most excess traffic.
Forcing stale clients to update. We tagged frontend requests with a version identifier and added an edge rule to block requests from outdated clients, which prompted those browsers to reload and pick up the fixed code. This drove the remaining excess traffic to zero.
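
The version-tagging and edge-blocking step can be sketched as a small decision function. This is an illustration under assumptions, not the rule Buffer deployed: the header name `X-Client-Version`, the integer build-number format, the minimum build value, and the function names `parse_build` and `should_block` are all hypothetical. The sketch treats a missing or malformed tag as outdated, on the assumption that clients loaded before tagging was introduced send no tag at all.

```python
from typing import Mapping, Optional

VERSION_HEADER = "x-client-version"  # hypothetical header name
MIN_ALLOWED_BUILD = 4210             # hypothetical first build containing the fixes


def parse_build(raw: Optional[str]) -> Optional[int]:
    """Return the build number, or None if the tag is missing or malformed."""
    if raw is None:
        return None
    raw = raw.strip()
    # Accept plain ASCII digits only; anything else counts as malformed.
    return int(raw) if raw.isascii() and raw.isdigit() else None


def should_block(headers: Mapping[str, str], min_build: int = MIN_ALLOWED_BUILD) -> bool:
    """Decide whether an edge rule should reject a frontend request as stale."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    build = parse_build(normalised.get(VERSION_HEADER))
    # Assumption: untagged requests come from code that predates the tagging.
    return build is None or build < min_build


print(should_block({"X-Client-Version": "4209"}))  # True: older than the minimum
print(should_block({"X-Client-Version": "4210"}))  # False: current build
print(should_block({}))                            # True: no version tag
```

How a blocked request prompts the browser to reload is not detailed in this write-up, so the sketch stops at the block decision.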

By May 20 at 15:42 UTC, traffic and error rates were back to normal across both customer-facing and internal tools, with no new reports of slowness.

Key Learnings

We needed faster alerting. The slowdown started about 24 hours before we formally responded. Our monitoring detected the underlying signal, but it was not wired to alert anyone. We are adding alerts on request latency, traffic spikes, and the specific internal metric that first showed the strain.
Large automated changes need extra scrutiny. The most damaging bug was a one-line behavioral change hidden inside an otherwise mechanical update touching thousands of files. We are tightening how we review and guard against this class of change.
Performance needs to be validated under realistic load. The authorization change passed review, but its cumulative effect only showed up at production traffic volumes. We are building performance checks that better reflect real conditions before changes go live.
Stale browser code is a real risk. Several of the issues persisted because deploying a fix does not force everyone's browser to use it. The version-tagging and edge-blocking approach we used to resolve this is being turned into a standard, documented tool for future incidents.
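
The alerting gap in the first learning can be made concrete with a minimal sketch of a latency check. The names `p95` and `should_alert`, the 500 ms threshold, and the rule of three consecutive breaching windows are assumptions chosen for illustration; this write-up does not state the thresholds or tooling Buffer uses. The sample values reuse the figures reported above: around 50 ms when healthy and several seconds when degraded.

```python
from typing import Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1]


def should_alert(
    windows: Sequence[Sequence[float]],
    threshold_ms: float = 500.0,  # hypothetical threshold
    consecutive: int = 3,         # hypothetical number of breaching windows
) -> bool:
    """True if the most recent `consecutive` windows all have p95 above the threshold."""
    if len(windows) < consecutive:
        return False
    return all(p95(w) > threshold_ms for w in windows[-consecutive:])


healthy = [50.0] * 100                    # every request near 50 ms
degraded = [50.0] * 90 + [3000.0] * 10    # 10% of requests take 3 seconds

print(should_alert([healthy, degraded, degraded, degraded]))  # True
print(should_alert([degraded, degraded, healthy]))            # False
```

Requiring several consecutive windows is one way to avoid paging on a single brief spike; the right window size and threshold depend on the service's normal variance.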