Date: April 30 – May 8, 2026
Duration: ~8 days
Severity: Major
Between April 30 and May 8, 2026, comments on Threads posts reached our system with long delays, sometimes more than 24 hours behind real time, and at times stopped arriving altogether. The cause was on our upstream partner's side. Real-time delivery was restored once the partner deployed fixes.
For about 8 days, customers with Threads channels experienced long delays in seeing their Threads comments, with periods where new comments didn't arrive at all. We enabled a backup process that covered comments on posts created in the past 7 days, but it wasn't real-time and didn't reach older posts. Once the underlying issue was fixed, the backlog cleared and all delayed comments came through.
The issue was architectural and sat with our upstream partner, not in our own systems. Comment delivery was capped at a fixed rate per minute and couldn't run in parallel, so after any interruption it had no way to catch up: it resumed at the same steady pace while new comments kept arriving. On top of that, a retry mechanism paused delivery for hours after even small errors. Together, these meant the backlog grew faster than it could be cleared.
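To illustrate why a fixed delivery cap combined with long pauses leads to delays of many hours, here is a minimal simulation sketch. It is not the upstream partner's implementation. The function name, the arrival rate, the cap, and the pause length are all hypothetical values chosen only to show the shape of the problem.

```python
def simulate_backlog(arrivals_per_min, cap_per_min, pause_windows, minutes):
    """Return the backlog size at the end of each minute.

    All parameters are illustrative assumptions, not measured values.
    pause_windows is a list of (start_minute, end_minute) half-open
    intervals during which delivery is paused, e.g. by a retry backoff.
    """
    backlog = 0
    history = []
    for t in range(minutes):
        # New comments keep arriving whether or not delivery is running.
        backlog += arrivals_per_min
        paused = any(start <= t < end for start, end in pause_windows)
        if not paused:
            # Delivery is serial and capped: it can never exceed cap_per_min,
            # so there is no burst capacity to catch up after a pause.
            backlog -= min(backlog, cap_per_min)
        history.append(backlog)
    return history


# Hypothetical numbers: 90 comments/min arriving, a 100/min delivery cap,
# and one 2-hour pause at the start.
history = simulate_backlog(90, 100, [(0, 120)], 1200)
print(history[119])   # backlog when the pause ends
print(history[1199])  # backlog 18 hours after the pause ends
```

With these assumed numbers, the spare capacity is only 10 comments per minute, so a 2-hour pause takes 18 hours to drain. If arrivals match or exceed the cap, the backlog never clears at all, which is why both a shorter pause and a higher delivery rate help.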
We enabled our backup ingestion process as soon as the incident was declared to soften the impact, and ruled out causes on our side through investigation. We then escalated directly to the upstream engineering team with our findings, which led to two fixes on their end: shortening the retry pause after failures, and increasing the delivery rate. Real-time delivery was restored on May 8. The backup process was left running through the weekend as a precaution.
Direct engineering-to-engineering escalation was what unblocked this. Standard support channels weren't enough; sharing detailed findings directly with the upstream team led to a fix within days. We'll follow the same approach for future third-party incidents.
This is a recurring pattern we're watching. Similar outages have happened a few times over the past year as event volumes have grown. We'll keep close watch until the upstream partner completes their planned reliability work.
We need earlier automated detection of delivery delays so we don't have to spot them manually.
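One possible shape for such a check is sketched below. It is an assumption, not a description of our monitoring. It supposes each ingested comment carries both the time it was created on the platform and the time we received it. The function name, the alert labels, and the thresholds (15 minutes of lag, 30 minutes of silence) are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def check_delivery_health(events, now,
                          max_lag=timedelta(minutes=15),
                          max_silence=timedelta(minutes=30)):
    """Return a list of alert labels for a window of recent comments.

    events: list of (created_at, received_at) datetime pairs.
    Thresholds are placeholder values and would need tuning.
    """
    if not events:
        # No comments at all in the window. On a quiet channel this can be
        # normal, so a real check would likely compare against expected volume.
        return ["silence"]

    alerts = []

    # Silence: nothing has arrived recently, regardless of how old it was.
    last_received = max(received for _, received in events)
    if now - last_received > max_silence:
        alerts.append("silence")

    # Lag: comments are arriving, but long after they were created.
    lags = [(received - created).total_seconds() for created, received in events]
    if median(lags) > max_lag.total_seconds():
        alerts.append("lag")

    return alerts


# Hypothetical example: a comment created a day earlier arrives just now.
now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
delayed = [(now - timedelta(hours=26), now - timedelta(minutes=5))]
print(check_delivery_health(delayed, now))
```

Tracking lag and silence separately matters here because the incident showed both symptoms: comments arriving more than 24 hours late, and periods where nothing arrived at all.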
The backup process helped, but its scope is limited. We'll look at whether it can cover more posts and stay ready as a long-term fallback.
What worked. Customer communications stayed continuous across the full 8 days because the Comms Lead role rotated across time zones, and the team's investigation was thorough enough to confidently rule out our own systems and focus the escalation externally.