Date: May 13, 2026 Duration: ~57 minutes Severity: Major
On May 13, 2026, publishing posts to YouTube began failing for customers. For about 17 minutes, every YouTube post attempted failed, and success rates were degraded for a period on either side. The cause was a temporary problem on YouTube's side, and the service recovered on its own.
Customers who scheduled or tried to publish YouTube posts during the incident window saw their posts either fail or be significantly delayed. For roughly 17 minutes, all YouTube posting failed. Posts that failed during the window were subsequently retried. Other channels were unaffected.
The cause was a temporary problem with YouTube's video-upload service, not anything in our own systems. Upload requests to YouTube began timing out — YouTube simply wasn't responding to them in time. Several signals confirmed the issue was on YouTube's side: the problem affected only YouTube and no other Google services, our own systems showed no errors or unusual behaviour, and the service recovered without any change or intervention from us. A self-healing recovery like this is the signature of an upstream platform issue rather than an internal one.
We confirmed through our logs and metrics that the failures were YouTube-side, posted a public status page update, and activated our customer support workflow. We monitored recovery using both our dashboards and real posting tests, and once YouTube publishing had been stable at full success rates for a sustained period, we marked the incident resolved. Because the root cause was a temporary YouTube issue, no code or infrastructure change was needed — the service recovered on its own.
We can soften the impact of third-party outages even when we can't prevent them. Adding automatic retries with backoff for YouTube posting would let posts succeed through short-lived upstream problems without anyone needing to step in.
Error messages should be friendlier. When a post fails because of an upstream issue, customers should see a clear message rather than a raw technical error.
We need earlier automated detection. We currently spot these drops by manually checking dashboards. Alerting when a platform's posting success rate drops below a threshold would let us respond before customers report problems.
This is a recurring pattern. Several recent incidents have come from third-party platform instability. A standard runbook for third-party outages would streamline our response each time.
What worked. Engineering, Advocacy, and Communications coordinated well — customer reports, dashboard data, and real posting tests were combined to confirm both the cause and the recovery, and customer communication went out quickly.