Bluesky Service Disruption

Write-up

Incident Report: Bluesky Integration Instability

Date: April 16–17, 2026
Duration: ~22 hours
Severity: Major

Summary

Buffer's Bluesky integration experienced extended instability across publishing, analytics, and engagement features. Customers saw elevated posting failures (particularly for video content), comment ingestion delays, and inconsistent delivery throughout the incident window. The root cause was an upstream outage on Bluesky's platform, and the integration recovered as Bluesky's services stabilized.

Root Cause

The incident was caused by a platform-wide outage on Bluesky's side, which Bluesky later attributed publicly to a sophisticated DDoS attack. The outage affected multiple Bluesky API endpoints, which Bluesky's status page reported as "Down" for the majority of the incident window. Because Buffer's Bluesky integration depends directly on these APIs for both publishing and ingesting content, the upstream instability cascaded into Buffer's publishing, comment ingestion, and analytics flows. There were no internal code changes, deployments, or regressions involved.

Customer Impact

Approximately 707 organizations were affected during the incident. Around 2,000 posts failed to publish across roughly 780 unique accounts, with video posts hit hardest (a 100% failure rate early in the incident, recovering to around 68% mid-day before stabilizing). Image posts saw an elevated failure rate of around 20% early on, while text-only posts were generally more likely to succeed throughout. In Buffer's Community product, 53 comment replies failed to publish, affecting 18 users, and 15 channels were unable to backfill comments during reconnection cycles. Customers were notified via Buffer's public status page, in-product communications, and email follow-ups once the integration had recovered.

Steps to Resolution

The team correlated Buffer's internal metrics with Bluesky's public status page within the first half hour of the incident, confirming the cause was upstream. Because resolution depended on Bluesky restoring their platform, the team's response focused on monitoring, customer communication, and continuous status updates rather than a code-level fix. A public status page incident was created early on, a Help Scout workflow was set up to route incoming customer reports, and the Customer Comms Lead role was rotated across multiple team members to maintain around-the-clock coverage during the long incident window. Buffer's integration recovered ahead of Bluesky's official "operational" status update, with both standard and video posting success rates returning to above 90% on the evening of April 16. The incident was formally resolved the following morning after a sustained period of stable metrics.

Key Learnings

Because the root cause was a third-party outage, there was no way to prevent the incident itself. However, we identified opportunities to detect this kind of upstream instability sooner. We're exploring automated monitoring of partner platform status pages, per-platform error rate thresholds (so platform-specific degradation surfaces faster when other integrations are healthy), and Firehose stream connection-health alerts that would flag prolonged disconnections or reconnection storms before they translate into visible publishing failures.

Bluesky's own status page experienced timeouts during the incident and flipped back and forth between recovery and investigating states, which made it harder to give customers an accurate picture of the upstream situation. For future third-party incidents, we plan to supplement official status pages with additional signals so our customer communications can stay accurate even when partner sources are unreliable.

Video posts were consistently the most affected content type, with failure rates well above those of text and image posts. We're investigating whether the video publishing path has additional dependencies or less graceful error handling that could be improved, so customers see a more resilient experience the next time an upstream platform degrades.

Our incident response process held up well across a long, multi-shift incident. The Comms Lead role rotated five times across team members and time zones, which kept customers informed throughout. It is a pattern we want to keep using for extended third-party incidents.
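
As a rough illustration of the per-platform error rate thresholds mentioned in the first learning, here is a minimal Python sketch. It is not Buffer's implementation: the class name, the five-minute window, the 20% threshold, the minimum sample count, and the platform labels are all assumptions chosen for the example. The idea is only that each platform gets its own sliding window, so one degraded integration stands out even while the others are healthy.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class PlatformErrorMonitor:
    """Hypothetical sliding-window failure-rate tracker, one window per platform."""

    window_seconds: float = 300.0  # assumed look-back window
    threshold: float = 0.20        # assumed alert threshold (20% failures)
    min_samples: int = 20          # skip platforms with too few recent posts
    _events: dict = field(default_factory=lambda: defaultdict(deque))

    def record(self, platform, success, now=None):
        """Store one publish attempt as (timestamp, success) for the platform."""
        now = time.time() if now is None else now
        self._events[platform].append((now, success))
        self._trim(platform, now)

    def _trim(self, platform, now):
        """Drop attempts that are older than the look-back window."""
        events = self._events[platform]
        while events and events[0][0] < now - self.window_seconds:
            events.popleft()

    def failure_rate(self, platform, now=None):
        """Return the failure fraction in the window, or None if too few samples."""
        now = time.time() if now is None else now
        self._trim(platform, now)
        events = self._events[platform]
        if len(events) < self.min_samples:
            return None
        failures = sum(1 for _, ok in events if not ok)
        return failures / len(events)

    def degraded_platforms(self, now=None):
        """List platforms whose own failure rate meets or exceeds the threshold."""
        degraded = []
        for platform in list(self._events):
            rate = self.failure_rate(platform, now)
            if rate is not None and rate >= self.threshold:
                degraded.append(platform)
        return sorted(degraded)


# Usage: one platform failing heavily while another stays healthy.
monitor = PlatformErrorMonitor(min_samples=10)
for i in range(10):
    monitor.record("bluesky", success=(i >= 7), now=1000.0 + i)
    monitor.record("other", success=True, now=1000.0 + i)
print(monitor.degraded_platforms(now=1010.0))
```

Tracking each platform separately, rather than one global failure rate, is what lets a single degraded integration cross the threshold without being averaged away by healthy traffic elsewhere. A real deployment would also need to choose thresholds per content type, given how differently video, image, and text posts behaved in this incident.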