2023-03-14
Reddit's Pi-Day Outage: a Kubernetes 1.24 upgrade takes the site down for 314 minutes
- Date: 2023-03-14
- Duration: ~5 hours (314 minutes)
- Root cause: A routine Kubernetes upgrade from 1.23 to 1.24 silently removed the deprecated 'node-role.kubernetes.io/master' node label. Reddit's Calico networking ran route reflectors that were pinned, through a label selector, to nodes carrying that label. When the label disappeared, the selector matched no nodes, the affected nodes lost their network routes, and core service discovery collapsed. Monitoring shared the same naming dependency and kept reporting healthy while observing nothing.
- Services affected: Entire site (web, mobile web, native apps); request routing and internal service discovery
Impact
Just after 19:00 UTC an engineer kicked off the upgrade and Reddit came to a near-total halt within minutes. The whole platform was unreachable for about five hours while engineers traced a hidden coupling between the removed Kubernetes label and the Calico route-reflector configuration. The fix ultimately required restoring the old node labels and rebuilding the cluster's networking, and Reddit published a detailed public postmortem ('You Broke Reddit: The Pi-Day Outage') describing the cascade.