- Date: 2016-08-11
- Duration: ~1.5 hours fully down, then ~1.5 hours degraded (~3 hours total)
- Root cause: During a planned migration of its ZooKeeper cluster, Reddit manually disabled its autoscaler, which reads its configuration from ZooKeeper. Mid-migration, the package-management system detected the manual change and automatically re-enabled the autoscaler. The autoscaler then read the partially migrated ZooKeeper data and terminated many application and cache servers.
- Services affected: Entire platform offline, followed by degraded performance while the caches rewarmed
Impact
On August 11, 2016, Reddit was completely offline for about an hour and a half after an automated package-management system re-enabled the autoscaler in the middle of a ZooKeeper migration, and the autoscaler terminated servers based on partially migrated data. After the site came back, cold caches kept performance degraded for roughly another hour and a half, for about three hours of disruption in total. Reddit published a postmortem explaining the chain of events.
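
The underlying failure mode, an automated system undoing a deliberate manual change, can be illustrated with a toy model. The following is a minimal sketch, not Reddit's tooling or its actual fix: `Service`, `ConvergenceAgent`, the `holds` set, and the service name `paused-component` are all hypothetical names invented for illustration. It assumes a configuration-management agent that periodically forces every service back to its declared state unless an operator has registered a maintenance hold.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    running: bool = True


@dataclass
class ConvergenceAgent:
    """Toy stand-in for a configuration-management run (hypothetical)."""
    desired_running: dict[str, bool]
    # Services an operator has explicitly paused enforcement for.
    holds: set[str] = field(default_factory=set)

    def converge(self, services: list[Service]) -> list[str]:
        """Enforce the declared state; return names of services changed."""
        changed = []
        for svc in services:
            if svc.name in self.holds:
                continue  # maintenance hold: leave the manual change alone
            want = self.desired_running.get(svc.name, svc.running)
            if svc.running != want:
                svc.running = want
                changed.append(svc.name)
        return changed


svc = Service("paused-component")
agent = ConvergenceAgent(desired_running={"paused-component": True})

# Failure mode: a manual stop with no hold is undone by the next run.
svc.running = False
reverted = agent.converge([svc])

# With a hold registered first, the manual stop survives the run.
agent.holds.add("paused-component")
svc.running = False
kept = agent.converge([svc])
```

In this sketch the first run restarts the stopped service because nothing tells the agent the stop was intentional, while the second run leaves it alone. Whatever the real mechanism, the general point is that a manual override needs to be recorded somewhere the automation checks.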