Context
Friday evening, 10:47 PM. The Telegram alert fires: the primary datacenter is down. Not a server, not a rack: the entire DC. A cascading power failure, and the UPS systems didn't hold.
28 MariaDB/MySQL instances, 3 Galera clusters, 2 ProxySQL instances. Everything is offline.
Timeline
| Time | Action |
|---|---|
| 22:47 | PmaControl alert — DC unreachable |
| 22:49 | OVH confirmation — electrical incident in DC |
| 22:51 | DNS failover to secondary DC |
| 22:54 | Galera bootstrap on the surviving node |
| 22:58 | ProxySQL automatic reconfiguration |
| 23:01 | First successful SELECT on the secondary cluster |
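The 22:54 step is only safe if the node being bootstrapped holds the most recent data. As a minimal sketch of that check (not the tooling used during the incident), the snippet below parses the `grastate.dat` file that Galera writes in each node's data directory and picks the node with the highest `seqno`. It assumes the nodes were stopped, so the file holds a real position; the helper names `parse_grastate` and `pick_bootstrap_node` and the sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GraState:
    node: str                # label chosen by the operator (hypothetical)
    uuid: str                # cluster state UUID from grastate.dat
    seqno: int               # last committed seqno, -1 if unknown
    safe_to_bootstrap: bool  # flag written by Galera on clean shutdown


def parse_grastate(node: str, text: str) -> GraState:
    """Parse the 'key: value' lines of a grastate.dat file."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '# GALERA saved state' header
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return GraState(
        node=node,
        uuid=fields["uuid"],
        seqno=int(fields["seqno"]),
        safe_to_bootstrap=fields.get("safe_to_bootstrap", "0") == "1",
    )


def pick_bootstrap_node(states: List[GraState]) -> Optional[GraState]:
    """Return the most advanced node, or None when no position is known."""
    known = [s for s in states if s.seqno >= 0]
    if not known:
        # seqno is -1 everywhere (crash or running node): the position has
        # to be recovered first, which this sketch does not attempt.
        return None
    return max(known, key=lambda s: s.seqno)
```

Called with one `GraState` per surviving node, `pick_bootstrap_node` returns the candidate to bootstrap from. With a single survivor the choice is trivial, but the same check confirms that its position is actually known before a new cluster is formed from it.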
Lessons learned
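- Backups are not enough: without a tested recovery plan, they are useless
- Galera IST vs SST: an Incremental State Transfer replays only the missing write-sets, while a State Snapshot Transfer copies the full dataset. That is the difference between 2 minutes and 2 hours of recovery
- PmaControl detected the incident in 12 seconds, even before the OVH alert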
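To make the IST vs SST point concrete, here is a simplified sketch of the rule that decides which transfer a rejoining node gets: IST is only possible when the donor's write-set cache (gcache) still contains every write-set the joiner is missing. The `Donor` and `Joiner` classes and the `transfer_method` function are hypothetical names for illustration. The two donor values correspond to the `wsrep_local_cached_downto` and `wsrep_last_committed` status variables, and the real decision is made inside Galera, not by code like this.

```python
from dataclasses import dataclass


@dataclass
class Donor:
    cluster_uuid: str
    cached_downto: int   # wsrep_local_cached_downto: lowest seqno in gcache
    last_committed: int  # wsrep_last_committed


@dataclass
class Joiner:
    cluster_uuid: str
    seqno: int           # last applied seqno, -1 if unknown


def transfer_method(donor: Donor, joiner: Joiner) -> str:
    """Return 'IST' or 'SST' for a rejoining node (simplified model)."""
    if joiner.cluster_uuid != donor.cluster_uuid:
        return "SST"  # different history: a full copy is the only option
    if joiner.seqno < 0:
        return "SST"  # position unknown, nothing to resume from
    if joiner.seqno > donor.last_committed:
        return "SST"  # joiner ahead of donor: conservative choice in this sketch
    first_needed = joiner.seqno + 1
    # IST works only if the first missing write-set is still cached.
    return "IST" if first_needed >= donor.cached_downto else "SST"
```

In this model, a larger gcache keeps `cached_downto` low for longer, so a node that was away for hours can still rejoin through IST instead of a full SST.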
Conclusion
14 minutes between the alert and the first SELECT. This is the result of preparation, not luck.
"A backup does not replace a recovery strategy." — PmaControl