Context

Friday evening, 10:47 PM. The Telegram alert fires: the primary datacenter is down. Not a server, not a rack, but the entire DC. A cascading power failure took it out, and the UPS systems didn't hold.

28 MariaDB / MySQL instances, 3 Galera clusters, 2 ProxySQL instances. Everything is offline.

Timeline

Time	Action
22:47	PmaControl alert — DC unreachable
22:49	OVH confirmation — electrical incident in DC
22:51	DNS failover to secondary DC
22:54	Galera bootstrap on the surviving node
22:58	ProxySQL automatic reconfiguration
23:01	First successful SELECT on the secondary cluster
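
The 22:54 step depends on knowing whether the surviving node may safely form a new cluster. Galera records this in the node's grastate.dat file. The sketch below is illustrative only: it is not PmaControl's code, the sample file content and its UUID are invented, and the helper names are hypothetical. It parses the saved state and reads the safe_to_bootstrap flag.

```python
from typing import Dict

# Hypothetical grastate.dat content; the UUID and seqno are made up for illustration.
SAMPLE_GRASTATE = """\
# GALERA saved state
version: 2.1
uuid:    6a1f102a-13a3-11e7-b710-b2876418a643
seqno:   1234
safe_to_bootstrap: 1
"""


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def can_bootstrap(state: Dict[str, str]) -> bool:
    """True only if Galera itself flagged this node as safe to bootstrap."""
    return state.get("safe_to_bootstrap") == "1"


if __name__ == "__main__":
    state = parse_grastate(SAMPLE_GRASTATE)
    print(state["seqno"], can_bootstrap(state))
```

On MariaDB, the bootstrap itself is normally started with galera_new_cluster. After an unclean shutdown the file often shows seqno: -1 and safe_to_bootstrap: 0, in which case the position has to be recovered first (for example with --wsrep-recover) before an operator overrides the flag.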

Lessons learned

Backups are not enough — without a tested recovery plan, they are useless
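Galera IST vs SST: an incremental state transfer (IST) replays only the missing writesets, while a state snapshot transfer (SST) recopies the whole dataset. That is the difference between 2 minutes and 2 hours of recovery.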
PmaControl detected the incident in 12 seconds, even before OVH confirmed it
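
To make the IST vs SST point concrete, here is a minimal sketch of the rule Galera applies when a node rejoins. IST is only possible when the joiner belongs to the same cluster history and the first writeset it is missing is still in the donor's gcache. The function and parameter names are hypothetical; on a real donor, the lowest cached seqno is exposed as the wsrep_local_cached_downto status variable and the latest one as wsrep_last_committed.

```python
def ist_possible(
    joiner_uuid: str,
    joiner_seqno: int,
    donor_uuid: str,
    donor_cached_downto: int,
    donor_last_committed: int,
) -> bool:
    """Return True if the joiner could catch up with IST instead of a full SST.

    Simplified model of the rule, not Galera's actual implementation.
    """
    if joiner_uuid != donor_uuid:
        return False  # different cluster history: full SST required
    if joiner_seqno < 0:
        return False  # seqno -1 means the joiner's position is unknown
    if joiner_seqno > donor_last_committed:
        return False  # joiner is ahead of the donor: histories diverged
    # The first writeset the joiner needs is joiner_seqno + 1;
    # it must still be present in the donor's gcache.
    return joiner_seqno + 1 >= donor_cached_downto


if __name__ == "__main__":
    # Donor still caches writesets 900..1500.
    print(ist_possible("u1", 1000, "u1", 900, 1500))  # gap is cached
    print(ist_possible("u1", 500, "u1", 900, 1500))   # gap fell out of the cache
```

In practice this is why gcache.size matters: the longer a node stays down relative to the write rate, the more likely the writesets it needs have already been evicted, forcing the slow path.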

Conclusion

14 minutes between the alert and the first SELECT. This is the result of preparation, not luck.
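
The 14-minute figure can be checked directly against the timeline. The short sketch below recomputes the duration of each step and the total from the timestamps listed above; the helper name is hypothetical and the code assumes all events fall on the same day, which holds here.

```python
from datetime import datetime

# Timestamps and actions copied from the incident timeline.
TIMELINE = [
    ("22:47", "PmaControl alert"),
    ("22:49", "OVH confirmation"),
    ("22:51", "DNS failover to secondary DC"),
    ("22:54", "Galera bootstrap on the surviving node"),
    ("22:58", "ProxySQL automatic reconfiguration"),
    ("23:01", "First successful SELECT"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Duration of each step, measured from the previous event.
step_minutes = [
    minutes_between(TIMELINE[i][0], TIMELINE[i + 1][0])
    for i in range(len(TIMELINE) - 1)
]
total_minutes = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])

if __name__ == "__main__":
    for (_, action), took in zip(TIMELINE[1:], step_minutes):
        print(f"{action}: +{took} min")
    print(f"Total: {total_minutes} min")
```

The longest single step is the 4 minutes between the Galera bootstrap and the ProxySQL reconfiguration.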

"A backup does not replace a recovery strategy." — PmaControl

Control the uncontrollable: anatomy of a DC crash
