Context
On March 6, 2026, a production MariaDB 10.11.15 server monitored by PmaControl suffered a major incident. Unlike typical crashes (OOM, segfault), this one presented an unusual set of symptoms:
```
RocksDB: Error opening instance, Status Code: 2, Status: Corruption: truncated record body
Incorrect information in file: './pmacontrol/ts_value_general_int.frm'
Can't init tc log
Aborting
```
The server restarted in a loop several times before stabilising, with .frm mismatch errors on multiple time-series tables.
The MDEV-39044 ticket
After investigation, we correlated this incident with MariaDB ticket MDEV-39044:
MyRocks corruption after restart during/after ALTER workload: Corruption: truncated record body, .frm mismatch, no crash log, no OOM killer
What the ticket describes
The ticket documents a reproducible corruption scenario:
- Large partitioned RocksDB tables — exactly what PmaControl uses for metrics (`ts_value_*` tables partitioned by day)
- `ALTER TABLE` under write load — adding partitions while the application writes continuously
- Simultaneous InnoDB memory pressure — InnoDB and RocksDB tables coexist on the same server
- No kernel trace — no OOM killer, no segfault, no crash log
Why it's insidious
The most dangerous aspect of the ticket: the absence of a crash log is the expected behaviour in this scenario. The server restarts, performs InnoDB crash recovery, but the RocksDB metadata is corrupted (.frm mismatch).
A DBA who only checks journalctl or dmesg will find nothing. They'll classify the incident as an "unexplained restart" and move on.
Our concrete case
Affected tables
All partitioned RocksDB tables with heavy daily writes:
- `ts_value_general_int` — integer metrics (status variables, counters)
- `ts_value_general_json` — complex JSON metrics
- `ts_mysql_digest_stat` — query statistics (digests)
- `ts_value_general_text` — text metrics
- `ts_value_slave_int` — replication metrics
- `ts_value_slave_text` — detailed replication states
The likely trigger
PmaControl automatically maintains partitions on these tables: adding next day's partition, dropping expired partitions. These are ALTER TABLE ... ADD PARTITION / DROP PARTITION operations on tables weighing tens of GB, while collection workers write continuously (every 10 seconds per monitored server).
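For illustration, here is a minimal Python sketch of this kind of rotation, assuming daily RANGE partitioning on `TO_DAYS` and `pYYYYMMDD` partition names; the actual PmaControl scheduler, naming scheme, and connection details will differ:

```python
# Hypothetical sketch of a daily partition rotation (not PmaControl's
# actual code). Under MDEV-39044, running these ALTERs while collection
# workers keep writing is exactly the risky pattern.
import datetime
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="pmacontrol",
                       password="...", database="pmacontrol")

today = datetime.date.today()
tomorrow = today + datetime.timedelta(days=1)
expired = today - datetime.timedelta(days=90)  # assumed retention

with conn.cursor() as cur:
    # Add tomorrow's partition (pYYYYMMDD naming is an assumption).
    cur.execute(
        f"ALTER TABLE ts_value_general_int ADD PARTITION "
        f"(PARTITION p{tomorrow:%Y%m%d} VALUES LESS THAN "
        f"(TO_DAYS('{tomorrow + datetime.timedelta(days=1)}')))"
    )
    # Drop the expired partition: on a table of tens of GB, this DDL
    # under concurrent writes is the suspected trigger.
    cur.execute(
        f"ALTER TABLE ts_value_general_int DROP PARTITION p{expired:%Y%m%d}"
    )
conn.close()
```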
Memory pressure signals
Before the crash, the MariaDB log shows:
```
InnoDB: Memory pressure event disregarded
```
The MDEV-39044 ticket explicitly cites this pattern as an aggravating factor. InnoDB memory pressure doesn't directly cause the corruption, but it creates the context in which the RocksDB DDL becomes non-atomic.
How PmaControl detected the incident
- Uptime reset detected within 10 seconds via the `ts_variable.uptime` time series (a detection sketch follows this list)
- Telegram alert sent immediately
- Automatic correlation with the error log: detection of `crash recovery` + `truncated record body` signatures
- Retrospective analysis: metrics from the preceding hour (threads, memory, CPU) were normal — confirming this is not a typical load issue
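As a sketch of the first step, the check below compares the two most recent uptime samples, assuming a `ts_variable` table with `(server_id, name, ts, value)` columns; the real PmaControl schema and alerting pipeline may differ:

```python
# Minimal uptime-reset check (hypothetical schema). A newer Uptime sample
# that is smaller than the previous one means the server restarted
# between the two collections.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="pmacontrol",
                       password="...", database="pmacontrol")

with conn.cursor() as cur:
    cur.execute(
        "SELECT ts, value FROM ts_variable "
        "WHERE server_id = %s AND name = 'uptime' "
        "ORDER BY ts DESC LIMIT 2",
        (42,),  # placeholder server_id
    )
    rows = cur.fetchall()

if len(rows) == 2 and rows[0][1] < rows[1][1]:
    print(f"Restart detected between {rows[1][0]} and {rows[0][0]}")
    # ...send the Telegram alert, then correlate with the error log
conn.close()
```

Counter-based detection is what catches this class of incident: the restart itself leaves no kernel trace, so only the metric reset betrays it.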
Recommendations
Immediate actions
- Do not run DDL on RocksDB tables under write load. Schedule `ALTER TABLE ... ADD/DROP PARTITION` during low-activity windows.
- Monitor `.frm` errors in the error log; this is the first indicator of post-DDL corruption (a scan sketch follows this list).
- Follow ticket MDEV-39044 for an official fix.
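A minimal sketch of such an error-log scan, using the three signatures observed in this incident; the log path is an assumption to adapt to your installation:

```python
# Scan the MariaDB error log for the MDEV-39044 signatures.
import re

SIGNATURES = [
    re.compile(r"Corruption: truncated record body"),
    re.compile(r"Incorrect information in file: '.*\.frm'"),
    re.compile(r"crash recovery"),
]

def scan_error_log(path="/var/log/mysql/error.log"):
    hits = []
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if any(sig.search(line) for sig in SIGNATURES):
                hits.append((lineno, line.rstrip()))
    return hits

if __name__ == "__main__":
    for lineno, line in scan_error_log():
        print(f"{lineno}: {line}")
```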
Structural actions
- Separate engines: if possible, do not mix InnoDB and RocksDB on the same server for critical tables.
- Consider migrating hot tables to InnoDB; RocksDB excels at sequential writes, but its DDL operations are not atomic under load (see the inventory sketch after this list).
- Size memory properly to avoid the InnoDB pressure that aggravates the problem. See our article on the OOM killer for worst-case calculations.
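As a starting point for the migration decision, the sketch below inventories RocksDB tables by size through `information_schema` (connection details are placeholders); keep in mind that `ALTER TABLE ... ENGINE=InnoDB` rebuilds the whole table, so schedule it in a maintenance window:

```python
# List RocksDB tables ordered by on-disk size, to identify candidates
# for migration to InnoDB. Connection details are placeholders.
import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="...")

with conn.cursor() as cur:
    cur.execute(
        "SELECT table_schema, table_name, "
        "ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) "
        "FROM information_schema.tables "
        "WHERE engine = 'ROCKSDB' "
        "ORDER BY data_length + index_length DESC"
    )
    for schema, name, size_gb in cur.fetchall():
        print(f"{schema}.{name}: {size_gb} GB")
conn.close()
```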
What it is not
- It is not a hardware problem (disk, RAM)
- It is not a MySQL configuration problem (parameters are correct)
- It is not reproducible on demand (it is a race condition between RocksDB DDL and concurrent writes)
It is an engine bug documented by MariaDB themselves.
Conclusion
MDEV-39044 is a reminder that using alternative storage engines (RocksDB, TokuDB) on production workloads requires particular vigilance around DDL. The absence of a crash log does not mean the absence of corruption.
PmaControl detects these incidents through uptime monitoring + error log correlation, where standard tools see nothing.