EC Bitrot Scrub — Internals, Config & Limits
Technical reference for EC Bitrot Scrub — the checksum sidecar, the scrub and auto-repair flow, configuration, and limits.
How it works
Checksum sidecar
Each EC volume has an optional per-volume checksum sidecar (.ecsum, or .ecsum.v<N> per vacuum generation) holding a CRC32C (Castagnoli) checksum for every fixed-size block — default 16 MiB — of every shard, data and parity alike. This lets a scrub (and the reconstruction path) detect silent disk corruption in any shard, including cold parity shards that are never read during normal serving.
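As a rough illustration of per-block checksumming, the sketch below computes one CRC32C per fixed-size block with Go's standard `hash/crc32` package and finds the first block that no longer matches. The function names `blockChecksums` and `firstMismatch` are hypothetical and are not taken from the product; the real sidecar layout and reader are not described here.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the standard library's CRC32C (Castagnoli) table.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// blockChecksums returns one CRC32C per fixed-size block of data.
// The last block may be shorter than blockSize.
func blockChecksums(data []byte, blockSize int) []uint32 {
	if blockSize <= 0 {
		return nil
	}
	var sums []uint32
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.Checksum(data[off:end], castagnoli))
	}
	return sums
}

// firstMismatch returns the index of the first block whose checksum differs
// from the recorded one, or -1 when every block matches. A differing block
// count is treated as a mismatch at the first missing index.
func firstMismatch(recorded, actual []uint32) int {
	n := len(recorded)
	if len(actual) < n {
		n = len(actual)
	}
	for i := 0; i < n; i++ {
		if recorded[i] != actual[i] {
			return i
		}
	}
	if len(recorded) != len(actual) {
		return n
	}
	return -1
}

func main() {
	const blockSize = 4 // tiny block for the demo; the documented default is 16 MiB
	shard := []byte("0123456789abcdef")
	recorded := blockChecksums(shard, blockSize)

	shard[9] ^= 0x01 // simulate a single flipped bit in the third block
	fmt.Println("first bad block:", firstMismatch(recorded, blockChecksums(shard, blockSize)))
}
```

Because every shard is checksummed the same way, the same comparison applies to parity shards that normal reads never touch.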
The sidecar header carries its own CRC, so a loader detects corruption of the sidecar itself before trusting its contents — a rotted sidecar is reported as an integrity error, never mistaken for shard corruption. The sidecar is optional and degrades gracefully: an absent or generation-mismatched sidecar means “feature off” for that generation, so older binaries, JSON-only nodes, and rollback deployments simply ignore it.
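The self-checking header and the "feature off" fallback can be sketched as follows. The 16-byte layout, the `ECSM` magic, the field names, and the helper names are all assumptions made for illustration; the actual `.ecsum` format is not specified in this document.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrSidecarIntegrity marks a damaged sidecar, as opposed to a damaged shard.
var ErrSidecarIntegrity = errors.New("sidecar integrity error")

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// Hypothetical layout: magic(4) | generation(4) | blockSize(4) | headerCRC(4).
const headerSize = 16

var magic = [4]byte{'E', 'C', 'S', 'M'}

type header struct {
	Generation uint32
	BlockSize  uint32
}

// encodeHeader serializes h and appends a CRC32C over the first 12 bytes.
func encodeHeader(h header) []byte {
	buf := make([]byte, headerSize)
	copy(buf[0:4], magic[:])
	binary.LittleEndian.PutUint32(buf[4:8], h.Generation)
	binary.LittleEndian.PutUint32(buf[8:12], h.BlockSize)
	binary.LittleEndian.PutUint32(buf[12:16], crc32.Checksum(buf[:12], castagnoli))
	return buf
}

// decodeHeader verifies the header CRC before trusting any field.
func decodeHeader(buf []byte) (header, error) {
	if len(buf) < headerSize || string(buf[0:4]) != string(magic[:]) {
		return header{}, ErrSidecarIntegrity
	}
	want := binary.LittleEndian.Uint32(buf[12:16])
	if crc32.Checksum(buf[:12], castagnoli) != want {
		return header{}, ErrSidecarIntegrity
	}
	return header{
		Generation: binary.LittleEndian.Uint32(buf[4:8]),
		BlockSize:  binary.LittleEndian.Uint32(buf[8:12]),
	}, nil
}

// sidecarEnabled reports whether checksum verification applies: an absent
// sidecar (nil) or one from another generation means "feature off".
func sidecarEnabled(h *header, currentGeneration uint32) bool {
	return h != nil && h.Generation == currentGeneration
}

func main() {
	raw := encodeHeader(header{Generation: 3, BlockSize: 16 << 20})
	h, err := decodeHeader(raw)
	fmt.Println(h.Generation, h.BlockSize, err, sidecarEnabled(&h, 3))

	raw[5] ^= 0xFF // rot one byte of the header itself
	_, err = decodeHeader(raw)
	fmt.Println(err)
}
```

Returning a dedicated error for header damage is what lets a caller report a rotted sidecar separately from shard corruption.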
Detection (scheduling)
A worker plugin task enumerates EC volumes from the master topology, producing one scrub candidate per distinct (volume ID, collection, disk type) and honoring an optional collection filter and result paging. It does not pre-judge shard health; the scrub itself finds corruption. Detection is skipped when the last successful run is more recent than the minimum interval (min_interval_seconds).
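The detection step can be approximated with the sketch below. It deduplicates topology entries into candidates, applies the collection filter, and checks the minimum interval. The `candidate` type, the function names, and the simple `limit` argument standing in for result paging are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate identifies one scrub job. It is comparable, so it can key a map.
type candidate struct {
	VolumeID   uint32
	Collection string
	DiskType   string
}

// scrubCandidates reduces per-shard topology entries to one candidate per
// distinct (volume ID, collection, disk type). An empty collectionFilter
// means all collections; limit <= 0 means no limit.
func scrubCandidates(entries []candidate, collectionFilter string, limit int) []candidate {
	seen := make(map[candidate]bool)
	var out []candidate
	for _, e := range entries {
		if collectionFilter != "" && e.Collection != collectionFilter {
			continue
		}
		if seen[e] {
			continue
		}
		seen[e] = true
		out = append(out, e)
	}
	// Sort for a deterministic order before truncating.
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i], out[j]
		if a.VolumeID != b.VolumeID {
			return a.VolumeID < b.VolumeID
		}
		if a.Collection != b.Collection {
			return a.Collection < b.Collection
		}
		return a.DiskType < b.DiskType
	})
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}

// shouldSkipDetection reports whether the last successful run is more recent
// than the minimum interval. A non-positive timestamp means "never ran".
func shouldSkipDetection(lastSuccessUnix, nowUnix, minIntervalSeconds int64) bool {
	if lastSuccessUnix <= 0 {
		return false
	}
	return nowUnix-lastSuccessUnix < minIntervalSeconds
}

func main() {
	entries := []candidate{
		{7, "logs", "hdd"}, {7, "logs", "hdd"}, // same volume seen on two holders
		{7, "logs", "ssd"},
		{9, "media", "hdd"},
	}
	fmt.Println(scrubCandidates(entries, "", 0))
	fmt.Println(scrubCandidates(entries, "media", 0))
	fmt.Println(shouldSkipDetection(1000, 1100, 300))
}
```

Note that no shard health is consulted when building candidates; every matching EC volume is eligible.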
Scrub + auto-repair (execution)
For each volume the executor looks up the shard holders from the master and runs a read-only CHECKSUM scrub on each holder (the same verification behind the ec.scrub shell command).
When auto-repair is enabled and the scrub reports broken shards (as opposed to sidecar-integrity errors), the executor quarantines each broken shard (unmount + delete, bounded by the volume’s parity count) so that the normal ec.rebuild regenerates and Reed-Solomon-verifies a clean copy. Integrity errors (a bad sidecar) are reported, not quarantined.
Each run reports: files verified, broken shards, quarantined shards, and integrity errors.
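The quarantine decision and the per-run report can be sketched as below. All type and function names are hypothetical. One point is an assumption rather than documented behavior: when more shards are broken than the parity count allows, this sketch quarantines the lowest-numbered shards up to the bound and reports the rest; the document only states that quarantine is bounded by the parity count.

```go
package main

import (
	"fmt"
	"sort"
)

// scrubResult is a hypothetical merged result of the read-only scrub.
type scrubResult struct {
	FilesVerified   int
	BrokenShards    []int    // shard IDs that failed checksum verification
	IntegrityErrors []string // sidecar problems; never a reason to quarantine
}

// runReport mirrors the four reported figures, plus the shards left alone.
type runReport struct {
	FilesVerified     int
	BrokenShards      int
	QuarantinedShards []int
	Unrepaired        []int
	IntegrityErrors   int
}

// planQuarantine decides which broken shards to quarantine (unmount + delete).
// It never quarantines more than parityCount shards, and it ignores
// integrity errors entirely.
func planQuarantine(res scrubResult, autoRepair bool, parityCount int) runReport {
	report := runReport{
		FilesVerified:   res.FilesVerified,
		BrokenShards:    len(res.BrokenShards),
		IntegrityErrors: len(res.IntegrityErrors),
	}
	broken := append([]int(nil), res.BrokenShards...) // copy before sorting
	sort.Ints(broken)

	if !autoRepair || parityCount <= 0 {
		report.Unrepaired = broken // read-only mode: report only
		return report
	}
	n := len(broken)
	if n > parityCount {
		n = parityCount // assumption: cap at parity, report the remainder
	}
	report.QuarantinedShards = broken[:n]
	report.Unrepaired = broken[n:]
	return report
}

func main() {
	res := scrubResult{FilesVerified: 1200, BrokenShards: []int{12, 3}}
	fmt.Printf("%+v\n", planQuarantine(res, true, 4))
	fmt.Printf("%+v\n", planQuarantine(res, false, 4))
}
```

Keeping the plan separate from the unmount and delete calls makes the bound easy to check before anything destructive happens.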
Configuration
Admin config
| Field | Default | Purpose |
|---|---|---|
| collection_filter | (all collections) | Only scrub EC volumes in this collection |
| auto_repair | false | Quarantine corrupt shards for rebuild; off = read-only |
Worker config
| Field | Default | Purpose |
|---|---|---|
| min_interval_seconds | 300 | Skip detection if the last successful run is more recent |
Admin runtime defaults: detection every 10 minutes, detection timeout 300s, up to 500 jobs per detection, global execution concurrency 8 (2 per worker), execution/job runtime cap 1800s, 1 retry with 30s backoff.
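For readers who want these defaults in one place, the sketch below records them in a Go struct and derives how many workers are needed to reach the global concurrency cap. The struct, its field names, and the helper are hypothetical; only the numeric values come from the line above.

```go
package main

import (
	"fmt"
	"time"
)

// runtimeDefaults is a hypothetical container for the documented defaults.
type runtimeDefaults struct {
	DetectionInterval    time.Duration
	DetectionTimeout     time.Duration
	MaxJobsPerDetection  int
	GlobalConcurrency    int
	PerWorkerConcurrency int
	JobRuntimeCap        time.Duration
	Retries              int
	RetryBackoff         time.Duration
}

// documentedDefaults returns the values listed in this reference.
func documentedDefaults() runtimeDefaults {
	return runtimeDefaults{
		DetectionInterval:    10 * time.Minute,
		DetectionTimeout:     300 * time.Second,
		MaxJobsPerDetection:  500,
		GlobalConcurrency:    8,
		PerWorkerConcurrency: 2,
		JobRuntimeCap:        1800 * time.Second,
		Retries:              1,
		RetryBackoff:         30 * time.Second,
	}
}

// workersToSaturate returns the number of workers needed before the global
// concurrency cap, rather than the per-worker cap, becomes the limit.
func (d runtimeDefaults) workersToSaturate() int {
	if d.PerWorkerConcurrency <= 0 {
		return 0
	}
	// Integer ceiling of GlobalConcurrency / PerWorkerConcurrency.
	return (d.GlobalConcurrency + d.PerWorkerConcurrency - 1) / d.PerWorkerConcurrency
}

func main() {
	d := documentedDefaults()
	fmt.Println("workers needed to reach the global cap:", d.workersToSaturate())
}
```

With the documented values, 8 concurrent executions at 2 per worker means four workers are needed before the global cap is the limiting factor.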
Manual scrub (shell)
The same verification is available on demand from weed shell:
ec.scrub # scrub EC volume contents on volume servers
It can scrub only needle data or also deep-scrub file contents, and it can be limited to specific EC volume IDs on specific volume servers. By default, all EC volumes across all servers are processed.
Requirements & limits
- Enterprise-only: requires a valid license. Without one, detection is skipped and the job reports that it requires a valid enterprise license.
- Operates on erasure-coded volumes only. For corruption in replicated/normal volumes after an unclean shutdown, see Self-Healing Storage; to rebuild missing shards after disk/node loss, see Automatic EC Repair.
- Auto-repair can quarantine at most the parity count of shards per volume — corruption beyond what parity can rebuild is reported, not auto-repaired.
- The checksum sidecar is optional; a volume without a current-generation sidecar is “feature off” for checksum verification and is effectively skipped.
License expiry: if your enterprise license expires, EC Bitrot Scrub detection stops scheduling new scrubs and falls back to the open-source behavior.