Remote Volume Vacuum

SeaweedFS Enterprise includes remote volume vacuum, a background worker that reclaims wasted space in cloud-tiered volumes — volumes whose data has been offloaded to remote storage such as S3 with volume.tier.upload. When files in a tiered volume are deleted, the deleted bytes stay in the remote object and keep costing you cloud storage. Remote vacuum compacts those objects and hands the reclaimed space back, automatically.

The key design advantage is that the heavy work happens on dedicated workers, not on the volume servers. The worker pulls the remote object and the volume’s index down to its own working directory, compacts locally, uploads a fresh compacted object, and commits an atomic swap on the volume server. As a result, volume servers never need a full volume’s worth of free disk to compact a tiered volume, and never carry the CPU and I/O cost of the compaction.

Why it is needed

A cloud-tiered volume keeps its index (.idx) local but moves its data object to the cloud. Deletes against such a volume only mark the index; the bytes remain in the remote object until that object is compacted. The open-source flow has no automatic, low-cost way to reclaim them:

  • The regular vacuum can’t run — there is no local data file to compact.
  • The manual volume.tier.compact shell command works around this by downloading, compacting, and re-uploading on the volume server — which requires a full volume’s worth of free disk on the very node that tiering was meant to keep lean, and burns its I/O for the duration.

Remote vacuum moves that work to a worker with scratch disk and does it continuously.

When to use it

  • Churning cloud-tiered datasets — once deletes pile up in a tiered object, that space stays billed until it is compacted.
  • Large tiered volumes where waste adds up — a 30 GB tiered object that is 40% deleted drops to ~18 GB after vacuum.
  • Cost-sensitive cloud tiers — you pay for every byte in the remote object; reclaiming deleted bytes directly lowers the bill.
  • Hands-off operation — you want tiered-volume compaction to run in the background without manual volume.tier.compact runs and without stealing volume-server disk.
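
The 30 GB figure above is plain arithmetic: the compacted object keeps only the live fraction of the original. A minimal shell sketch of that estimate, assuming whole-GB sizes and an integer deleted percentage; the function name compacted_size_gb is hypothetical and is not a SeaweedFS command:

```shell
# Hypothetical helper (not part of SeaweedFS): estimate the size of a
# tiered object after vacuum from its current size and deleted percentage.
compacted_size_gb() {
  size_gb=$1       # current remote object size, whole GB
  deleted_pct=$2   # share of bytes that are deleted, 0-100
  # Integer arithmetic: keep only the live (non-deleted) fraction.
  echo $(( size_gb * (100 - deleted_pct) / 100 ))
}

compacted_size_gb 30 40   # prints 18
```

The result is an approximation, since it ignores per-object overhead and any tombstones retained for the deletion-retention window.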

How to use it

Remote vacuum runs as a plugin worker task, classed as a resource-intensive “heavy” task. Start one or more weed worker processes that connect to the admin server with either the heavy job type or the explicit remote_vacuum job type:

# Start a worker that handles heavy tasks, including remote vacuum
weed worker -admin=admin.example.com:23646 -jobType=heavy \
  -workingDir=/var/lib/seaweedfs-plugin -maxExecute=2

# Or select the remote vacuum task explicitly
weed worker -admin=admin.example.com:23646 -jobType=remote_vacuum \
  -workingDir=/var/lib/seaweedfs-plugin

Remote vacuum then scans on a schedule and compacts any cloud-tiered volume whose garbage ratio is past the threshold (default 30%). Give the worker’s -workingDir room for roughly twice the largest tiered object it will handle (the downloaded object plus the compacted output), and run at least two workers for availability.
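
The threshold and sizing guidance above can be checked with a few lines of shell before starting a worker. This is a rough sketch under stated assumptions: every function name is hypothetical (none is a SeaweedFS command), "past the threshold" is read as strictly greater than, and sizes are whole GB.

```shell
# Hypothetical planning helpers (not part of SeaweedFS).

# Succeeds if a volume's garbage ratio is past the threshold.
# Assumption: "past" means strictly greater than; threshold defaults to 30.
past_threshold() {
  garbage_pct=$1
  threshold_pct=${2:-30}
  [ "$garbage_pct" -gt "$threshold_pct" ]
}

# Scratch space to budget: the downloaded object plus the compacted
# output, approximated as twice the largest tiered object (whole GB).
required_scratch_gb() {
  echo $(( $1 * 2 ))
}

# Free space under a directory in whole GB, read from POSIX df output
# (second line, fourth column is the available 1024-byte blocks).
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds if the directory has room for the largest object to compact.
has_scratch_room() {
  [ "$(free_gb "$1")" -ge "$(required_scratch_gb "$2")" ]
}
```

For example, has_scratch_room /var/lib/seaweedfs-plugin 30 succeeds only if that directory has at least 60 GB free, and past_threshold 40 succeeds because 40% exceeds the default 30%.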

Benefits

  • Cloud-storage savings — reclaim deleted bytes in remote objects, directly lowering S3/object-store cost.
  • Volume server protection — the download, compaction, and upload run on workers, so volume servers never need spare disk to compact a tiered volume.
  • Automatic operation — runs continuously in the background instead of manual volume.tier.compact runs.
  • Crash-safe swap — the compacted object is switched in atomically under a durable commit marker, with roll-forward/roll-back recovery, so an interrupted vacuum never corrupts or loses data.
  • Retention-aware — honors the cluster deletion-retention window, preserving in-window tombstones so point-in-time and undelete features keep working.

Want the internals — the snapshot, compact, upload, and commit phases, safety guarantees, the full configuration table, worker flags, and limits? See the Remote Volume Vacuum technical reference.