Native RDMA reads — installation & operation

Technical reference for RDMA reads: how to install the RDMA-enabled binaries, enable the listener, and verify that reads are actually flowing over RDMA rather than falling back to TCP.

SeaweedFS Enterprise can serve volume reads over one-sided RDMA: the volume server registers each volume’s .dat file as a remote-readable memory region, and a client issues an RDMA READ straight out of it. The server’s CPU does no work on the data path of a read, and any read that isn’t RDMA-eligible falls back to HTTP/TCP transparently, so an RDMA build is always at least as capable as a plain one.

Requirements

  • Linux + rdma-core. The RDMA build dynamically links libibverbs and librdmacm, so the host needs the rdma-core runtime libraries even when RDMA is disabled — otherwise the binary won’t start. The Debian/RPM packages declare this dependency for you.
  • An RDMA-capable fabric: RoCE (RDMA over Converged Ethernet) or InfiniBand. For evaluation you can use SoftRoCE (rdma_rxe) over a normal NIC — no special hardware required (see Testing with SoftRoCE).
  • RDMA is a read acceleration. Writes always use HTTP.
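The library requirement can be checked before installing an -rdma asset. The following is a minimal sketch, assuming the host uses the glibc `ldconfig` cache; the function name `missing_rdma_libs` is hypothetical and not part of SeaweedFS.

```shell
# Print the rdma-core runtime libraries the dynamic linker cannot find.
# Reads `ldconfig -p` output on stdin, so it never touches the host itself.
missing_rdma_libs() {
  awk '
    /libibverbs\.so/ { verbs = 1 }
    /librdmacm\.so/  { cm = 1 }
    END {
      if (!verbs) print "libibverbs"
      if (!cm)    print "librdmacm"
    }'
}

# Usage on a real host:
#   ldconfig -p | missing_rdma_libs
# No output means both libraries are resolvable.
```

If either name is printed, install rdma-core (or the distribution package that provides it) before starting an RDMA build.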

Choose the right build

RDMA support is shipped as separate -rdma assets so non-RDMA hosts don’t have to carry the rdma-core dependency:

Component             Standard asset             RDMA asset
Volume server         weed-volume-enterprise_*   weed-volume-enterprise-rdma_*
Kernel-mount daemon   seaweedfs-vfs              seaweedfs-vfs-rdma

Install the -rdma variant only on hosts that have (or will have) an RDMA device. The kernel module, DKMS source, and node images are RDMA-agnostic — one module serves either daemon.
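As a rough aid for provisioning scripts, the choice can be driven by whether the kernel currently exposes any verbs device. This is a sketch under the assumption that RDMA devices appear under /sys/class/infiniband (the standard Linux location); `pick_variant` is a hypothetical helper, and it only reflects the present state of the host, not hardware that will be added later.

```shell
# Print "rdma" if the given sysfs directory lists at least one RDMA device,
# otherwise "standard". Defaults to the usual Linux verbs-device directory.
pick_variant() {
  dir=${1:-/sys/class/infiniband}
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo rdma
  else
    echo standard
  fi
}

# Usage:
#   pick_variant            # inspects /sys/class/infiniband
```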

Volume server

Install the RDMA volume-server asset (it requires rdma-core at runtime), then enable the listener:

weed-volume \
  --ip 10.0.0.6 --port 8080 --master <master>:9333 --dir /data/vol --max 100 \
  --rdma.enabled \
  --rdma.ip 10.0.0.6        # must be an IP on the RDMA netdev; defaults to --ip

Flag / env             Default   Notes
--rdma.enabled         off       Master switch. Off ⇒ HTTP/gRPC only.
--rdma.port            18516     RDMA listener port.
--rdma.ip              --ip      Must be an address on the RDMA netdev; binding loopback fails.
SWFS_RDMA_PORT (env)             Overrides --rdma.port at runtime.
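Because --rdma.ip must be an address on the RDMA netdev, it can help to confirm that before starting the server. The sketch below assumes the one-line-per-address format printed by `ip -o -4 addr show`; `ip_on_netdev` is a hypothetical helper name.

```shell
# Succeed if IPv4 address $1 is configured on interface $2.
# Reads `ip -o -4 addr show` output on stdin.
ip_on_netdev() {
  awk -v ip="$1" -v dev="$2" '
    $2 == dev && $3 == "inet" {
      split($4, a, "/")            # strip the /prefix length
      if (a[1] == ip) found = 1
    }
    END { exit(found ? 0 : 1) }'
}

# Usage (eth0 stands for whatever netdev `rdma link` reports for the device):
#   ip -o -4 addr show | ip_on_netdev 10.0.0.6 eth0 && echo "address is on the RDMA netdev"
```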

On startup the log shows the listener binding:

RDMA listener bound on 10.0.0.6:18516
RDMA listener active

If the bind fails (no device, wrong IP), the server logs the reason and serves HTTP only — clients keep working over TCP. RDMA is strictly opt-in and fail-safe.
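Since a failed bind leaves the server running in HTTP-only mode, a deployment check may want to look for the listener line explicitly. A minimal sketch, matching only the log text shown above; `rdma_listener_state` is a hypothetical name, and where the log comes from (journal, file) is left to the caller.

```shell
# Print "active" if the startup log on stdin contains the listener-active line,
# otherwise "http-only".
rdma_listener_state() {
  if grep -q 'RDMA listener active'; then
    echo active
  else
    echo http-only
  fi
}

# Usage (the log source is an assumption; substitute your own):
#   journalctl -u <unit> | rdma_listener_state
```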

Kernel-mount daemon

Install seaweedfs-vfs-rdma instead of seaweedfs-vfs. It packages the sw-rdma-kd daemon at the same path and reuses the same mount helper and systemd units, so it is a drop-in replacement (the two packages conflict — install one or the other). It requires rdma-core.

apt install ./seaweedfs-vfs-rdma_<ver>_<arch>.deb   # or the .rpm
mount -t seaweedvfs none /mnt/seaweed                # same as the non-RDMA daemon

The daemon attempts RDMA for every eligible read and falls back to HTTP otherwise — no client configuration is needed. Other RDMA clients can integrate against the same PrepareRdmaRead gRPC control plane.

How a read flows

  1. The client resolves the volume location from the filer, then calls PrepareRdmaRead on the volume server’s gRPC endpoint. The server validates the file’s cookie against its live index (this is the authorization boundary) and returns a short-lived capability: {rdma endpoint, remote address, rkey, length}.
  2. The client issues a one-sided RDMA READ of that range directly from the volume’s memory region. The server’s CPU is not involved in the transfer.
  3. The returned bytes are the file’s data payload.

A read returns “not eligible” (and the client uses HTTP) when the needle is gzip-compressed or a chunk manifest (in both cases the raw on-disk bytes wouldn’t match what the HTTP response returns), when the volume isn’t registered, or when RDMA is disabled. Manifest (range-less) reads always use HTTP.
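The eligibility rules amount to a small decision table. The sketch below restates them as a function for illustration only; `choose_read_path` and its yes/no arguments are hypothetical and do not correspond to any real SeaweedFS interface.

```shell
# Decide which transport a read would take under the rules described above.
# Arguments, each "yes" or "no":
#   $1 rdma_enabled  $2 volume_registered  $3 gzip_compressed  $4 chunk_manifest
choose_read_path() {
  if [ "$1" = yes ] && [ "$2" = yes ] && [ "$3" = no ] && [ "$4" = no ]; then
    echo rdma
  else
    echo http
  fi
}

# Usage:
#   choose_read_path yes yes no no     # plain needle on a registered volume
#   choose_read_path yes yes yes no    # gzip-compressed needle
```

Every condition has to hold for the RDMA path; any single failing condition is enough to send the read over HTTP.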

Consistency & safety

The memory region always tracks the volume’s current bytes: when a volume is deleted, compacted, rebuilt from EC shards, tiered out, or unmounted, its region is revoked or re-registered with a fresh key. A client holding a stale key gets a remote-access error on its READ and falls back to HTTP, so it can never read freed or relocated memory. The cookie + live-index check on every PrepareRdmaRead is the access-control boundary; a one-sided READ itself performs no checks.
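From a client's point of view, a stale key is just one more reason to take the fallback path. A minimal sketch of that pattern, assuming the caller supplies two commands (here called `rdma_read` and `http_read`, both hypothetical) that write the payload to stdout:

```shell
# Run the first command; if it fails for any reason (for example a
# remote-access error caused by a revoked key), run the second instead.
read_with_fallback() {
  "$1" || {
    echo "rdma read failed; HTTP fallback" >&2
    "$2"
  }
}

# Usage (rdma_read and http_read are caller-defined placeholders):
#   read_with_fallback rdma_read http_read > out.bin
```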

Verifying RDMA is actually used

  • Daemon log (sw-rdma-kd at debug level) shows native reads vs. fallback:

    rdma read OK: 4,01ce92fed8 (262144 B) via 10.0.0.6:8080      # native RDMA
    rdma read <fid> failed (...); HTTP fallback                  # fell back to TCP
    
  • Device counters confirm verbs traffic (TCP/HTTP does not touch these):

    watch -n1 'cd /sys/class/infiniband/<dev>/ports/1/hw_counters && cat sent_pkts rcvd_pkts'

    A read over RDMA bumps sent_pkts/rcvd_pkts by roughly size / MTU packets; an HTTP fallback leaves them unchanged.
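Both checks can be turned into quick numbers. The sketch below estimates the packet count for one read and tallies native versus fallback reads from a daemon log; it matches only the two log lines shown above, the helper names are hypothetical, and the MTU passed in is whatever path MTU your fabric actually negotiated.

```shell
# Approximate packets needed for one read: ceil(size / mtu), both in bytes.
expected_pkts() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Tally native vs. fallback reads from a daemon log on stdin.
tally_reads() {
  awk '
    /rdma read OK:/ { ok++ }
    /HTTP fallback/ { fb++ }
    END { printf "native=%d fallback=%d\n", ok, fb }'
}

# Usage:
#   expected_pkts 262144 1024      # a 256 KiB read with a 1024-byte MTU
#   <daemon log source> | tally_reads
```

A counter delta close to the estimate for a known read size, together with a non-zero native count, is reasonable evidence that reads are going over verbs.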

Testing with SoftRoCE

No RDMA NIC? Bring up SoftRoCE over an existing interface (kernel rdma_rxe):

sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev eth0
rdma link                       # rxe0 should be ACTIVE
ulimit -l unlimited             # let reg_mr pin memory regions (affects this shell and its children only)

Then point --rdma.ip at the interface’s address. SoftRoCE drives the same verbs path as hardware RDMA (with higher latency), so it’s a faithful way to validate an install end-to-end.
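For scripted test setups it can be handy to read the link state programmatically rather than by eye. A minimal sketch, assuming `rdma link` prints each link as `<dev>/<port>` followed somewhere on the same line by `state <STATE>` (the shape iproute2 uses); `rxe_link_state` is a hypothetical helper.

```shell
# Print the state (e.g. ACTIVE) of RDMA device $1 from `rdma link` output on stdin.
# Prints nothing if the device is not listed.
rxe_link_state() {
  awk -v dev="$1" '
    {
      hit = 0
      for (i = 1; i <= NF; i++) {
        if (index($i, dev "/") == 1) hit = 1      # field such as rxe0/1
        if (hit && $i == "state" && i < NF) { print $(i + 1); exit }
      }
    }'
}

# Usage:
#   [ "$(rdma link | rxe_link_state rxe0)" = ACTIVE ] || echo "rxe0 is not up"
```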