Surviving NFS in Kubernetes
A field guide to the fsync hell you didn't know was coming
I run a homelab Kubernetes cluster, and NFS is my primary PVC storage backend.
The reason is straightforward: when Pods get rescheduled across nodes, local disks create a problem. A PVC backed by local storage pins a Pod to a specific node. If you need to drain that node — say, to upgrade memory or replace hardware — you're stuck. NFS-backed PVCs mount from any node, so you can freely reschedule Pods wherever you want.
Sounds great in theory. In practice, NFS is a lot more unforgiving than people expect.
Here's what actually happened when I put the wrong things on NFS.
The Root Cause: fsync Latency
Almost every NFS-related failure I've encountered traces back to fsync latency.
On a local SSD, fsync typically completes in under 1ms. On NFS, an fsync means a full network round trip to the NAS server plus a disk flush on the other end. In a homelab environment, that can easily be 10–100ms.
The problem is that some software calls fsync synchronously. It issues the call and then waits — completely blocked — until the storage confirms the write is durable. On local disk, this wait is barely noticeable. On NFS, this wait stacks up.
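The gap is easy to measure yourself. Here's a minimal sketch; the mount points in the comments are placeholders for your own local disk and NFS paths:

```python
import os
import tempfile
import time

def fsync_latency(path: str, iterations: int = 100) -> float:
    """Return mean fsync latency in milliseconds for a file created under `path`."""
    fd, name = tempfile.mkstemp(dir=path)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b"x" * 4096)  # a small append, like a WAL write
            os.fsync(fd)               # block until the storage says it's durable
        return (time.perf_counter() - start) / iterations * 1000
    finally:
        os.close(fd)
        os.unlink(name)

# Compare a local disk against an NFS mount (example paths):
# print(fsync_latency("/tmp"))      # local SSD: typically well under 1 ms
# print(fsync_latency("/mnt/nfs"))  # NFS: often 10-100 ms in a homelab
```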
When fsync blocks the main thread long enough, the process stops responding to health checks. The kubelet notices the liveness probe is timing out. After a few consecutive failures, it sends SIGKILL. The container gets an Exit Code 137.
That's the pattern. Everything else is just a variation on it.
Incident 1: SQLite on NFS Took Down Sentry's Taskbroker
Sentry's taskbroker component uses SQLite internally. In WAL (Write-Ahead Logging) mode, SQLite calls fsync on the WAL file after every transaction commit. This is by design — it's how SQLite guarantees data durability.
On local disk, this is completely fine. On NFS, every transaction commit waits for a network round trip plus a remote disk flush. When taskbroker is busy processing jobs, fsync calls pile up. The component gets slow, stops responding, and the liveness probe kills it.
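The per-commit cost is easy to reproduce with Python's built-in sqlite3 module. This is a hypothetical toy queue, not taskbroker's actual schema; point the database path at an NFS mount and every commit below pays the round trip:

```python
import os
import sqlite3
import tempfile

# A toy job queue. In WAL mode with SQLite's default synchronous=FULL,
# every transaction commit fsyncs the WAL file.
path = os.path.join(tempfile.mkdtemp(), "queue.db")  # put this on NFS to feel it
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT)")

for i in range(100):
    with conn:  # one transaction per job: one commit, one fsync of the WAL
        conn.execute("INSERT INTO jobs (payload) VALUES (?)", (f"job-{i}",))

count = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(count)  # 100 rows committed, 100 fsyncs paid
conn.close()
```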
The fix was simple. The SQLite data in taskbroker isn't meant to be durable. It's a temporary job queue — if the Pod restarts, the queue refills. There's no reason to put it on NFS at all.
Switching to emptyDir solved it completely:
```yaml
sentry:
  taskBroker:
    persistence:
      enabled: false  # emptyDir instead of NFS PVC
```

Pod restarts, queue is gone, queue comes back. No one cares.
Takeaway: SQLite on NFS is a trap. If the data doesn't need durability, use emptyDir.
Incident 2: Redis AOF on NFS Took Down All of Sentry
This one happened today.
Sentry suddenly started falling apart. Checking events in the sentry namespace, this is what I found:
```
Warning  Unhealthy  pod/sentry-sentry-redis-master-0
  Liveness probe failed: Timed out
Normal   Killing    pod/sentry-sentry-redis-master-0
  Container redis failed liveness probe, will be restarted
```
Redis master failed its liveness probe 5 times in a row and got SIGKILLed (Exit Code 137). The replica lost its connection to the master and restarted 18 times. While Redis was unstable, Sentry's web pods and taskworkers couldn't connect and failed their own probes.
Looking at the previous container's logs:
```
* Asynchronous AOF fsync is taking too long (disk is busy?)
* Starting automatic rewriting of AOF on 17368% growth
```
The AOF file had grown 17368% and triggered an automatic rewrite. After that, the NFS disk I/O was saturated, and AOF fsync kept running slow.
Redis with appendfsync everysec (the default) calls fsync every second on the AOF file. On NFS, that fsync blocks Redis's main thread. The main thread blocked means no response to ping. No response to ping means the liveness probe times out. Kubelet kills the container.
Disk space wasn't the issue — the NFS volume had 3.2GB used out of 17TB. Pure latency.
The fix: set appendfsync no.
```yaml
redis:
  commonConfiguration: |-
    appendonly yes
    appendfsync no  # delegate fsync to the OS kernel
    save ""
```

With appendfsync no, Redis doesn't call fsync at all. The OS kernel handles flushing at its own pace. Redis's main thread never blocks, liveness probes pass, everything stays up.
The trade-off: if Redis or the NAS crashes, you might lose the last few seconds of writes. For Sentry's Redis — which is a cache and job queue — that's acceptable.
Takeaway: Redis AOF with appendfsync everysec on NFS is a time bomb. Switch to appendfsync no or disable AOF entirely.
Incident 3: Sentry Filestore Blocked a Node Drain
This one is the opposite — a case where not using NFS caused the problem.
When I initially set up Sentry's file storage (attachments, icons, etc.), I used the local-path StorageClass. It provisions PVCs from a node's local directory. Fast, simple to configure, and works fine — until you need to move the node.
When I needed to drain a node to upgrade its memory:
```
$ kubectl drain k3s-gs-worker1 --ignore-daemonsets --delete-emptydir-data
error: cannot delete Pods with local storage
```
Pods with local-path PVCs can't be drained away. kubectl refuses to evict them because the PVC's data exists only on that node: rescheduling the Pod elsewhere would sever it from its volume. The node couldn't be upgraded without force-deleting the Pod and losing the data.
I had to migrate the PVC to the NFS-backed StorageClass. That meant deleting the old PVC, running helm upgrade to recreate it on NFS, dealing with a Pod stuck in Terminating state, and eventually using force delete to break the deadlock. Not catastrophic, but entirely avoidable.
Takeaway: If a Pod needs to move between nodes, use NFS from the start. local-path is only appropriate when you're certain the data can stay tied to a single node.
Incident 4: Vault Escaped NFS for local-path
Going the other direction: Vault used to run on NFS PVCs. When I migrated to the new cluster, I switched it to local-path.
Vault uses the Raft consensus protocol internally. Raft writes a WAL file and calls fsync on it during leader elections and log commits. On NFS, slow fsync means slow leader elections. Slow leader elections mean an unstable cluster.
More importantly, Vault's HA story doesn't depend on shared storage. Three replicas across three regions — if one goes down, the other two maintain quorum. There's no reason to share storage at all. Fast local SSDs actually improve Raft's performance.
Old cluster: Vault StatefulSet + NFS PVC → slow fsync, instability risk
New cluster: Vault StatefulSet + local-path (SSD) → fast Raft, no shared storage needed
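As a sketch of the new layout (key names follow the hashicorp/vault Helm chart; the size and storage class name are specific to my cluster, substitute your own):

```yaml
server:
  dataStorage:
    enabled: true
    storageClass: local-path  # fast local SSD for Raft's WAL fsyncs
    size: 10Gi
  ha:
    enabled: true
    replicas: 3               # quorum survives one node going down
    raft:
      enabled: true
```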
Takeaway: If the application provides its own HA (Raft, replication, etc.), local-path on a fast disk beats NFS. Use NFS only when sharing is actually required.
The Summary: What Belongs on NFS and What Doesn't
Here's what I've landed on after these incidents.
Safe on NFS
| Type | Reason |
|---|---|
| Static files, media assets | Infrequent writes, sharing required |
| Shared config files | Rarely written |
| Data that needs to move across nodes | NFS's core value |
Dangerous on NFS
| Type | Why | Alternative |
|---|---|---|
| SQLite (WAL mode) | fsync on every commit | emptyDir (if no durability needed) |
| Redis AOF | fsync every second, blocks main thread | appendfsync no or disable AOF |
| Raft-based storage (Vault, etcd) | fsync-sensitive leader elections | local-path on SSD |
| High-write databases | fsync accumulation saturates I/O | Local disk or dedicated storage |
Questions to Ask Before Mounting on NFS
- Does this data actually need to be shared? If not, consider local-path or emptyDir.
- How often does this software call fsync? Databases and queue systems should always be suspected.
- How bad is it if recent writes are lost? Caches and queues can usually tolerate appendfsync no.
- Is the liveness probe timeout generous enough? fsync delays can cause spurious probe failures.
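On that last question, a sketch of a more forgiving probe for something like Redis (the numbers are illustrative, not chart defaults):

```yaml
livenessProbe:
  exec:
    command: ["redis-cli", "ping"]
  timeoutSeconds: 10   # the default is 1s; a single slow fsync is enough to miss it
  periodSeconds: 15
  failureThreshold: 5  # tolerate a burst of slow I/O before killing the Pod
```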
NFS is convenient. But putting fsync-sensitive software on NFS without thinking it through is a deferred failure. The incident doesn't happen at setup time — it happens weeks later when load increases and the disk gets busy.
The time to think about this is during design, not at 3am during an incident.