Pattern #14: Stale Lock File
Category: I/O & Persistence
Severity: High
Affected frameworks: LangChain / CrewAI / AutoGen / LangGraph / Custom
Average debugging time if undetected: 1 to 5 days (the system appears "hung" on a specific resource with no error message; the lock file is found only by inspecting the filesystem directly)
1. Observable Symptoms
A critical file that should be updated regularly has stopped being updated. The system appears to hang on a specific operation — config writes fail silently, state updates never complete, and logs show either timeouts or complete silence around the locked resource.
The most confusing symptom: there is no running process holding the lock. The lock was created by a process that crashed mid-operation. The lock file persists on disk, and every subsequent process that tries to acquire the lock waits forever (or times out and skips the operation).
Manual inspection of the filesystem reveals a .lock file or a *.lck file with a modification timestamp from hours or days ago. Deleting the lock file manually resolves the issue — until the next crash recreates it.
2. Field Story (anonymized)
An API gateway management system used a file lock to serialize writes to a shared routing configuration. The lock was acquired with open("config.lock", "x") (exclusive create) and released with os.remove("config.lock") after the write completed.
One night, the config writer process was killed by the OS OOM killer mid-write. The config.lock file remained on disk. From that point, every routing update attempt saw the lock file, assumed another process was writing, and waited. After a 30-second timeout, it silently skipped the update.
For 3 days, the routing configuration was frozen. New routes weren't added, stale routes weren't removed, and load balancing weights weren't updated. The team discovered the issue when a customer reported routing errors. The fix was rm config.lock — a 2-second operation that took 3 days to discover.
3. Technical Root Cause
The bug occurs when a lock mechanism doesn't handle the case where the lock holder crashes without releasing the lock:
# Dangerous pattern: manual lock with no crash protection
import json
import os
import time

def update_config(new_data: dict):
    # Acquire lock: spin until the lock file disappears (no timeout)
    lock_path = "config.lock"
    while os.path.exists(lock_path):
        time.sleep(0.1)  # Wait for lock to be released
    # Check-then-create is also racy: two waiters can both reach this line
    open(lock_path, "w").close()  # Create lock file
    try:
        # Write config
        with open("config.json", "w") as f:
            json.dump(new_data, f)
    finally:
        os.remove(lock_path)  # Release lock

# PROBLEM: if the process is killed after creating the lock and
# before the finally block runs, the lock file persists forever
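The fundamental problem: the lock is logically owned by a process, but the lock file is ordinary filesystem state that outlives that process. When the process dies, the lock should die with it, but a lock represented only by a file's existence has no such property. (Kernel-managed advisory locks such as fcntl.flock differ here: the kernel drops them when the holding process exits.)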
Common crash scenarios that leave stale locks:
- OOM killer terminates the process
- kill -9 (SIGKILL) bypasses Python's finally blocks and atexit handlers
- Power failure or system reboot
- Unhandled exception outside the try/finally block
- Deadlock in the write operation itself (process hangs, gets killed by a watchdog)
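The SIGKILL scenario above can be reproduced in a few lines. The following is a minimal sketch, not part of the original system: the helper name lock_survives_sigkill and the child script are hypothetical, and the child simply holds a lock file inside try/finally while the parent kills it. On POSIX, Popen.kill() sends SIGKILL, which is what kill -9 and the OOM killer deliver.

```python
import os
import subprocess
import sys
import time

# Child program: creates the lock, then "works" inside try/finally.
CHILD_SOURCE = """
import os, sys, time
lock_path = sys.argv[1]
open(lock_path, "w").close()   # acquire
try:
    time.sleep(60)             # simulated long write
finally:
    os.remove(lock_path)       # not reached when the process is SIGKILLed
"""


def lock_survives_sigkill(lock_path: str) -> bool:
    """Return True if the lock file is still on disk after its holder is killed."""
    child = subprocess.Popen([sys.executable, "-c", CHILD_SOURCE, lock_path])
    try:
        # Wait up to 10 seconds for the child to create the lock file.
        deadline = time.monotonic() + 10
        while not os.path.exists(lock_path):
            if time.monotonic() > deadline:
                raise RuntimeError("child never created the lock file")
            time.sleep(0.05)
    finally:
        child.kill()  # SIGKILL on POSIX; the child's finally block never runs
        child.wait()
    return os.path.exists(lock_path)
```

Calling lock_survives_sigkill on a scratch path is expected to return True, which is the stale lock described in this section; the caller has to delete the leftover file itself.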
4. Detection
4.1 Manual code audit
Search for lock file patterns without TTL or staleness detection:
# Find lock file creation (run from the repository root)
grep -rn "\.lock\|\.lck\|fcntl\.flock\|portalocker\|lockfile" --include="*.py" .
# Check if locks have timeout/TTL logic
grep -rn -A10 "\.lock" --include="*.py" . | grep -i "timeout\|ttl\|stale\|age\|expire"
If locks are created but no staleness/timeout logic exists, stale locks are possible.
4.2 Automated CI/CD
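Test that a simulated crash doesn't leave a permanent lock. This version checks the raw lock file, so it fails for any scheme whose file outlives its holder; for TTL- or PID-based locks, assert instead that a fresh acquisition succeeds after the crash:
import multiprocessing
import os
import signal
import tempfile

def _crashable_writer(lock_path):
    # Defined at module level so it can be pickled under the "spawn" start method
    open(lock_path, "w").close()          # Acquire lock
    os.kill(os.getpid(), signal.SIGKILL)  # Simulate crash (POSIX only); skips finally and atexit

def test_lock_released_after_crash():
    """Verify lock is cleaned up even if the holder crashes."""
    lock_path = os.path.join(tempfile.gettempdir(), "test.lock")
    if os.path.exists(lock_path):
        os.remove(lock_path)
    p = multiprocessing.Process(target=_crashable_writer, args=(lock_path,))
    p.start()
    p.join(timeout=5)
    assert p.exitcode is not None, "writer did not exit within 5s"
    # Lock should not persist after crash
    stale = os.path.exists(lock_path)
    if stale:
        os.remove(lock_path)  # Clean up so later runs start fresh
    assert not stale, "STALE LOCK: lock file persists after process crash"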
4.3 Runtime production
Periodic lock staleness checker:
import logging
import os
import time
from pathlib import Path

class LockStalenessChecker:
    """Detects and optionally cleans stale lock files."""

    def __init__(self, max_age_seconds: int = 300):
        self.max_age = max_age_seconds

    def check(self, lock_path: str) -> dict:
        path = Path(lock_path)
        try:
            age = time.time() - path.stat().st_mtime
        except FileNotFoundError:
            # No lock file, or it was released while we were looking
            return {"stale": False, "exists": False}
        if age > self.max_age:
            return {
                "stale": True,
                "exists": True,
                "age_seconds": round(age),
                "message": f"Lock {lock_path} is {age:.0f}s old (max {self.max_age}s)",
            }
        return {"stale": False, "exists": True, "age_seconds": round(age)}

    def clean_if_stale(self, lock_path: str) -> bool:
        # Age alone cannot tell a crashed holder from a slow one, so
        # max_age must exceed the longest legitimate hold time
        result = self.check(lock_path)
        if not result["stale"]:
            return False
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            return False  # Released between check and removal
        logging.warning("STALE LOCK REMOVED: %s", result["message"])
        return True
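To run such a check periodically across a whole directory tree rather than for one known path, a report-only scan is a cautious starting point. This is a minimal sketch under assumptions: the function name find_stale_locks, the *.lock / *.lck naming convention, and the 300-second default are illustrative, and the function only reports candidates instead of deleting them.

```python
import time
from pathlib import Path


def find_stale_locks(root: str, max_age_seconds: float = 300.0) -> list[tuple[str, int]]:
    """Return (path, age in seconds) for each *.lock / *.lck file older than the limit."""
    now = time.time()
    stale = []
    for pattern in ("*.lock", "*.lck"):
        for path in Path(root).rglob(pattern):
            try:
                age = now - path.stat().st_mtime
            except FileNotFoundError:
                continue  # released while the scan was running
            if age > max_age_seconds:
                stale.append((str(path), round(age)))
    return sorted(stale)
```

A scheduler or monitoring job could call this at a fixed interval and raise an alert for each returned path, leaving the decision to delete to a component that can also verify the holder is gone.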
5. Fix
5.1 Immediate fix
Add a TTL to the lock: if the lock file is older than N seconds, consider it stale and remove it. Choose N well above the longest legitimate hold time, because a TTL cannot distinguish a crashed holder from a slow one:
import os
import time

LOCK_TTL_SECONDS = 120

def acquire_lock(lock_path: str, timeout: float = 30) -> bool:
    """Acquire a lock with automatic stale detection."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Check for stale lock. The file can vanish between any two calls,
        # so FileNotFoundError just means "the lock was released"
        try:
            age = time.time() - os.path.getmtime(lock_path)
            if age > LOCK_TTL_SECONDS:
                os.remove(lock_path)  # Stale: remove it and retry
                continue
        except FileNotFoundError:
            pass
        # Try to create lock (O_EXCL makes creation atomic)
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(0.1)
            continue
        try:
            os.write(fd, str(os.getpid()).encode())
        finally:
            os.close(fd)
        return True
    return False  # Timeout

def release_lock(lock_path: str):
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
5.2 Robust fix
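Use a context manager that writes PID + timestamp and auto-cleans on stale detection. PID checks only work when every holder runs on the same host and can be fooled by PID reuse, so the TTL stays as a backstop:
import json
import os
import time
from contextlib import contextmanager

def _lock_is_stale(lock_path: str, ttl: float) -> bool:
    """Return True if the lock exists but its holder is dead or past the TTL."""
    try:
        with open(lock_path) as f:
            info = json.load(f)
        pid, created = int(info["pid"]), float(info["created"])
    except FileNotFoundError:
        return False  # No lock at all
    except (ValueError, KeyError, TypeError, OSError):
        # Unreadable or half-written: stale only after a 1s grace period,
        # so a lock that is being written right now is not deleted
        try:
            return time.time() - os.path.getmtime(lock_path) > 1.0
        except FileNotFoundError:
            return False
    # Check TTL
    if time.time() - created > ttl:
        return True
    # Check if holder is still alive (POSIX only: on Windows,
    # os.kill(pid, 0) terminates the target process)
    if pid > 0:
        try:
            os.kill(pid, 0)  # Signal 0 probes for existence, sends nothing
        except ProcessLookupError:
            return True  # Process dead
        except PermissionError:
            pass  # Process exists but belongs to another user
    return False

@contextmanager
def file_lock(lock_path: str, ttl: int = 120, timeout: int = 30):
    """Context manager with PID tracking and stale detection."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if _lock_is_stale(lock_path, ttl):
            try:
                os.remove(lock_path)  # Remove stale lock, then retry
            except FileNotFoundError:
                pass
            continue
        # Acquire
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(0.1)
            continue
        with os.fdopen(fd, "w") as f:
            json.dump({"pid": os.getpid(), "created": time.time()}, f)
        break
    else:
        raise TimeoutError(f"Could not acquire lock {lock_path} in {timeout}s")
    try:
        yield
    finally:
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            pass

# Usage (new_data is a placeholder for the dict to persist):
new_data = {"routes": []}
with file_lock("config.lock"):
    # Write directly. Calling update_config() from section 3 here would
    # hang, because it waits on the same config.lock this block holds
    with open("config.json", "w") as f:
        json.dump(new_data, f)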
6. Architectural Prevention
The safest approach: don't use file locks at all. Use SQLite with WAL mode (built-in locking), or a database transaction. File locks are inherently fragile because the lock's lifecycle isn't tied to the process lifecycle.
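As an illustration of the SQLite route, the sketch below stores the routing configuration as one JSON value and serializes writers with BEGIN IMMEDIATE. The table layout, the 'routing' key, and the function names are assumptions made for this example, not the original system's schema. SQLite's write lock is held through the operating system on behalf of the process, so a writer that is killed mid-transaction should leave no lock behind, and its uncommitted change is discarded.

```python
import json
import sqlite3


def open_config_db(db_path: str) -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode,
    # so the transaction boundaries below are explicit.
    # timeout=30: wait up to 30 seconds for a competing writer.
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def write_routing_config(conn: sqlite3.Connection, new_data: dict) -> None:
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES ('routing', ?)",
            (json.dumps(new_data),),
        )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise


def read_routing_config(conn: sqlite3.Connection) -> dict:
    row = conn.execute("SELECT value FROM config WHERE key = 'routing'").fetchone()
    return json.loads(row[0]) if row else {}
```

If a second writer cannot obtain the lock within the timeout, sqlite3 raises OperationalError, so a blocked update surfaces as an error rather than being skipped silently.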
If file locks are required (legacy compatibility), every lock must have:
1. TTL: maximum age before auto-removal (default: 2 minutes)
2. PID tracking: the holder's PID is written in the lock file, enabling stale detection via os.kill(pid, 0)
3. Watchdog: a separate process checks for stale locks every 60 seconds
4. Startup cleanup: on system boot, remove all lock files unconditionally (no process can be holding a lock if the system just started)
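Where the lock only has to coordinate processes on a single POSIX host with a local filesystem, a kernel-managed advisory lock is another option that needs neither TTL nor PID bookkeeping. The sketch below is one possible shape, assuming fcntl is available (so not Windows, and not NFS); the name kernel_lock and the polling interval are illustrative. The lock belongs to the open file descriptor, so the kernel releases it when the process exits for any reason, including SIGKILL. The file itself is never deleted, and its existence carries no meaning.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def kernel_lock(lock_path: str, timeout: float = 30.0):
    """Hold an exclusive flock() on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                # LOCK_NB makes the call fail immediately instead of blocking
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"Could not acquire lock {lock_path} in {timeout}s"
                    )
                time.sleep(0.1)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Usage mirrors the lock-file version: wrap the write in `with kernel_lock("config.lock"):`. Every process must use the same mechanism, since advisory locks do not stop a program that ignores them.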
7. Anti-patterns to Avoid
- Lock without TTL. A lock that can live forever is a system halt waiting to happen. Always set a maximum age.
- Lock without PID tracking. Without the holder's PID, there's no way to distinguish "held by a live process" from "held by a dead process."
- Relying on finally for lock release. finally doesn't execute on SIGKILL, OOM kill, or power failure. The lock must be self-healing via TTL.
- Blocking forever on lock acquisition. while os.path.exists(lock_path): sleep(0.1) without a timeout will hang the process indefinitely on a stale lock.
- Manual lock cleanup as a standard procedure. If the ops team regularly runs rm *.lock, the system needs a better lock mechanism.
8. Edge Cases and Variants
Variant 1: NFS lock files. File locking on NFS is unreliable. fcntl.flock may not work across NFS mounts. Use a database or a coordination service (Consul, etcd) for distributed locking.
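Variant 2: Windows lock files. On Windows, file locking semantics differ from Unix. os.open(path, O_CREAT | O_EXCL) works but fcntl doesn't exist. Use msvcrt.locking or portalocker for cross-platform compatibility. The os.kill(pid, 0) liveness probe is also Unix-only: on Windows, os.kill with any signal other than CTRL_C_EVENT or CTRL_BREAK_EVENT terminates the target process.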
Variant 3: Lock directory instead of lock file. os.mkdir("config.lockdir") is atomic on most filesystems. Cleaner than file-based locks but same stale problem applies.
Variant 4: Double crash. The lock holder crashes. A watchdog detects the stale lock and removes it. A new process acquires the lock. The new process also crashes. Now the lock is stale again, and the watchdog's interval determines how long the system is blocked.
9. Audit Checklist
- [ ] Every file lock has a TTL (max age before considered stale)
- [ ] Lock files contain the holder's PID for liveness checking
- [ ] A watchdog or startup script cleans stale locks automatically
- [ ] Lock acquisition has a timeout (never blocks forever)
- [ ] Tests simulate process crash and verify lock cleanup
10. Further Reading
- Related patterns: #11 (Race Condition on Shared File: locks are the fix for race conditions, stale locks are a bug in the fix), #04 (Multi-File State Desync: a stale lock can freeze a state file)
- Recommended reading:
  - Python portalocker library documentation: cross-platform file locking with timeout
  - "The Art of Multiprocessor Programming" (Herlihy & Shavit), chapter on lock-free algorithms: why file locks are a last resort