FLR Resync Troubleshooting Examples
FLR resync brings stale mirrors back in sync by copying data from an up-to-date mirror. It is needed after writes to a mirrored file and after OST failures or failovers. Common problems are resync failures caused by space or network errors, incomplete resyncs, and stale mirrors that go undetected. This guide provides examples for Lustre 2.17.0 (January 2026), drawing on the Lustre Operations Manual (updated 2025) and community resources such as the wiki and man pages. Always check the system logs (/var/log/messages) and use lctl debug_daemon when debugging.
Common Issues and Error Codes
| Issue | Error Code | Description/Troubleshooting |
|---|---|---|
| Resync Fails with ENOSPC | -28 (ENOSPC) | No space on stale mirror OSTs. Check lfs df -h; migrate data or deactivate full OSTs. |
| Resync Hangs/Timeouts | -110 (ETIMEDOUT) | Network issues or slow OSTs. Monitor lnetctl stats show; tune at_max. |
| No Stale Mirrors Detected | - | Resync does nothing if no stales. Verify with lfs find --mirror-state=^ro. |
| Incomplete Resync | -5 (EIO) | I/O errors during copy. Run lfs mirror verify; check OST health with lctl get_param obdfilter.*.degraded. |
| Resync Performance Slow | - | High load or small RPCs. Tune osc.*.max_pages_per_rpc=1024; use jobstats to monitor. |
| Mirrors Remain Stale Post-Resync | - | Concurrent writes mark mirrors stale again. Rerun resync; stop or lock out writers first if needed. |
| Verify Fails with Mismatch | - | Data corruption. Compare checksums; restore from backup or resync again. |
| Client Cannot Access During Resync | -107 (ENOTCONN) | Usually temporary; reads are served from up-to-date mirrors. Wait, or check that the client version is ≥2.11. |
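When a log line or script reports only a bare negative return code, a small lookup can map it back to the rows above. The following is a minimal sketch: the function name `flr_errno_hint` is hypothetical, and the hints simply restate the table.

```shell
#!/bin/sh
# flr_errno_hint: hypothetical helper that maps a return code from the
# table above to its errno name and a first troubleshooting step.
flr_errno_hint() {
    code=${1#-}   # accept "28" as well as "-28"
    case "$code" in
        28)  echo "ENOSPC: check 'lfs df -h' for full OSTs" ;;
        110) echo "ETIMEDOUT: check 'lnetctl stats show' and at_max" ;;
        5)   echo "EIO: run 'lfs mirror verify' and check obdfilter.*.degraded" ;;
        107) echo "ENOTCONN: usually temporary; check client version (>= 2.11)" ;;
        *)   echo "unknown: code $1 is not covered by the table"; return 1 ;;
    esac
}
```

For example, `flr_errno_hint -28` prints the ENOSPC hint, and any code outside the table returns a non-zero status.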
Diagnostic Tools
| Tool/Command | Use |
|---|---|
| lfs find --mirror-state=^ro | Find stale files/mirrors. |
| lfs getstripe -v | View mirror states (stale, prefer). |
| lfs mirror verify -v | Check consistency post-resync. |
| lctl dk logfile | Capture debug logs during resync. |
| lctl get_param obdfilter.*.job_stats | Monitor resync load on the OSS (enable jobstats first: lctl set_param -P jobid_var=procname_uid on the MGS). |
| lfs df -h | Check space on mirrors. |
| lnetctl stats show | Network errors during resync. |
| lctl lfsck_start -M testfs-MDT0000 -t layout | Repair layout inconsistencies (run on the MDS; -M names the MDT device). |
Examples
Example 1: Resync Failure Due to No Space
# Attempt resync
lfs mirror resync /mnt/lustre/file1
# Error: ENOSPC
# Diagnose
lfs df -h # Shows mirror OST full
lfs getstripe /mnt/lustre/file1 # Identify stale mirror OSTs
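# Fix: add a new mirror on OSTs with free space (use -o <ost_idx> or -p <pool> to choose them)
lfs mirror extend -N /mnt/lustre/file1
# Split off and destroy the stale mirror on the full OST (mirror ID 2 here; confirm with lfs getstripe)
lfs mirror split -d --mirror-id 2 /mnt/lustre/file1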
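The "which OST is full" step in Example 1 can be scripted by filtering `lfs df` output. This is a minimal sketch: the helper name `full_osts` is hypothetical, and it assumes the usual column layout in which the fifth field is `Use%`.

```shell
#!/bin/sh
# full_osts: print the UUID and Use% of every OST at or above a threshold.
# Reads 'lfs df' style output on stdin. Assumption: Use% is field 5.
full_osts() {
    threshold=${1:-90}
    awk -v t="$threshold" '
        /OST/ && $5 ~ /%$/ {
            pct = $5
            sub(/%/, "", pct)          # strip the percent sign
            if (pct + 0 >= t) print $1, $5
        }'
}
```

Typical use would be `lfs df /mnt/lustre | full_osts 90`. MDT lines and the summary line are skipped because they do not contain `OST`.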
Example 2: Incomplete Resync After OST Failover
# After failover, mark the OST as degraded (run on the OSS)
lctl set_param obdfilter.testfs-OST0005.degraded=1
# Find stales
lfs find --mirror-state=^ro /mnt/lustre/dir > stale_list.txt
# Resync the batch (one path per line; -d '\n' keeps paths with spaces intact)
xargs -d '\n' lfs mirror resync < stale_list.txt
# If incomplete (EIO in logs)
lctl dk > resync_log.txt # Check for I/O errors
# Fix: clear the degraded flag once the OST has recovered (run on the OSS)
lctl set_param obdfilter.testfs-OST0005.degraded=0
lfs mirror resync /mnt/lustre/dir/file2 # Rerun
# Verify
lfs mirror verify -v /mnt/lustre/dir/file2
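For a batch like the one in Example 2, it helps to know which files failed so that only those are retried after the OST recovers. The following sketch assumes that `lfs mirror resync` exits non-zero on failure; the function name `resync_list` and the `LFS` override variable are hypothetical.

```shell
#!/bin/sh
# LFS can be overridden for dry runs; it defaults to the real lfs binary.
LFS=${LFS:-lfs}

# resync_list: read one path per line on stdin, resync each file, and
# print the paths that failed. Returns non-zero if any file failed.
resync_list() {
    rc=0
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        # </dev/null stops the command from consuming the path list;
        # its own stdout goes to stderr so stdout holds only failed paths.
        if ! "$LFS" mirror resync "$path" </dev/null >&2; then
            echo "$path"
            rc=1
        fi
    done
    return "$rc"
}
```

A run such as `resync_list < stale_list.txt > failed_list.txt` leaves the files that still need attention in `failed_list.txt`.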
Example 3: Slow Resync Performance
# Monitor during resync
llstat -i 5 ost.OSS.ost_io
# If slow, tune RPCs
lctl set_param osc.*.max_pages_per_rpc=1024
# Enable jobstats (run on the MGS; -P makes the setting persistent)
lctl set_param -P jobid_var=procname_uid
lljobstat # Check resync ops
# Afterwards, clear the client stats
lctl set_param osc.*.stats=clear
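Because max_pages_per_rpc is counted in pages, the resulting RPC size depends on the client page size. A short worked calculation, assuming 4096-byte pages in the examples (the helper name `rpc_bytes` is hypothetical):

```shell
#!/bin/sh
# rpc_bytes: bulk RPC size in bytes for a given max_pages_per_rpc.
# The optional second argument is the page size; it defaults to the
# local page size reported by getconf.
rpc_bytes() {
    pages=$1
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( pages * page_size ))
}

rpc_bytes 1024 4096    # 4194304 bytes = 4 MiB
rpc_bytes 256 4096     # 1048576 bytes = 1 MiB
```

On clients with larger pages, the same setting of 1024 yields a proportionally larger RPC, so check the page size before copying the value between architectures.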
Example 4: Mismatch After Verify
# Verify fails
lfs mirror verify /mnt/lustre/file1 # Shows mismatch
# Diagnose
lfs getstripe -v /mnt/lustre/file1 # Check layout
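# Fix: resync the bad mirror from a good one. --only names the mirror(s) to rewrite, not the source,
# and resync skips mirrors that are not flagged stale. Assuming mirror 1 is good and mirror 2 is stale:
lfs mirror resync --only 2 /mnt/lustre/file1
# If corruption persists, restore from backup; for layout inconsistencies, run LFSCK on the MDS
lctl lfsck_start -M testfs-MDT0000 -t layout -A -o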
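To find out which mirror differs when verify reports a mismatch, the mirrors can be hashed one at a time. This sketch assumes that `lfs mirror read -N <id> <file>` writes the selected mirror to stdout when no output file is given; the function names and the `LFS` override are hypothetical.

```shell
#!/bin/sh
LFS=${LFS:-lfs}

# mirror_sum: print the SHA-256 of one mirror's data.
# Usage: mirror_sum <mirror_id> <file>
# If the read fails, a marker naming the mirror is appended to the stream
# so that two failed reads cannot produce matching sums by accident.
mirror_sum() {
    { "$LFS" mirror read -N "$1" "$2" || echo "read-failed-$1"; } \
        | sha256sum | cut -d' ' -f1
}

# mirrors_match: succeed only if two mirrors of a file hash identically.
# Usage: mirrors_match <file> <mirror_id_a> <mirror_id_b>
mirrors_match() {
    file=$1
    [ "$(mirror_sum "$2" "$file")" = "$(mirror_sum "$3" "$file")" ]
}
```

For example, `mirrors_match /mnt/lustre/file1 1 2 || echo "mirrors 1 and 2 differ"`. Each call reads the whole mirror, so on large files this is as expensive as a verify pass.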
Best Practices
- Automate resync post-failover using scripts in HA tools.
- Place mirrors in separate fault domains so that a correlated failure cannot take out every mirror of a file at once.
- Use ChangeLog to track stale events: lfs changelog.
- Schedule periodic verify/resync jobs.
- Monitor logs for "LustreError" during resync.
- For large files, limit resync concurrency to avoid overload.
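The concurrency limit in the last item can be enforced with `xargs -P`. A minimal sketch, assuming one path per line and no newlines inside file names; the function name and the `JOBS` and `RESYNC_CMD` variables are hypothetical.

```shell
#!/bin/sh
# resync_parallel: read one path per line on stdin and run the resync
# command once per path, with at most JOBS commands running at a time.
# RESYNC_CMD is left unquoted on purpose so it can carry arguments.
resync_parallel() {
    tr '\n' '\0' | xargs -0 -n 1 -P "${JOBS:-2}" ${RESYNC_CMD:-lfs mirror resync}
}
```

A typical call would be `lfs find --mirror-state=^ro /mnt/lustre/dir | resync_parallel` after setting `JOBS=4`. Start with a small value and raise it only while OSS load stays acceptable.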
For updates, check JIRA (e.g., LU-19758 for related issues). No major resync changes in 2.17.