Lustre Troubleshooting Guide

This guide covers common Lustre issues, error codes, debugging tools, and recovery procedures for Lustre 2.17.0 and 2.15.x (as of January 2026). It draws on the Lustre Operations Manual (updated August 2025), wiki tips, and LUG 2025 discussions (e.g., DNE3 for metadata scaling).

All error codes are standard Linux errno values, returned as negative numbers; 0 indicates success, and Lustre does not define custom positive error codes. The meaning of a code depends on the subsystem that reports it: Client (file I/O, RPCs), MDS (metadata, permissions), OST (data storage), MGS (configuration), or LNet (networking). Errors appear in the system log (/var/log/messages) prefixed with "LustreError" and include the function, module, PID, nodes, and RPC details. Use lctl debug_daemon to capture the kernel debug log. For recent bugs, search JIRA (e.g., LU-19758 for ext4 issues in 2025).
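
As a rough illustration of reading such log entries, the Python sketch below pulls the return code out of a "LustreError" line. It assumes the code is written as "rc = -N" (or "rc -N") somewhere in the message; that layout, the helper name extract_rc, and the sample line are assumptions made for this example, and real messages vary by release and subsystem.

```python
import re

# Assumption: the return code is written as "rc = -N", "rc=-N", or "rc -N".
# Real LustreError lines vary by release and subsystem.
RC_PATTERN = re.compile(r"\brc\s*[= ]\s*(-?\d+)\b")


def extract_rc(line):
    """Return the integer return code from a LustreError line, or None."""
    if "LustreError" not in line:
        return None  # not an error line at all
    match = RC_PATTERN.search(line)
    return int(match.group(1)) if match else None


# Hypothetical sample line written for this example (not captured output).
SAMPLE = (
    "LustreError: 11-0: testfs-OST0000-osc: operation ost_connect "
    "to node 10.0.0.1@tcp failed: rc = -19"
)

if __name__ == "__main__":
    print(extract_rc(SAMPLE))
```

Lines without the "LustreError" prefix, or without a recognizable return code, yield None, so the helper can be applied to a whole log file without pre-filtering.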

Common Issues and Error Codes

Every code below corresponds to a value in Linux errno.h; the meaning given here is specific to the Lustre context. In the source code, functions signal failure by returning -ERRNO (e.g., in ptlrpc for RPCs, obdclass for devices, ldlm for locks).
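
The sign convention can be sketched in Python as follows. The lookup table is hard-coded with the Linux errno numbers that appear in this guide rather than taken from the local platform, because errno numbering differs between operating systems; the function name describe_rc is hypothetical.

```python
# Linux errno numbers used in this guide. Hard-coded on purpose: the values
# reported by Lustre follow Linux numbering regardless of where this runs.
LINUX_ERRNO = {
    1: "EPERM", 2: "ENOENT", 3: "ESRCH", 4: "EINTR", 5: "EIO",
    12: "ENOMEM", 13: "EACCES", 14: "EFAULT", 19: "ENODEV",
    22: "EINVAL", 28: "ENOSPC", 30: "EROFS", 43: "EIDRM",
    95: "EOPNOTSUPP", 107: "ENOTCONN", 108: "ESHUTDOWN",
    110: "ETIMEDOUT", 111: "ECONNREFUSED", 122: "EDQUOT", 126: "ENOKEY",
}


def describe_rc(rc):
    """Translate a Lustre return code into a short symbolic description."""
    if rc == 0:
        return "success"
    if rc > 0:
        # Lustre does not define positive error codes.
        return f"unexpected positive value {rc}"
    name = LINUX_ERRNO.get(-rc)
    return f"-{name}" if name else f"unlisted errno {-rc}"


if __name__ == "__main__":
    for code in (0, -28, -110, -999):
        print(code, describe_rc(code))
```

Codes that are not in the table are reported as unlisted rather than guessed, since the same number can carry different meanings in different subsystems.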

Issue | Subsystem/Component | Error Code | Description/Troubleshooting
Operation not permitted | General (Client/MDS) | -1 (EPERM) | Permission denied; check ACLs, nodemap, or SELinux policies.
OST object missing/damaged | OST/Client | -2 (ENOENT) | No such file/entry; check /lost+found on OST. Salvage with LFSCK; unlink corrupted objects.
Operation interrupted | General | -4 (EINTR) | Interrupted by signal (e.g., CTRL-C); retry operation.
I/O failure | OST/Client | -5 (EIO) | Read/write failed; inspect hardware, run e2fsck on ldiskfs. Common in storage errors.
No such device | OST/MDS/Client | -19 (ENODEV) | Server stopped/failed over; verify with lctl device_list.
Invalid argument | General (API/PTLRPC) | -22 (EINVAL) | Bad parameter (e.g., lfs setstripe); validate inputs.
Out of space/inodes | OST/MDS | -28 (ENOSPC) | No space/inodes; use lfs df -h/-i. Migrate with lfs migrate; check open files on MDS.
OST read-only | OST | -30 (EROFS) | Filesystem read-only due to error; fix hardware, restart services.
Identifier removed | MDS/Client | -43 (EIDRM) | UID/GID mismatch; update /etc/passwd, /etc/group on MDS.
Not connected | LNet/Client | -107 (ENOTCONN) | Client-server disconnect; check lctl ping, peers.
Shutdown in progress | LNet/Client | -108 (ESHUTDOWN) | I/O during shutdown; apps may not handle; retry post-recovery.
Connection timed out | LNet/Client/Server | -110 (ETIMEDOUT) | RPC/network timeout; tune at_max/at_min, check network congestion.
Quota exceeded | Quota (MDT/OST) | -122 (EDQUOT) | Disk quota hit; use lfs quota to check/adjust limits.
No quota found | Quota | -3 (ESRCH) | Quotas not enabled; configure on filesystem.
Out of memory | General (OST/Client) | -12 (ENOMEM) | Kernel OOM; check logs, increase RAM or tune threads.
Bad address | API/Client | -14 (EFAULT) | Memory access issue; ensure buffers allocated properly.
Not supported | API/Server | -95 (EOPNOTSUPP; the kernel-internal ENOTSUPP is 524) | Feature not supported (e.g., old server); check Lustre version.
No key | Client (File/directory encryption) | -126 (ENOKEY) | File/directory encryption key missing; use fscrypt unlock.
Operation not supported | MDT (ACL) | -95 (EOPNOTSUPP) | ACL not enabled; configure on MDS, or unknown RPC opcode.
Back in time error | MDT/OST | -5 (EIO) or -30 (EROFS) | Transaction loss, filesystem read-only; run e2fsck to repair.
Slow page write | Client | -110 (ETIMEDOUT) | Memory allocation delays; tune VM parameters.
Watchdog timeout | Any | -110 (ETIMEDOUT) | Slow ops (e.g., RAID rebuild); capture stack with lctl dk.
Timeouts on setup | Client/MGS/LNet | -110 (ETIMEDOUT) | Firewall/DNS issues; check port 988, hosts.deny.
No matching NID | LNet | -22 (EINVAL) | Config mismatch; verify networks/routes with lnetctl.
Mount failure | Client | -107 (ENOTCONN) or -110 (ETIMEDOUT) | Firewall/hosts.deny; check syslogs, lctl which_nid.
Dead routers | LNet | -107 (ENOTCONN) | Enable router_checker; set auto_down=1.
Asymmetric routes | LNet | -107 (ENOTCONN) | Unknown routers; check drops with lctl get_param stats.
Changelog overload | MDT | -28 (ENOSPC) | Purge records; lfs changelog_clear.
Client eviction | Client (LDLM) | -107 (ENOTCONN) | DLM/ping failures; check connectivity.
Server crash | OSS/MDS | -5 (EIO) | Journal replay on recovery; check for LBUG in logs.
Memory/lock contention | Client/Server (LDLM) | -12 (ENOMEM) | High locks; clear LRU with ldlm.namespaces.*.lru_size=clear.
File fragmentation | Any | -5 (EIO) | Aged FS; use filefrag -v.
Striping imbalance | OSTs | -28 (ENOSPC) | Variance >17%; use weighted allocation.
Inactive/Degraded OST | OST | -5 (EIO) or -30 (EROFS) | Deactivate; migrate data with lfs_migrate.
MMP conflict | Any | -22 (EINVAL) | Multiple mounts; delay mount >=10s.
Root squash fail | MDT | -1 (EPERM) | Untrusted clients; add to nodemap.
Namespace failure | MDT | -2 (ENOENT) | Subdir/fileset missing; check DNE config.
SELinux denial | Client | -13 (EACCES) | Policy mismatch; use l_getsepol.
Kerberos failure | Any (SEC) | -13 (EACCES) or -126 (ENOKEY) | Clock sync/FQDN issues; check GSS.
ZFS snapshot fail | OST | -110 (ETIMEDOUT) | Barrier/timeout; check inconsistencies.
HSM agent unresponsive | MDT (HSM) | -110 (ETIMEDOUT) | Copytools block; timeout 3600s; set mdt.*.hsm_control=disabled.
PCC exhaustion | Client (PCC) | -28 (ENOSPC) | Fallback to normal I/O; detach with lfs pcc detach.
Nodemap inconsistency | MDT | -1 (EPERM) | Unmapped NIDs squashed; modify with lctl nodemap_modify.
SSK key issues | Client (SEC) | -111 (ECONNREFUSED) or -22 (EINVAL) | Invalid keys/HMAC; verify with lgss_sk -r; check nodemap.
Performance anomalies | OST | -5 (EIO) | Faulty hardware; check variance with obdfilter-survey.
Data loss risk (I/O Kit) | OST/MDT | -5 (EIO) | Avoid on production; overwrites devices.
Script leaks | I/O Kit | -12 (ENOMEM) | Manual cleanup needed.
Module load fail | LNet | -2 (ENOENT) | Modprobe obdecho explicitly.
cYAML allocation fail | LNet | -12 (ENOMEM) | No memory for blocks/buffers.
Invalid network | LNet | -22 (EINVAL) | Non-existent net; check lnetctl net show.
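
The 17% figure in the "Striping imbalance" row can be turned into a simple check, sketched below in Python. This is only one possible reading: it measures imbalance as the gap between the highest and lowest free-space fraction across OSTs, which is an assumption for illustration and may not match how Lustre itself computes it. The OST names and sizes are made up, and the function names are hypothetical.

```python
def imbalance(osts):
    """Gap between the fullest and emptiest OST, as a fraction of capacity.

    osts maps an OST name to (available_kb, total_kb).
    """
    fractions = [avail / total for avail, total in osts.values() if total > 0]
    if not fractions:
        return 0.0
    return max(fractions) - min(fractions)


def is_imbalanced(osts, threshold=0.17):
    """True when the free-space gap exceeds the threshold (17% by default)."""
    return imbalance(osts) > threshold


# Hypothetical figures for illustration only.
EXAMPLE_OSTS = {
    "testfs-OST0000": (800, 1000),  # 80% free
    "testfs-OST0001": (500, 1000),  # 50% free
}

if __name__ == "__main__":
    print(is_imbalanced(EXAMPLE_OSTS))
```

In practice the available and total figures would come from lfs df on a client; parsing that output is left out here because its exact layout is not described in this guide.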

Debugging Tools

Tool | Use | Key Commands
lctl | Admin params, debug, health | lctl get_param, debug_kernel, lfsck_start
lnetctl | LNet config/health | net show --verbose, import FILE.yaml
lfs | User/client ops, quotas, migration | df -h/-i, migrate, quota
debugfs | Disk inspection | debugfs -c -R "stat ..."
strace | Syscall tracing | strace -f program
e2fsck/tune2fs | ldiskfs check, repair, tuning | e2fsck -f, tune2fs -O mmp
dumpe2fs | Emergency ldiskfs debug, recovery | dumpe2fs -h
llstat/llobdstat | Stats | llstat -i 1 ost
collectl/lltop | Monitoring | Config via collectl
perf | Profiling | perf top
lmt | Top-like monitor view | GitHub install
filefrag | Fragmentation | filefrag -v
Wireshark | Packet capture | -
kdump | Crash analysis | -
lgss_sk/keyctl | SSK configuration and debug | lgss_sk -r
fscrypt | File/directory encryption | fscrypt status
l_getsepol | SELinux | -
llsom_sync | Sync lazy file size to MDTs | -
lljobstat | JobID monitor (2.15+) | lljobstat -c 10
ost-survey etc. | Benchmarking | ./ost-survey.sh
stats-collect | Profiling | gather_stats_everywhere.sh
llog_reader | Config log dump/debug | llog_reader /mnt/mgs/...
llverdev | Block device integrity verification | llverdev -v
lshowmount | Show mounted clients on server | lshowmount -v
lst | LNet test | lst new_session
cYAML utils | LNet YAML | lustre_yaml_show
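
When several of these tools need to be run together, a small wrapper can gather their output in one pass. The Python sketch below is a minimal example under stated assumptions: the three commands are ones mentioned in this guide, the helper names are hypothetical, and a tool that is not installed on the node is skipped rather than treated as an error.

```python
import shutil
import subprocess

# Read-only commands mentioned in this guide; adjust to the node's role.
SNAPSHOT_COMMANDS = [
    ["lctl", "device_list"],
    ["lnetctl", "net", "show", "--verbose"],
    ["lfs", "df", "-h"],
]


def run_if_available(argv, timeout=30):
    """Run argv and return its stdout.

    Returns None if the binary is missing, cannot be started, or times out.
    """
    if shutil.which(argv[0]) is None:
        return None
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout


def collect_snapshot(commands=None):
    """Map each command line to its output (None where it could not run)."""
    if commands is None:
        commands = SNAPSHOT_COMMANDS
    return {" ".join(argv): run_if_available(argv) for argv in commands}


if __name__ == "__main__":
    for command, output in collect_snapshot().items():
        print(f"== {command} ==")
        print(output if output is not None else "(not available)")
```

The timeout matters on a degraded filesystem, where a command such as lfs df can block for a long time while a server is unreachable.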

Client Troubleshooting

Server Troubleshooting

For more, see Lustre Manual Error Numbers.