LNet Health Monitoring Details

LNet health monitoring detects and responds to failures in network interfaces (NIs), peers, and routers, enabling automatic failover in multi-rail configurations. It was introduced in Lustre 2.12, extended to routing in 2.13, and remains stable through 2.17.0 (as of January 2026). This guide covers the architecture, parameters, commands, and examples drawn from the Lustre Operations Manual (updated 2025); see that manual for further detail.

Architecture

Health Values

| Value | Range/Meaning |
|---|---|
| Health Value (per NI/Peer) | 0-1000; starts at 1000; decrements on failure, increments on success. |
| Status | up/down; shown in lnetctl outputs. |
| Failure Counters | dropped, resend_count, local_error_count (via stats). |
| Router Health | Determined by pings; auto_down marks down. |
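
The health value arithmetic described above can be sketched as a small model. This is an illustration only, not LNet code: the function names are hypothetical, and the recovery step of 1 per successful ping is an assumption that may differ between releases.

```shell
#!/bin/sh
# Illustrative model of the per-NI health value; this is not LNet code.
MAX_HEALTH=1000   # initial and maximum health value

# health_on_failure CURRENT SENSITIVITY
# Subtract SENSITIVITY from CURRENT, never going below 0.
health_on_failure() {
    new=$(( $1 - $2 ))
    if [ "$new" -lt 0 ]; then new=0; fi
    echo "$new"
}

# health_on_success CURRENT [STEP]
# Add STEP to CURRENT, never exceeding MAX_HEALTH.
# STEP defaults to 1, which is an assumption of this sketch.
health_on_success() {
    new=$(( $1 + ${2:-1} ))
    if [ "$new" -gt "$MAX_HEALTH" ]; then new=$MAX_HEALTH; fi
    echo "$new"
}

health_on_failure 1000 100   # prints 900
health_on_failure 50 100     # prints 0 (clamped at the floor)
health_on_success 999        # prints 1000
```

Passing a sensitivity of 0 to health_on_failure leaves the value unchanged, which mirrors how a health_sensitivity of 0 disables the feature.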

Parameters

| Parameter | Default | Description |
|---|---|---|
| health_sensitivity | 100 | Decrement amount on failure; 0 disables. |
| recovery_interval | 1 | Seconds between recovery pings. |
| transaction_timeout | 30 | Message timeout in seconds. |
| retry_count | 2 | Retries for recoverable failures. |
| peer_timeout | 180 | Seconds before aliveness query. |
| avoid_asym_router_failure | 1 | Requires healthy remote NI for route up. |
| alive_router_check_interval | 60 | Router ping interval (seconds). |
| check_routers_before_use | 0 | Enable pre-use router checks. |
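
To get a feel for how health_sensitivity and recovery_interval interact, the arithmetic can be worked out in a few lines of shell. This is a rough sketch: the helper names are hypothetical, and it assumes every recovery ping succeeds and restores a fixed step (1 by default here, an assumption rather than a documented constant).

```shell
#!/bin/sh
# Back-of-the-envelope arithmetic for health tuning; not LNet code.

# failures_to_zero SENSITIVITY
# Consecutive failures needed to take a health value from 1000 down to 0.
failures_to_zero() {
    if [ "$1" -le 0 ]; then
        echo "never"          # sensitivity 0 disables decrements
        return
    fi
    echo $(( (1000 + $1 - 1) / $1 ))   # ceiling of 1000 / SENSITIVITY
}

# seconds_to_full HEALTH INTERVAL [STEP]
# Seconds for HEALTH to climb back to 1000 when each recovery ping,
# sent every INTERVAL seconds, succeeds and adds STEP (assumed 1).
seconds_to_full() {
    step=${3:-1}
    echo $(( ( (1000 - $1 + step - 1) / step ) * $2 ))
}

failures_to_zero 100    # prints 10
seconds_to_full 900 1   # prints 100
```

Under these assumptions a single failure at the default sensitivity takes 100 seconds of clean recovery pings to undo, so a larger sensitivity keeps a flaky interface out of favor for longer.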

Commands

| Command | Purpose |
|---|---|
| lnetctl global show | View global settings like health_sensitivity. |
| lnetctl set health_sensitivity <value> | Set sensitivity. |
| lnetctl net show -v 3 | Show NI health (local). |
| lnetctl peer show -v 3 | Show peer NI health (remote). |
| lnetctl stats show | View failure stats (resend_count, etc.). |
| lnetctl discover <nid> | Force peer re-discovery. |
| lnetctl net add --net <net> --if <interface> --peer-timeout <seconds> | Set peer timeout when adding a network (peer_timeout is a per-network LND tunable, not an lctl parameter). |
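
For scripted checks it can help to pull a single setting out of the lnetctl global show output. The sketch below assumes that output is YAML with one "key: value" pair per line; the helper name global_param and the sample text are hypothetical, and the real output may contain additional keys.

```shell
#!/bin/sh
# global_param KEY
# Read `lnetctl global show`-style YAML on stdin and print KEY's value.
global_param() {
    awk -v key="$1" '$1 == (key ":") { print $2; exit }'
}

# Hypothetical sample standing in for real `lnetctl global show` output.
sample='global:
    retry_count: 2
    transaction_timeout: 30
    health_sensitivity: 100
    recovery_interval: 1'

printf '%s\n' "$sample" | global_param health_sensitivity   # prints 100

# On a live node the same helper would be fed from the real command:
#   lnetctl global show | global_param health_sensitivity
```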

Recovery Mechanisms

Local vs. Remote Failures

| Type | Impact | Recovery |
|---|---|---|
| Local NI | Outbound traffic; decrement local health. | Failover to other local NIs; local resend/no-resend. |
| Remote NI/Peer | Inbound/outbound; detected by timeout. | Remote resend/no-resend; no duplicates; discovery re-pings. |
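
Failover to another local NI can be pictured as picking the interface with the highest health value. The sketch below is a deliberate simplification under that assumption: the helper name and the NIDs are hypothetical, and the real selection logic weighs other criteria as well.

```shell
#!/bin/sh
# pick_healthiest
# Read "NID HEALTH" pairs on stdin and print the NID with the highest
# health value. Ties keep the first NID seen. Simplified model only.
pick_healthiest() {
    awk '
        NR == 1 || ($2 + 0) > best { best = $2 + 0; nid = $1 }
        END { if (NR) print nid }
    '
}

# Hypothetical NIDs: the tcp interface has recorded failures.
printf '%s\n' '10.0.0.1@tcp 700' '10.0.0.1@o2ib 1000' | pick_healthiest
# prints 10.0.0.1@o2ib
```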

Integration with Multi-Rail and Routing

Recent Enhancements

Examples

Basic Tuning

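lnetctl set health_sensitivity 100  # Decrement per failure (0 disables health)
lnetctl set recovery_interval 1     # Seconds between recovery pings
lnetctl set retry_count 3           # Retries for recoverable failures (default is 2)
lnetctl global show                 # Verify the new values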

Monitoring

lnetctl net show -v 3  # Local NI health
lnetctl peer show -v 3 # Remote health
lnetctl stats show     # Failure stats
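
The monitoring commands above can be turned into a simple alert by scanning for NIs whose health has dropped. This sketch assumes the verbose output lists each NI as a "- nid:" entry followed by a "health value:" line; the helper name unhealthy_nis and the sample text are hypothetical, and real output contains many more fields.

```shell
#!/bin/sh
# unhealthy_nis [THRESHOLD]
# Read `lnetctl net show -v 3`-style output on stdin and print
# "NID HEALTH" for every NI whose health value is below THRESHOLD
# (default 1000, the value of a fully healthy interface).
unhealthy_nis() {
    awk -v max="${1:-1000}" '
        $1 == "-" && $2 == "nid:" { nid = $3 }   # remember the current NI
        $1 == "health" && $2 == "value:" && ($3 + 0) < (max + 0) {
            print nid, $3
        }
    '
}

# Hypothetical, trimmed-down sample of the verbose output.
sample='net:
    - net type: tcp
      local NI(s):
        - nid: 10.0.0.1@tcp
          status: up
          health stats:
              health value: 800
        - nid: 10.0.0.2@tcp
          status: up
          health stats:
              health value: 1000'

printf '%s\n' "$sample" | unhealthy_nis   # prints 10.0.0.1@tcp 800

# On a live node:
#   lnetctl net show -v 3 | unhealthy_nis
```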

Router Checker

# /etc/modprobe.d/lustre.conf: these are lnet module parameters, applied when the module loads
options lnet alive_router_check_interval=60 check_routers_before_use=1

Client vs. Server: Both use the same health mechanisms; servers benefit more from router health checks in large clusters.