Lustre High Availability

Lustre achieves high availability (HA) through mechanisms like failover, multi-rail networking, file-level redundancy (FLR), and integration with clustering tools such as Pacemaker. High availability ensures that the filesystem remains accessible and operational even if components fail, minimizing downtime in critical environments like high-performance computing (HPC) clusters. This guide covers core concepts, setup procedures, best practices, commands, and updates for Lustre 2.17.0 (released in December 2025, current as of January 2026), based on the Lustre Operations Manual (updated 2025). For beginners: Lustre HA is essential in large-scale systems where a single failure could halt operations for thousands of nodes. Always plan HA during initial deployment to avoid retrofitting challenges.

Core Concepts

Understanding these concepts is key for beginners: Lustre separates metadata (file names, permissions) from data (file contents), allowing independent scaling and failover. HA focuses on redundancy at each layer to eliminate single points of failure.

Failover: Automatic switch to a backup component (e.g., MDT or OST) when the primary fails. This eliminates single points of failure (SPOF). Supports active/passive (one active, one idle) and active/active (both serving load) modes. Requires shared storage such as a SAN or RAID array. For beginners: think of it like a backup generator that kicks in during a power outage.
Failover Configurations: Active/Passive keeps one node active and the other on standby; simple, but it underutilizes hardware. Active/Active has both nodes serving traffic with resources split (e.g., half the OSTs each); this maximizes utilization but requires careful load balancing. Failout mode returns errors (EIO) instead of blocking I/O during failures; use it sparingly, as it can disrupt applications. Failback returns control to the primary, automatically or manually, after recovery.
Failover for Components: For MDT failover, multiple Metadata Servers (MDS) share the MDT storage; this is typically active/passive, though active/active is possible with different MDTs per MDS. For OST failover, multiple Object Storage Servers (OSS) share OST storage, often active/active with OSTs distributed between them. Failover NIDs are declared with --servicenode (primary plus backups) or --failnode (backups only) so clients know where to reconnect.
Multi-Rail: Uses multiple network interfaces per node for redundancy and bandwidth. Supports hardware (e.g., InfiniBand RDMA) and software multi-rail (LNet, since 2.10). Provides fault tolerance via routing and health checks. Beginners: like having multiple lanes on a highway; if one clogs, traffic reroutes.
FLR (File Level Redundancy): Introduced in 2.11; mirrors files across OSTs for data protection and faster reads. The primary mirror updates first; the others resync later. Well suited to critical data. Warning: resync can consume significant bandwidth, so schedule it during low-usage periods.
MMP (Multiple-Mount Protection): Prevents corruption from simultaneous mounts on shared storage. Uses sequence numbers and delays mounts by 10 seconds. Enable it on all HA targets; essential for shared block devices.
LNet HA: Supports diverse networks (TCP, InfiniBand). Features routing, dynamic discovery (2.10+), asymmetrical route handling (2.13), and health monitoring (2.12+). Ensures network resilience.
LNet Health: Scores interfaces (0-1000). Scores drop on failures (e.g., resends) and recover via pings. Covers both local and remote resends. Tune for sensitive environments.
Integration with Pacemaker: Use with Corosync for cluster management. Handles failure detection, fencing (STONITH via IPMI/BMC), and failover. Requires PowerMan for power control. Beginners: Pacemaker is like an orchestra conductor ensuring all parts play in harmony.
Imperative Recovery (IR): Clients drive recovery after failures; the MGS notifies them of server status. Tune ir_factor and recovery timeouts for faster resumption.
Version-Based Recovery (VBR): Uses inode versions to bridge replay gaps, reducing evictions. Improves stability on flaky networks.
Commit on Share (COS): Commits dependent transactions to disk to avoid chains of evictions. Enable via mdt.*.commit_on_sharing.
Client Eviction: Removes unresponsive clients to protect servers. Invalidates their locks; triggered by failed pings.
Metadata Replay: Replays missed requests after failover using XIDs and transaction numbers.
Reply Reconstruction: Rebuilds lost replies from the XID, transaction number, and stored results.
DNE (Distributed Namespace Environment): Spreads metadata across multiple MDTs. DNE1 adds remote directories; DNE2 adds striped directories (2.8+). Scales metadata performance.
PFL (Progressive File Layout): Composite layouts with delayed instantiation (2.10). Adapts striping as files grow.
SEL (Self-Extending Layout): Extends PFL (2.13). Automatically swaps components on space issues. Policies include extend, spillover, and others.
DoM (Data on MDT): Stores small files on the MDT (2.12). Use lfs setstripe -L mdt; limited by dom_stripesize (1MB default).
LSoM (Lazy Size on MDT): Lazily updates file sizes on the MDT (2.13). Resync happens on open/close or in the background.
OST Pools: Groups OSTs (e.g., by media type or location). Used in FLR to define failure domains. Create with lctl pool_new and lctl pool_add.

Setup

Setting up HA requires careful planning. Start with hardware: Ensure shared storage is reliable. Install Lustre servers first (see installation guides). Use root access for commands.

Failover Setup

Hardware requirements: Shared storage via SAN or RAID. Use RAID 1/10 for MDT (metadata-critical), RAID 5/6 for OST (data). Avoid running clients on MDS/OSS nodes to prevent resource contention. HA software like Pacemaker handles node failures via fencing.
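As a sketch of the Pacemaker side, the commands below define fencing and a filesystem resource for one failover pair. Everything here is illustrative: the resource names, node name, IPMI address, and device path are assumptions, and the `run` wrapper only prints each `pcs` command so the sketch can be read (and tested) without a cluster; drop the `echo` on a real deployment.

```shell
#!/bin/sh
# Hypothetical Pacemaker setup for one OSS failover pair (names/IPs invented).
# run() prints instead of executing so the sketch is safe to inspect.
run() { echo "$@"; }

pcs_sketch() {
  # Fencing first: STONITH via IPMI, so a hung node is powered off before
  # its shared OST can be mounted elsewhere (prevents double-mount damage).
  run pcs stonith create fence-oss1 fence_ipmilan ip=10.0.0.101 username=admin
  # The ocf:heartbeat:Filesystem agent mounts the shared target on whichever
  # node currently owns the resource.
  run pcs resource create ost1 ocf:heartbeat:Filesystem \
      device=/dev/sdX directory=/mnt/ost1 fstype=lustre
  # Prefer the primary node; Pacemaker moves ost1 to the peer on failure.
  run pcs constraint location ost1 prefers oss1=100
}
pcs_sketch
```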

# Format with failover
# Explanation: Specifies service NIDs for primary and failover nodes during formatting.
mkfs.lustre --servicenode=nid1,nid2 --ost ... /dev/sdX

# Update existing target
# Explanation: Adds or modifies failover NIDs on an unmounted device.
tunefs.lustre --servicenode=nid1,nid2 /dev/sdX

# Mount with multiple nodes
# Explanation: Clients mount using all possible NIDs for automatic failover.
mount -t lustre nid1@tcp:nid2@tcp:/fs /mnt/lustre

# Enable failout mode (rare)
# Explanation: Returns errors instead of blocking—use only if applications handle EIO gracefully.
mkfs.lustre --param=failover.mode=failout --ost ... /dev/sdX
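To make the client mount syntax above concrete, this small sketch just assembles the colon-separated failover source string from two MGS NIDs; the NIDs and filesystem name are placeholder examples.

```shell
#!/bin/sh
# Build the client mount source "mgsnid1:mgsnid2:/fsname" from failover NIDs.
# Pure string assembly; the example NIDs below are invented.
build_mount_source() {
  primary=$1 backup=$2 fsname=$3
  printf '%s:%s:/%s\n' "$primary" "$backup" "$fsname"
}

# e.g. mount -t lustre "$(build_mount_source 10.0.0.1@tcp 10.0.0.2@tcp testfs)" /mnt/lustre
build_mount_source 10.0.0.1@tcp 10.0.0.2@tcp testfs
# -> 10.0.0.1@tcp:10.0.0.2@tcp:/testfs
```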

Multi-Rail Setup

Configure multiple interfaces for redundancy. Beginners: LNet is Lustre's networking layer; multi-rail bonds them logically.

# LNet Configuration (Lustre 2.7+)
# Explanation: Loads LNet and adds networks with multiple interfaces.
modprobe lnet
lnetctl lnet configure
lnetctl net add --net tcp0 --if eth0,eth1
lnetctl peer add --prim_nid nid --nid nid1,nid2
lnetctl route add --net tcp2 --gateway nid
# YAML Import/Export
# Explanation: Use YAML for consistent configs across nodes.
lnetctl import config.yaml
lnetctl export config.yaml
# Dynamic Discovery (2.11+)
# Explanation: Automatically discovers peer interfaces.
lnetctl set discovery 1
lnetctl discover <peer_nid>
# Asymmetrical Routes (2.13)
# Explanation: Drops asymmetric routes for security.
lnetctl set drop_asym_route 1
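For the YAML import step, a minimal multi-rail fragment might look like the sketch below. It approximates the shape of `lnetctl export` output, but key names and nesting can vary between versions, so export the config from an already-configured node and adapt it rather than copying this verbatim.

```yaml
# Illustrative config.yaml for `lnetctl import` (structure is an
# approximation of `lnetctl export` output; verify on your version).
net:
    - net type: tcp
      local NI(s):
          - interfaces:
                0: eth0
          - interfaces:
                0: eth1
```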

FLR Setup

File mirroring for redundancy. Place mirrors in separate failure domains (e.g., different OSSs).

# Create Mirrored File
# Explanation: Creates a file with 2 mirrors, 4M stripe, etc. -p specifies pools.
lfs mirror create -N 2 -S 4M -c 2 -p flash -N -c -1 -p archive /mnt/lustre/file1
# Mirror Operations
# Explanation: Extend adds mirrors; split removes; resync updates stale ones; verify checks integrity.
lfs mirror extend -N 2 /mnt/lustre/file1
lfs mirror split --mirror-id 1 -d /mnt/lustre/file1
lfs mirror resync /mnt/lustre/file1
lfs mirror verify -v --only 1 /mnt/lustre/file1
lfs find dir --mirror-count +1
# OST Pools
# Explanation: Groups OSTs for targeted placement.
lctl pool_new testfs.pool1
lctl pool_add testfs.pool1 OST[0-10]
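The resync step can be wrapped in a small dry-run script that lists mirrored files (via `lfs find --mirror-count`) and prints the resync commands it would issue. The `LFS` override is an invented hook, added purely so the logic can be exercised without a Lustre client; this is a sketch, not a hardened tool.

```shell
#!/bin/sh
# Print (rather than run) a resync command for every mirrored file under
# a directory. ${LFS:-lfs} lets a test substitute a stub for the lfs tool.
resync_all() {
  dir=$1
  ${LFS:-lfs} find "$dir" --mirror-count +1 | while read -r f; do
    echo "lfs mirror resync $f"
  done
}
```

Pipe the output to `sh` once you have verified it, or replace the `echo` to execute directly.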

MMP Setup

Protects against double-mounts. Always enable on shared devices.

# Enable
# Explanation: Adds MMP feature to the filesystem.
tune2fs -O mmp /dev/sdX

# Disable
# Explanation: Removes MMP (not recommended for HA).
tune2fs -O ^mmp /dev/sdX

# Check
# Explanation: Verifies MMP status.
e2mmpstatus /dev/sdX
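A mount wrapper can make the MMP check explicit before touching a shared device. This is a hedged sketch: it assumes `e2mmpstatus` exits zero when the device appears safe to mount, the `E2MMP` override is an invented hook so the logic is testable without a real device, and the wrapper prints the mount command instead of running it.

```shell
#!/bin/sh
# Refuse to (pretend to) mount a shared target unless MMP says it is unused.
# E2MMP defaults to e2mmpstatus, assumed to exit 0 when the device is free.
safe_mount() {
  dev=$1 mnt=$2
  if ${E2MMP:-e2mmpstatus} "$dev" >/dev/null 2>&1; then
    echo "mount -t lustre $dev $mnt"   # dry run: print instead of mounting
  else
    echo "refusing $dev: MMP reports it may be mounted elsewhere" >&2
    return 1
  fi
}
```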

LNet Health Setup

Tune for proactive failure detection.

lnetctl set health_sensitivity 100
lnetctl set recovery_interval 1
lnetctl set transaction_timeout 30
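The three settings above can be applied together in one pass; the loop below does so, and the `LNETCTL` override is an assumption added so the loop can be verified without LNet loaded.

```shell
#!/bin/sh
# Apply the LNet health tunables from the text in one pass.
# ${LNETCTL:-lnetctl} allows substituting a stub (e.g. echo) for testing.
apply_health_tuning() {
  for kv in "health_sensitivity 100" "recovery_interval 1" "transaction_timeout 30"; do
    # Word splitting of $kv into "key value" is intentional here.
    ${LNETCTL:-lnetctl} set $kv
  done
}
```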

Best Practices

These expand on core recommendations for reliable operations.

Storage: Use RAID 6 for OSTs with hot spares; monitor via mdadm. Back up MDT metadata weekly. Overprovision space by 20% for overhead.
HA: Deploy UPS for power stability. Use Pacemaker with fencing. Disable writeback cache on HBAs. Allocate ample RAM for journals (MDS: 128GB+, OSS: 64GB+).
Network: Dedicate NICs to Lustre; bond for redundancy. Use NTP. Isolate from client access if possible.
Multi-Rail: Prefer RDMA. Order ip2nets rules carefully. Enable auto_down and router checks. Test with lnetctl ping.
FLR: Place mirrors in different failure domains (OSTs, OSSs, racks). Use 'prefer' for faster media. Schedule resyncs off-peak.
MMP: Always enable on HA targets to prevent corruption.
LNet: Use IPs over hostnames. Minimize lustre.conf comments. Validate configs with lnetctl show.
General: Overprovision MDT inodes/space. Monitor with lfs df -i. Integrate with monitoring tools like Nagios.
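The `lfs df -i` monitoring suggestion can be automated with a small filter. The column layout assumed in the comment is illustrative and real output varies between Lustre versions, so treat this as a sketch to adapt.

```shell
#!/bin/sh
# Flag MDTs whose inode usage meets or exceeds a threshold, reading
# `lfs df -i`-style lines on stdin. Assumed columns (may differ by version):
#   UUID  Inodes  IUsed  IFree  IUse%  Mounted-on
check_inodes() {
  threshold=$1
  awk -v t="$threshold" '/MDT/ { gsub(/%/, "", $5); if ($5 + 0 >= t) print $1, $5 "%" }'
}

# Usage sketch: lfs df -i | check_inodes 85
```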

Commands

Failover & Configuration

# Mark an OST as degraded (e.g., during a RAID rebuild) so the MDS deprioritizes it for new allocations
lctl set_param obdfilter.*.degraded=1

LNet Management (lnetctl)

Networks: lnetctl net add/del/show --net tcp0 --if eth0
Peers: lnetctl peer add/del/show --prim_nid nid --nid nid1,nid2
Routes: lnetctl route add/del/show --net tcp2 --gateway nid
Routing: lnetctl set routing 1
Health: lnetctl show health
Discovery: lnetctl set discovery 1
Asym Routes: lnetctl set drop_asym_route 1
YAML: lnetctl import / lnetctl export file.yaml
Stats: lnetctl stats show

FLR Operations

lfs mirror create -N 2 file
lfs mirror extend -N --pool hdd file
lfs mirror split --mirror-id 1 file
lfs mirror resync file
lfs mirror verify file
lfs find dir --mirror-count +1

Recent Updates (Lustre 2.8–2.17+)

Lustre evolves rapidly; check release notes for your version.

2.8: Striped directories (DNE2); multiple reply data slots per client.
2.9: llapi_path2fid, llapi_ladvise, recovery_time_soft/hard.
2.10: Software multi-rail, YAML config, dynamic discovery, PFL.
2.11: FLR introduced, with lfs mirror operations; llsom_sync; multi-rail routing; default quotas.
2.12: LNet health monitoring, DoM, dom_stripesize, tab completion for lctl.
2.13: Asymmetrical routes, SEL, LSoM, pool quotas.
2.14: Jobstats, default directory striping, pool quota interoperability, client encryption.
2.15: Nodemap project IDs, lljobstat.
2.16: Adaptive timeouts, del_ost, session-based JobID, lod.*.max_mdt_stripecount.
2.17: Nodemap ID offset ranges.

Additional Tips

For troubleshooting: Enable debug logging with lctl set_param debug=+ha. Join Lustre mailing lists for community support. Consider training or certification for complex setups. In cloud environments, use managed services for shared storage if available. Always benchmark post-setup with tools like IOR or mdtest to verify performance under HA.