Lustre Failover Examples
Lustre failover provides high availability by allowing a backup server to take over MDT (Metadata Target) or OST (Object Storage Target) services when the primary fails. The examples below use shared storage (e.g., SAN/RAID) and assume HA tools such as Pacemaker for automatic switchover. They are based on Lustre 2.17.0 (January 2026), with principles from the Lustre Operations Manual (updated 2025). Failover does not provide data redundancy; use RAID or FLR for that. For beginners: failover is like having a spare tire. If the main one fails, you switch to the spare and keep going. This is crucial in HPC environments, where downtime can cost thousands in lost compute time.
Warnings:
- Improper failover setup can lead to data corruption, especially without proper fencing (STONITH) to prevent split-brain scenarios where both nodes access the same storage simultaneously.
- Always test configurations in a lab environment before production; simulate failures to verify recovery times and data integrity.
- Shared storage must be highly reliable; failures there aren't covered by Lustre failover—use hardware RAID with monitoring.
- Reboots and unmounts during setup cause brief outages; plan maintenance windows.
- Ensure all nodes have identical Lustre versions and configurations to avoid compatibility issues during switchover.
Additional Best Practices:
- Implement monitoring with tools like Nagios or Prometheus to alert on node health before failures occur.
- Use virtual IPs for MDS/OSS to simplify client remounts during failover.
- Document failover procedures, including manual steps for emergencies when HA software fails.
- Combine failover with multi-rail networking for network redundancy.
- Regularly back up configuration files and test restores.
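The monitoring advice above can be sketched as a small probe. This is a minimal sketch, not an official plugin: it reads mount-table text (e.g., /proc/mounts) on stdin so it can be tested offline, and the mount point is a hypothetical example.

```shell
#!/bin/sh
# Minimal health probe (sketch): succeeds if a Lustre entry for the given
# mount point appears in the mount table read from stdin.

lustre_mounted() {
    # $1 = mount point; /proc/mounts fields are space-separated:
    # device mountpoint fstype options dump pass
    grep -q " $1 lustre "
}

# Typical use on a server (not run here); mount point is a placeholder:
#   lustre_mounted /mnt/testfs-mdt0000 < /proc/mounts || exit 2  # alert
```

A wrapper like this plugs directly into Nagios-style checks, where a nonzero exit signals a problem.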
Prerequisites
Before setting up failover, ensure the following. These steps prepare your environment to avoid common pitfalls.
- Shared storage accessible by primary and failover nodes (e.g., /dev/sdX for MDT). Beginners: Shared storage means both servers can see the same disk, like a network-attached drive.
- Identical Lustre versions on all nodes. Check with lctl get_param version.
- HA software (e.g., Pacemaker/Corosync) for monitoring, fencing, and resource management. This automates the switchover.
- MMP (Multiple-Mount Protection) enabled for corruption protection: tune2fs -O mmp /dev/sdX. This prevents accidental double-mounts.
- LNet NIDs configured (e.g., 192.168.1.1@tcp for the primary, 192.168.1.2@tcp for the failover node). Use lctl list_nids to verify.
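The prerequisite checks can be scripted as a short pre-flight pass. A minimal sketch under stated assumptions: the helpers are pure text checks so they work offline, and the node names, device path, and ssh access are hypothetical.

```shell
#!/bin/sh
# Pre-flight checks (sketch) for a two-node failover pair.

same_version() {
    # Succeeds if two Lustre version strings match exactly.
    [ "$1" = "$2" ]
}

has_mmp() {
    # Reads `tune2fs -l <device>` output on stdin; checks the feature list for mmp.
    grep -q '^Filesystem features:.*mmp'
}

# Typical use (not run here); mds1/mds2 and /dev/sdX are placeholders:
#   v1=$(ssh mds1 lctl get_param -n version)
#   v2=$(ssh mds2 lctl get_param -n version)
#   same_version "$v1" "$v2" || echo "WARNING: version mismatch"
#   ssh mds1 tune2fs -l /dev/sdX | has_mmp || echo "WARNING: MMP not enabled"
```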
MDT Failover Examples
MDT handles metadata operations, so failover here ensures quick recovery of file listings and permissions. Active/active uses DNE (Distributed Namespace Environment) for scaling.
Active/Active MDT Failover (with DNE)
Two active MDS nodes, with one active MDT on each. Failover node takes over on failure. This maximizes hardware use but requires careful balancing.
# Format MDT0 with failover
# Explanation: --fsname sets the filesystem name, --mgs for management if combined, --mdt for metadata target, --index assigns ID, --servicenode specifies primary and failover NIDs.
mkfs.lustre --fsname=testfs --mgs --mdt --index=0 --servicenode=mds1@tcp --servicenode=mds2@tcp /dev/sdX
# Format MDT1 with separate failover
# Explanation: Similar, but index=1 for second MDT; nodes swapped for load balancing.
mkfs.lustre --fsname=testfs --mdt --index=1 --servicenode=mds2@tcp --servicenode=mds1@tcp /dev/sdb
# Mount on respective primaries
# Explanation: Creates mount point and mounts the target. Do this on the primary node for each.
mkdir /mnt/testfs-mdt0000
mount.lustre /dev/sdX /mnt/testfs-mdt0000 # On mds1
mkdir /mnt/testfs-mdt0001
mount.lustre /dev/sdb /mnt/testfs-mdt0001 # On mds2
# On failover node (passive, do not mount until failover)
# Pacemaker config example (simplified)
# Explanation: One resource agent per target manages the mount. The Lustre-specific agent (ocf:lustre:Lustre, from lustre-resource-agents) also monitors target health; the generic ocf:heartbeat:Filesystem agent works as well. Use one of the two per target, not both.
pcs resource create mdt0 ocf:lustre:Lustre target="/dev/sdX" mountpoint="/mnt/testfs-mdt0000" op monitor interval=60s
# Explanation: A group lets related resources (e.g., a virtual IP plus the target) fail over together.
pcs resource group add mdt0_group mdt0
pcs resource meta mdt0_group target-role="Started"
# As above, create a separate resource and group for each MDT
# Client mount with failover NIDs
# Explanation: Clients specify multiple NIDs for automatic reconnection on failure.
mount -t lustre mds1@tcp:mds2@tcp:/testfs /mnt/testfs
After setup, verify with lfs df -h on clients. If there are issues, check kernel logs with dmesg | grep Lustre.
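The failover-aware client mount can also be made persistent via /etc/fstab. A sketch assuming the NIDs above; retry is the mount.lustre option that retries the initial mount, and _netdev delays mounting until the network is up:

```
mds1@tcp:mds2@tcp:/testfs  /mnt/testfs  lustre  defaults,_netdev,retry=2  0 0
```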
OST Failover Examples
OST failover focuses on data storage; active/active distributes load for better performance.
Active/Active OST Failover
Two OSS nodes share OSTs, each serving half; on failure, the survivor takes over all of them. This is efficient for large clusters.
# Format OST with failover
# Explanation: --ost for object storage, --mgsnode points to MGS, --servicenode for failover.
mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=mgs@tcp --servicenode=oss1@tcp --servicenode=oss2@tcp /dev/sdc
# Mount on oss1 (active for this OST)
# Explanation: Mount on the primary OSS.
mkdir /mnt/testfs-ost0000
mount.lustre /dev/sdc /mnt/testfs-ost0000
# Pacemaker config
# Explanation: As for the MDT, one resource agent per target manages the mount.
pcs resource create ost0 ocf:lustre:Lustre target="/dev/sdc" mountpoint="/mnt/testfs-ost0000" op monitor interval=60s
pcs resource group add ost0_group ost0
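To spread load in the active/active pair, location constraints can express each OST's preferred node. A sketch with illustrative scores; the node names follow the example above:

```shell
# Prefer oss1 for ost0_group, but allow it to run on oss2 after a failure.
pcs constraint location ost0_group prefers oss1=100
pcs constraint location ost0_group prefers oss2=50
```

Mirror the scores for the OSTs primarily served by oss2 so each node carries half the load in normal operation.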
Manual Failover Test
Simulate failure and switch. Useful for testing without HA software.
# On primary: Unmount and stop
# Explanation: Cleanly unmount to avoid errors.
umount /mnt/testfs-ost0000
# On failover: Mount
# Explanation: Mount on the backup node.
mount.lustre /dev/sdc /mnt/testfs-ost0000
# Mark degraded (optional during RAID rebuild, prevents new allocations)
# Explanation: Sets degraded flag to avoid writing new data during recovery.
lctl set_param obdfilter.testfs-OST0000.degraded=1
Monitor recovery with lctl get_param obdfilter.*.recovery_status.
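Recovery progress can also be checked programmatically by parsing that output. A minimal sketch: the field names (status, completed_clients) reflect typical recovery_status output, but exact formatting can vary by version, so treat the parsing as an assumption.

```shell
#!/bin/sh
# Extract the status field (e.g., "COMPLETE", "RECOVERING") from
# recovery_status text read on stdin.

recovery_state() {
    awk '/^status:/ { print $2; exit }'
}

# Typical use on an OSS (not run here):
#   lctl get_param -n obdfilter.testfs-OST0000.recovery_status | recovery_state
```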
Integration with Pacemaker
Pacemaker handles automatic failover. It's a cluster resource manager that detects failures and migrates resources.
# Install
# Explanation: Installs Pacemaker and dependencies on RHEL-like systems.
dnf install pacemaker corosync pcs fence-agents-all
# Configure cluster
# Explanation: Authenticates nodes and sets up the cluster. This is pcs 0.10+ syntax (RHEL 8/9); older pcs 0.9 used "pcs cluster auth" and "pcs cluster setup --name".
pcs host auth node1 node2
pcs cluster setup lustre_ha node1 node2
pcs cluster start --all
pcs cluster enable --all
# Add STONITH (fencing)
# Explanation: Configures IPMI fencing to power off failed nodes, preventing corruption. Each node has its own BMC, so create one fence device per node.
pcs stonith create fence_node1 fence_ipmilan pcmk_host_list="node1" ipaddr="node1_bmc_ip" login="admin" passwd="pass" lanplus=1
pcs stonith create fence_node2 fence_ipmilan pcmk_host_list="node2" ipaddr="node2_bmc_ip" login="admin" passwd="pass" lanplus=1
# Add Lustre resources (as above)
Monitor: pcs status. For SLES, use zypper instead of dnf. Test by killing processes or pulling cables.
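Fencing itself can be exercised before relying on it; pcs can fence a node on demand. A sketch for a maintenance window only, since the target node is power-cycled (node name follows the example above):

```shell
# Manually fence node2, then watch resources migrate to the surviving node.
pcs stonith fence node2
pcs status
```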
Best Practices
These practices ensure reliable failover operations.
- Use RAID for underlying storage redundancy for ldiskfs. For example, RAID 6 for OSTs to handle multiple disk failures.
- Enable MMP for ldiskfs and ZFS to prevent multiple mounts. Check with e2mmpstatus /dev/sdX.
- Test failover regularly with simulated failures. Use pcs resource move to trigger a switchover.
- Combine with FLR for data-level redundancy. Mirror critical files across OSTs.
- Tune timeouts: lctl set_param timeout=300. Adjust based on network latency.
- Avoid single points of failure in network/storage. Use bonded NICs and redundant paths.
For recovery tuning, see manual sections on IR (Imperative Recovery) and VBR (Version-Based Recovery). No major failover changes in 2.17, but check release notes for minor fixes.
Additional Tips
For large-scale deployments, consider active/active for all components to utilize hardware fully. Enable debug logging during tests: lctl set_param debug=+ha. Join Lustre community forums for real-world advice. In cloud setups, use managed shared storage like EBS with multi-attach. Always benchmark post-failover performance with IOR to ensure no degradation.