Linked Presentation: Minder: Faulty Machine Detection for Large-scale Distributed Model Training
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development