Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Swaminathan Sundararaman, Xing Lin, and Robert Ricci
Understanding fault models is an important criterion for building robust systems. Decades of research have developed mature failure models such as fail-stop, fail-partial, fail-transient, and Byzantine failures. We highlight an understudied “new” failure type: fail-slow hardware, i.e., hardware that is still running and functional but in a degraded mode, slower than its expected performance. We found that all major hardware components can exhibit fail-slow faults. For example, disk throughput can drop by three orders of magnitude to 100 KB/s due to vibration; CPUs can unexpectedly run at half speed due to lack of power; and network card performance can collapse to the Kbps level due to buffer corruption and retransmission.