Yuliang Li, Harvard University; Rui Miao, Alibaba Group; Mohammad Alizadeh, Massachusetts Institute of Technology; Minlan Yu, Harvard University
TCP performance problems are notoriously tricky to diagnose because a subtle choice of TCP parameters or features may lead to completely different performance. The gold standard for diagnosis is to collect packet traces and trace TCP executions. However, such tools are not easy to use in large-scale data centers, where many TCP connections interact with each other. In this paper, we introduce DETER, a deterministic TCP replay tool, which runs lightweight recording all the time at all the hosts and then replays selected connections, during which operators can collect packet traces and trace TCP executions for diagnosis. The key challenge for deterministic TCP replay is the butterfly effect: a small timing variation causes a chain reaction between TCP and the network that drives the system to a completely different state in the replay. To eliminate the butterfly effect, we propose to replay each TCP connection separately and to capture all the interactions between a connection and both the applications and the network. Our evaluation shows that DETER has low recording overhead and can help diagnose many TCP performance problems, ranging from long latency related to zero-window probes, late fast retransmission, and frequent retransmission timeouts to problems related to the switch shared buffer.
NSDI '19 Open Access Sponsored by NetApp