
Packet Transmit Overheads


Table 1: Distribution of CPU time during network transmission. The largest overheads are I/O space accesses requiring a world switch to the VMApp and the time spent handling them once in the VMApp.

Total Time
  Category                                        Percent Time   Average Time
  VMM Time                                        77.3%          N/A
  Transmitting via the VMNet                       8.7%          13.8 $\mu$s
  Emulating the Lance status register              4.0%           3.1 $\mu$s
  Handling host IRQs (device interrupts)           3.4%          N/A
  Emulating the Lance transmit path                3.3%           5.2 $\mu$s
  Receiving via the VMNet                          0.8%           1.8 $\mu$s

VMM Time
  Category                                        Percent Time   Average Time
  IN/OUTs requiring switching to the VMApp        26.8%           7.45 $\mu$s
  Instructions not requiring virtualization       22.0%          N/A
  General instructions requiring virtualization   11.6%          N/A
  IN/OUTs handled in the VMM                       8.3%           1.36 $\mu$s
  IN/OUTs to the Lance Address Port                8.1%           0.74 $\mu$s
  Transitioning to/from virtualization code        4.8%          N/A
  Virtualizing the IRET instruction                4.8%           3.93 $\mu$s
  Delivering virtual IRQs (device interrupts)      4.6%          N/A


The first series of experiments investigates the behavior of sustained virtual machine TCP transmits from VM/PC-733 to PC-350. We configure nettest to send 100 megabytes using 4096-byte read()s and write()s. With VMware Workstation 2.0, we find that the workload is CPU bound, with an average throughput over 30 consecutive runs of 64 Mb/s. The workload is then instrumented to determine where the CPU time is spent.

The first instrumentation gauges the time spent transmitting a packet by reading the Pentium processor's Time Stamp Counter (TSC) register [10] at key points during the virtualization of the OUT instruction that triggers a packet transmission. The TSC yields the total cycle count of the path, plus internal breakdowns of interesting subsegments (a sketch of this style of timing follows below). Figure 5 presents the latency along the instrumented network transmit path on PC-733. It takes a total of 0.57 + 4.45 + 1.23 + 17.55 = 23.80 $\mu$s from the start of the OUT instruction until the return from the VMNet system call that puts the packet on the wire. End-to-end, it takes 31.63 $\mu$s from the start of the OUT instruction that triggers a packet transmission until control is returned to the virtual machine and the next guest OS instruction is executed. Of those 31.63 $\mu$s, 30.65 $\mu$s is spent in world switches and in the host world. Assuming the 17.55 $\mu$s of VMNet driver time in the host world is dominated by the unavoidable cost of actually transmitting the packet, the hosted virtualization architecture imposes 30.65 - 17.55 = 13.10 $\mu$s of overhead that would not be present if the VMM talked directly to the host NIC.

This overhead alone does not explain why the workload is CPU bound. At 31.63 $\mu$s per 1520-byte packet, it takes only roughly 0.26 seconds to transmit 100 megabits. Each packet transmission actually involves a series of 11 other IN/OUT instructions issued by the guest Ethernet driver, as well as interrupt processing and virtualization overheads.
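As an illustration of the TSC-based instrumentation described above, the following is a minimal sketch of this style of path timing. The stage functions and their names are hypothetical stand-ins, and the cycle-to-microsecond conversion simply assumes a 733 MHz processor like PC-733; this is not the actual VMware Workstation instrumentation.

  /*
   * Sketch: timestamp the stages of a virtualized packet transmit with the
   * Pentium Time Stamp Counter.  The stage functions below are empty
   * placeholders for illustration only.
   */
  #include <stdint.h>
  #include <stdio.h>

  static inline uint64_t rdtsc(void)
  {
      uint32_t lo, hi;
      __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
      return ((uint64_t)hi << 32) | lo;
  }

  /* Hypothetical stages of the OUT -> VMNet transmit path. */
  static void emulate_lance_out(void)    { /* decode OUT, set up transmit state */ }
  static void world_switch_to_host(void) { /* save VMM context, enter host world */ }
  static void vmapp_prepare_packet(void) { /* VMApp copies the frame for the host */ }
  static void vmnet_send_syscall(void)   { /* system call into the VMNet driver */ }

  int main(void)
  {
      uint64_t t[5];

      t[0] = rdtsc();  emulate_lance_out();
      t[1] = rdtsc();  world_switch_to_host();
      t[2] = rdtsc();  vmapp_prepare_packet();
      t[3] = rdtsc();  vmnet_send_syscall();
      t[4] = rdtsc();

      /* Convert cycles to microseconds assuming a 733 MHz clock. */
      const double mhz = 733.0;
      for (int i = 0; i < 4; i++)
          printf("stage %d: %.2f us\n", i, (double)(t[i + 1] - t[i]) / mhz);
      printf("total:   %.2f us\n", (double)(t[4] - t[0]) / mhz);
      return 0;
  }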
To investigate these other overheads, the next set of experiments uses time-based sampling to profile the distribution of time spent in the VMM and VMApp over the entire workload. The samples measure the percentage of time spent in each code section and, when available, the number of samples that hit a section. This gives a more comprehensive picture of the overheads present in transmitting packets and reveals some unnecessarily expensive paths. Table 1 summarizes the largest categories.

The profile shows that more than a quarter of the time in the VMM is spent preparing to call the VMApp because of an I/O instruction, recording the result, and then returning to the virtual machine. Additionally, each of those transitions also costs a world switch from the VMM to the host and back, which was calculated at around 8.90 $\mu$s on PC-733 above (the switch time is part of the 77.3% spent running the VMM, but not part of any of the VMM Time numbers). Given that an I/O instruction on native hardware completes in a matter of tens of cycles, this is easily two orders of magnitude slower.

The other significant source of overhead is spread through the categories in Table 1: IRQ processing. The virtual AMD Lance NIC, as well as the physical Intel EtherExpress NIC, raises an IRQ (device interrupt) on every packet sent and received. Thus, the interrupt rate on the machine is very high for network-intensive workloads. On a hosted architecture, each IRQ that arrives while executing in the VMM world runs the VMM's interrupt handler and then switches to the host world. The host world runs the host OS's interrupt handler for that IRQ and passes control to the VMApp to process any resulting actions. If the IRQ pertains to the guest (e.g., the IRQ indicates that a packet destined for the guest was received), the VMApp must then deliver a virtual IRQ to the guest OS. This involves switching back to the VMM world, delivering an IRQ to the virtual machine, and running the guest OS's interrupt handler. This magnifies the cost of an IRQ, since the VMM and host interrupt handlers as well as the guest interrupt handler are all run. Additionally, virtual interrupt handling routines execute privileged instructions that are expensive to virtualize. In Table 1, most of the IN/OUTs handled in the VMM are accesses to the virtual interrupt controller, and the majority of the IRET instructions are the guest interrupt handler finishing. Note also that the cost of servicing an interrupt taken in the VMM world is much higher than servicing one taken in the host world, due to the VMM interrupt handler and a world switch back to the host.

Yet another overhead in the hosted architecture, not apparent from the raw profile, is the inability of the VMApp and VMM to distinguish a hardware interrupt that produces an event for the virtual machine (e.g., a packet destined for the guest was received) from one that is unrelated to the virtual machine. Only the host OS and its drivers can determine that. This leads to a balancing act: the VMApp can either do nothing when the VMM returns to the VMApp on an IRQ, or it can call select(). Calling select() too frequently is wasteful, whereas calling select() too infrequently may cause harmful delays in handling network I/O events. A sketch of this trade-off follows.
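The sketch below illustrates the polling trade-off as a simple policy in the VMApp's IRQ-return path. The file descriptor, the poll-every-N heuristic, and handle_vmnet_events() are hypothetical illustrations under assumed names, not the actual VMApp implementation.

  /*
   * Sketch: after an IRQ forces a return to the VMApp, the VMApp may either
   * skip checking for network events or call select() on the VMNet device.
   * Polling on every IRQ wastes time in select(); polling too rarely delays
   * packet delivery.  A simple (hypothetical) compromise: poll every Nth IRQ.
   */
  #include <stdio.h>
  #include <sys/select.h>

  static int vmnet_fd = 0;               /* placeholder fd standing in for the VMNet device */

  static void handle_vmnet_events(void)  /* stand-in for delivering pending packets */
  {
      puts("network event ready");
  }

  static void vmapp_on_irq_return(void)
  {
      static unsigned irqs_since_poll;

      if (++irqs_since_poll < 8)         /* hypothetical heuristic: poll every 8th host IRQ */
          return;
      irqs_since_poll = 0;

      fd_set rfds;
      struct timeval zero = { 0, 0 };    /* zero timeout: non-blocking poll */

      FD_ZERO(&rfds);
      FD_SET(vmnet_fd, &rfds);
      if (select(vmnet_fd + 1, &rfds, NULL, NULL, &zero) > 0)
          handle_vmnet_events();
  }

  int main(void)
  {
      for (int i = 0; i < 20; i++)       /* simulate 20 host IRQ returns */
          vmapp_on_irq_return();
      return 0;
  }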