This paper describes VMware's hosted virtual machine architecture as implemented in VMware Workstation. This architecture enables VMware Workstation to support a wide variety of PC hardware without special device drivers and to present a constant and hence portable virtual hardware environment. Additionally, co-existing with a commodity operating system simplifies installation and use for users and reduces the complexity of the virtual machine monitor component for the developers.
The hosted architecture splits its functionality between a VMM component that virtualizes the CPU, and a VMApp component that runs as a normal application on a host OS and handles I/O to the native devices on behalf of a virtual machine. I/O-intensive workloads, in addition to running significant amounts of privileged code, require heavy-weight world switches from the VMM back to the VMApp on the host. While this overhead is unimportant for low-bandwidth devices like keyboards or mice, it can potentially prevent more demanding devices from achieving the same I/O saturation as their native counterparts. This paper focuses specifically on NIC virtualization. It presents optimizations to VMware Workstation 2.0 that allow a virtual machine hosted on a 733 MHz Pentium III CPU to saturate the network without becoming CPU bound.
The key strategy behind all the implemented optimizations is to reduce the number of world switches. The first optimization takes advantage of the fact that only a fraction of the I/O accesses to the virtual NIC cause packets to be transmitted. The remainder do not require any access to the host hardware, allowing the VMM to handle them directly instead of switching back to the host world. This optimization alone reduces CPU utilization to the point where the network link is completely saturated on a 733 MHz CPU.
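To make the effect of the first optimization concrete, the following is a minimal counting model, not the VMM's actual code. The port numbers, the TRANSMIT_PORT constant, and the SwitchCounter class are hypothetical names introduced only for illustration; the model assumes that exactly one port triggers a packet transmission and that every other access can be emulated without touching host hardware.

```python
from dataclasses import dataclass

# Hypothetical port role, not the register map of any real NIC.
TRANSMIT_PORT = 0x10  # assumed: an access here hands a packet to the host


@dataclass
class SwitchCounter:
    """Counts world switches caused by virtual NIC port accesses."""
    handle_in_vmm: bool       # True models the first optimization
    world_switches: int = 0   # accesses that required the host world
    handled_in_vmm: int = 0   # accesses emulated without a world switch

    def port_access(self, port: int) -> None:
        needs_host = (port == TRANSMIT_PORT)
        if needs_host or not self.handle_in_vmm:
            # Transmissions always need the host; without the
            # optimization every access is forwarded to the host.
            self.world_switches += 1
        else:
            self.handled_in_vmm += 1


def count_switches(accesses, handle_in_vmm: bool) -> int:
    """Replay a sequence of port accesses and return the switch count."""
    counter = SwitchCounter(handle_in_vmm)
    for port in accesses:
        counter.port_access(port)
    return counter.world_switches


# Assumed workload: three bookkeeping accesses per transmitted packet.
accesses = [0x00, 0x04, 0x08, TRANSMIT_PORT] * 5
baseline = count_switches(accesses, handle_in_vmm=False)
optimized = count_switches(accesses, handle_in_vmm=True)
```

In this toy workload the number of world switches drops from one per port access to one per transmitted packet. The three-to-one ratio of bookkeeping accesses to transmissions is an assumption of the example, not a measured figure.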
The second optimization reduces the remaining world switches and trims their overhead. When the world switch rate is high enough, rather than switching back to the VMApp immediately to send each packet, the VMM gathers up to 3 packets at a time before switching back to the VMApp to send them all at once. An extra benefit of this clustering is that transmit IRQs from the native NIC become more likely to arrive in the host world (while sending successive packets) than in the VMM world, where they would require an immediate world switch.
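The batching idea behind the second optimization can be sketched as a small queue model. This is an illustrative simplification under stated assumptions: the SendCombiner class and its methods are hypothetical, the check that the world switch rate is "high enough" is reduced to a constructor argument, and the explicit flush() call stands in for whatever mechanism bounds how long a queued packet may wait.

```python
class SendCombiner:
    """Toy model: queue packets and send them in one world switch."""

    def __init__(self, max_batch: int = 3):
        # max_batch=3 mirrors the "up to 3 packets" in the text;
        # max_batch=1 models sending each packet immediately.
        self.max_batch = max_batch
        self.pending = []         # packets waiting in the VMM
        self.sent = []            # packets handed to the host, in order
        self.world_switches = 0   # switches taken to send packets

    def transmit(self, packet) -> None:
        """Queue a packet; switch to the host once the batch is full."""
        self.pending.append(packet)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        """Send every queued packet using a single world switch."""
        if not self.pending:
            return
        self.world_switches += 1
        self.sent.extend(self.pending)
        self.pending.clear()


def switches_for(packets, max_batch: int) -> int:
    """Count the world switches needed to send all packets."""
    combiner = SendCombiner(max_batch)
    for packet in packets:
        combiner.transmit(packet)
    combiner.flush()  # drain any partial batch left at the end
    return combiner.world_switches
```

With seven packets, switches_for(range(7), 1) takes seven world switches while switches_for(range(7), 3) takes three: two full batches plus one final flush.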
The third optimization uses shared memory between the VMNet driver and the VMApp to reduce the need to issue select() calls from the VMApp. This optimization allows the VMApp to detect which NIC IRQs require contacting the VMNet driver and which allow an immediate switch back to the VMM without spending extra time in the VMApp. Together, these three optimizations reduce the CPU utilization of the 733 MHz CPU virtual machine to around 78%. The optimizations also more than double the achievable network throughput on a 350 MHz CPU virtual machine.
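The role of the shared memory in the third optimization can be illustrated with a model in which a status flag replaces a system call. Everything named here (SharedStatus, rx_pending, VMAppModel) is hypothetical, and the layout of the real shared region is not described in this paper's summary; the sketch only assumes that the driver publishes whether it has work for the application.

```python
class SharedStatus:
    """Stand-in for memory visible to both the driver and the VMApp."""

    def __init__(self):
        self.rx_pending = False  # assumed flag: driver has a packet ready


class VMAppModel:
    """Counts select() calls the VMApp issues while handling NIC IRQs."""

    def __init__(self, shared: SharedStatus, use_shared_memory: bool):
        self.shared = shared
        self.use_shared_memory = use_shared_memory
        self.select_calls = 0      # modeled system calls
        self.packets_fetched = 0   # IRQs that needed the driver

    def _select(self) -> bool:
        # Models asking the host kernel whether the driver is readable.
        self.select_calls += 1
        return self.shared.rx_pending

    def on_nic_irq(self) -> None:
        if self.use_shared_memory:
            ready = self.shared.rx_pending  # plain memory read, no syscall
        else:
            ready = self._select()
        if ready:
            # Only this case requires contacting the driver.
            self.packets_fetched += 1
            self.shared.rx_pending = False
        # Otherwise control can return to the VMM right away.


def replay(irq_has_packet, use_shared_memory: bool) -> VMAppModel:
    """Replay IRQs; each entry says whether a packet was pending."""
    shared = SharedStatus()
    app = VMAppModel(shared, use_shared_memory)
    for has_packet in irq_has_packet:
        shared.rx_pending = has_packet  # the driver's side of the flag
        app.on_nic_irq()
    return app
```

In both configurations the same IRQs lead to a packet fetch; the difference in the model is only that the shared flag lets the application decide without a select() call.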
The experimental results confirm that the CPU overheads of a hosted virtualization strategy can prevent an I/O-intensive virtual machine workload from matching the performance of the same workload on native hardware. In the straightforward implementation, frequent I/O causes frequent world switches that artificially limit the I/O utilization because the workload becomes CPU bound. However, even while remaining within a hosted virtual machine architecture, we are able to eliminate spurious world switches and even restructure around seemingly mandatory crossings, with a significant reduction in CPU utilization, to the point that a 733 MHz Pentium III system is I/O bound with plenty of CPU cycles to spare.
CPUs are constantly getting faster, and a 733 MHz Pentium III is at or below entry level for today's corporate PCs. Further, very few desktop workloads saturate a full 100 Mbit link with any regularity or frequency. Taken in conjunction with the portability, device independence, and co-existence that a hosted architecture provides, VMware Workstation's achievable I/O performance strikes a good balance between performance and compatibility for its target desktop usage. The balance may, of course, change when gigabit networks become prevalent, depending on how fast CPUs will be by then.