
IRQ notification

The third optimization targets the host system calls used to receive notification of packet sends and receives. The VMApp establishes a region of shared memory with the VMNet driver at initialization, and the driver sets a bit in a shared bitvector whenever packets are available. Then, on every NIC IRQ, instead of performing an expensive select() on all of the devices, the VMApp checks the shared memory, receives any pending packets, and immediately returns to the VMM.
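The C sketch below illustrates the idea of the shared-memory check; it is a minimal sketch under stated assumptions, not the actual VMApp/VMNet interface. The structure and function names (vmnet_shared, recv_pending, vmapp_on_nic_irq, receive_pending_packets) are hypothetical.

    /* Hypothetical sketch of the shared-memory notification path.
     * All names are illustrative; the real VMApp/VMNet interface is
     * not specified at this level of detail in the paper. */
    #include <stdint.h>

    struct vmnet_shared {
        volatile uint32_t recv_pending;  /* bit i set => packets queued for device i */
    };

    /* Placeholder for the VMApp routine that drains a device's receive queue. */
    static void receive_pending_packets(int dev) { (void)dev; }

    /* Invoked by the VMApp on each NIC IRQ.  Instead of issuing a
     * select() system call across all VMNet file descriptors, it reads
     * the bitvector that the VMNet driver maintains in shared memory,
     * drains any devices with pending packets, and returns to the VMM. */
    static void vmapp_on_nic_irq(struct vmnet_shared *shm, int ndevs)
    {
        uint32_t pending = shm->recv_pending;           /* one shared-memory read */
        for (int dev = 0; dev < ndevs && pending != 0; dev++, pending >>= 1) {
            if (pending & 1u)
                receive_pending_packets(dev);
        }
        /* Fall through: resume the VMM immediately; no host system call is made. */
    }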


In summary, the three major optimizations are as follows. Lance-related I/O port accesses from the virtual machine are handled in the VMM whenever possible. During periods of heavy network activity, packet transmissions are combined and sent during IRQ-triggered world switches, which reduces the number of world switches, the number of virtual IRQs, and the number of host IRQs taken while executing in the VMM world. Finally, the VMNet driver is augmented with shared memory that allows the VMApp to avoid calling select() in some circumstances.

Figure 6 shows that these optimizations reduce CPU overhead enough for VM/PC-733 to saturate a 100 Mbit Ethernet link, and the throughput for VM/PC-350 more than doubles. Table 2 lists the CPU overhead breakdown from the time-based sampling measurements on VM/PC-733 with the optimizations in place. Overall, the profile shows that the majority of the I/O-related overhead is gone from the VMM and that there is now time when the guest OS idles. Additionally, guest context switch virtualization overheads now become significant as the guest switches between nettest and its idle task.

The "Guest idle" and "Host IRQ processing while guest idle" categories in Table 2 are derived with input from the direct measurements presented in Section 3.5. A sample-based measurement of idle time indicates that 41.1% of VMM time is spent idling the guest and taking host IRQs while idling. However, separating host IRQ processing time from guest idle time via time-based sampling alone is difficult because of synchronized timer ticks and the heavy interrupt rate produced by the workload. We therefore use direct measurements, which show that 21.7% of total time is spent in the guest idle loop, to arrive at the idle time breakdown in Table 2.

The most effective optimization is handling IN and OUT accesses to Lance I/O ports directly in the VMM whenever possible. This eliminates world switches on Lance port accesses that do not require real I/O. Additionally, Table 1 indicates that accessing the Lance address register consumes around 8% of the VMM's time; taking advantage of the register's memory semantics has completely eliminated that overhead from the profile, as shown in Table 2.

An interesting observation is that the time to transmit a packet via the VMNet does not change noticeably: all of the gains are along other paths. Instrumenting the optimized version in appropriate locations shows that the average cycle count on the path to transmit a packet onto the physical NIC is within 100 cycles of the totals from Figure 5. However, this is contrary to the times in Table 2 for sending via the VMNet driver. The disagreement stems from transmitting more than one packet at a time. When individual packets are sent and timed, the baseline and optimized transmits look very similar, but with send combining active, up to 3 packets are sent back to back. This increases the chance of taking a host transmit IRQ from a prior transmit while in the VMNet driver. Since Table 2 reports the time from the start to the finish of the call into the VMNet driver, it also includes the time the host kernel spends handling these IRQs.
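To make the Lance address register optimization concrete, the sketch below shows how a VMM might short-circuit accesses to the register address port (RAP) using a shadow copy, exploiting the fact that a read simply returns the last value written. This is a hedged illustration under stated assumptions; the constant LANCE_RAP_PORT, the struct virt_lance, and the function vmm_handle_lance_io are hypothetical names, not VMware's implementation.

    /* Hypothetical sketch of handling Lance RAP accesses inside the VMM.
     * Because the RAP has plain memory semantics (a read returns the last
     * value written and neither access has side effects), the VMM can
     * emulate it with a shadow copy and avoid a world switch entirely. */
    #include <stdbool.h>
    #include <stdint.h>

    #define LANCE_RAP_PORT 0x0014   /* illustrative port offset for the RAP */

    struct virt_lance {
        uint16_t rap_shadow;        /* last value the guest wrote to the RAP */
    };

    /* Returns true if the IN/OUT was handled entirely in the VMM; false
     * means the access still requires a world switch to the VMApp. */
    static bool vmm_handle_lance_io(struct virt_lance *lance, uint16_t port,
                                    bool is_write, uint16_t *value)
    {
        if (port == LANCE_RAP_PORT) {
            if (is_write)
                lance->rap_shadow = *value;   /* remember the selected register */
            else
                *value = lance->rap_shadow;   /* memory semantics: echo last write */
            return true;                      /* no world switch, no host IRQs */
        }
        return false;   /* other Lance ports may still need the VMApp */
    }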

