
RPC Microbenchmarks

Table 1 shows latency and bandwidth results from kernel-kernel RPC microbenchmarks using 16-byte control messages and payload sizes of 0 bytes, 4K bytes, and 8K bytes. In these experiments the request message is a 16-byte control message that generates a reply with an attached payload. The client is a Miata; the server(s) are Alcors. For these experiments, Trapeze was configured to use DMA for control messages in order to reduce overhead at the cost of higher latency.

The table presents measurements for ordinary request/response RPC (2-way) and delegated (3-way) RPC, for three reply-handling variants: traditional blocked caller (wait), nonblocking continuation (cont), and deferred continuation (defer). For the replies carrying payloads, we measured the effect of three payload handling schemes: (1) "solicited" payloads received into frames mapped by the Trapeze IPT, (2) "unsolicited" payloads received into a generic payload buffer attached to the receive ring entry (as for a received GMS putpage or movepage), and (3) "unsolicited with copy", in which the received payload is copied from a generic payload buffer into a reply buffer not mapped through the IPT. The third variant is intended to demonstrate the value of the IPT for copy-free reply handling.
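The reply-handling variants above can be sketched as a small dispatcher. This is an illustrative model only, not Trapeze code; every name in it (reply_arrived, drain_deferred, run_queue) is hypothetical:

```python
# Illustrative model (not Trapeze code) of the three reply-handling variants:
# "wait" wakes a blocked caller, "cont" runs the continuation directly from
# the reply handler, and "defer" queues it to run later in process context.
from collections import deque
import threading

run_queue = deque()  # deferred continuations, drained later in process context

def reply_arrived(variant, reply, caller_event=None, continuation=None):
    """Invoked when an RPC reply comes off the receive ring."""
    if variant == "wait":
        caller_event.set()                 # wake the blocked calling process
    elif variant == "cont":
        continuation(reply)                # no context switch: lower latency
    elif variant == "defer":
        run_queue.append((continuation, reply))  # extra hop: higher latency

def drain_deferred():
    """Run any deferred continuations in a safe (process) context."""
    while run_queue:
        cont, reply = run_queue.popleft()
        cont(reply)
```

The sketch mirrors the latency ordering reported below: cont avoids the context switch that wait pays, while defer adds a scheduling hop on top of it.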

What is important here is the low incremental cost and high bandwidth of pagesize payloads, and the effects of the payload handling techniques presented in Sections 2 and 3. For example, we can determine from the 2-way wait numbers that the marginal transfer latency of a solicited payload is about 54μs for 4K and 92μs for 8K, including the cost to map the receiving frame through the IPT. With nonblocking RPCs and continuations, gms_net preserves 87% of the 88 MB/s of bandwidth that raw Trapeze provides with 4K payloads on this platform, and over 92% of the 105 MB/s of raw Trapeze bandwidth using 8K payloads. Interestingly, at most 7% of the remaining throughput is sacrificed by copying the payload at the receiver; this reflects the excellent memory system bandwidth of the Pyxis-based Miata (it is also apparent on our Intel platforms using the new 440LX chipset).
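As a back-of-the-envelope check on these figures (assuming MB/s means 10^6 bytes/s; these lines are ours, not the paper's):

```python
# Delivered bandwidths implied by the fractions quoted in the text.
raw_4k, raw_8k = 88.0, 105.0          # MB/s, raw Trapeze (4K and 8K payloads)
gms_4k = 0.87 * raw_4k                # gms_net preserves 87%  -> ~76.6 MB/s
gms_8k = 0.92 * raw_8k                # "over 92%" lower bound -> ~96.6 MB/s

# Raw per-page transfer time, for rough comparison with the ~54us (4K) and
# ~92us (8K) marginal latencies taken from the table; the marginal latency
# also includes mapping the receiving frame through the IPT.
t_4k_us = 4096 / (raw_4k * 1e6) * 1e6
t_8k_us = 8192 / (raw_8k * 1e6) * 1e6

print(f"gms_net delivers ~{gms_4k:.1f} MB/s (4K), >~{gms_8k:.1f} MB/s (8K)")
print(f"raw transfer time: {t_4k_us:.1f} us (4K), {t_8k_us:.1f} us (8K)")
```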

Several other points are worthy of note. Nonblocking RPCs show a modest improvement in latency because no process context switch is needed to handle the reply; however, that benefit is more than lost if the continuation must be deferred. Delegated (3-way) RPCs, which are common for shared file accesses in GMS, exact a high price in latency but have little effect on bandwidth. Solicited payloads are even cheaper than unsolicited payloads, despite the need to set up and tear down an IPT entry; this is apparently because an unsolicited receive must return the received buffer to the VM page frame pool and allocate a new frame to replace the one consumed from the receive ring entry.


Darrell Anderson
1998-04-27