

3.2 Implementation Techniques

 

From the previous analysis, we know that a 1K packet has to be sent every 14 µs to achieve maximum hardware utilization. However, we had to design the system for an even higher demand: with optimal PCI-bus DMA hardware (1 word per 30 ns cycle), a packet had to be transmitted every 8 µs³. Obviously, the hit-server software must be constructed carefully so as not to delay packet transmission. The following paragraphs discuss the methods we used to achieve these goals.

 

Early Evaluation

Early evaluation is a technique for reducing the latency and improving the throughput of operations. If an operation is requested several times on the same object, it needs to be executed only once. If an operation can be executed either at a place or at a time when free resources are available, its costs are hidden.

Object precompilation ensures that only negligible computations or data transformations are required by the hit-server for delivering a cached object to a client. When loading the object into the hit-server cache, the miss-server partitions it into network packets, generates the appropriate header information and computes a client-independent digest for each packet. On sending a packet, only the destination address, sequence number and sometimes a message-authentication code have to be generated.
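
As an illustration, the following C sketch shows one plausible layout of the precompiled per-packet state; all type and field names are invented for this example and are not the actual hit-server data structures.

#include <stdint.h>

#define PACKET_DATA_SIZE 1024              /* object granularity: 1K packets   */

struct packet_desc {                       /* prepared by the miss-server      */
    uint8_t  header_template[32];          /* precomputed protocol header      */
    uint8_t  digest[8];                    /* client-independent packet digest */
    const uint8_t *data;                   /* points into the cached object    */
    uint16_t length;                       /* <= PACKET_DATA_SIZE              */
};

struct send_fixup {                        /* the only per-delivery work       */
    uint32_t client_addr;                  /* destination address              */
    uint16_t sequence_no;                  /* sequence number                  */
    uint8_t  mac[8];                       /* message-authentication code      */
};                                         /* (generated only when required)   */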

To meet our security requirements, all delivered data has to be secured by a client-specific message-authentication code that is also unique in time to prevent replay attacks. Using the client-specific secret key, the authentication code is calculated from the precompiled client-independent digest and a nonce.
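
The per-packet computation thus reduces to a keyed mixing step over precomputed values. The sketch below illustrates the idea; mix64() is merely a placeholder for whatever keyed one-way function is actually used, and the 64-bit sizes are assumptions made for this example only.

#include <stdint.h>

static uint64_t mix64(uint64_t x)          /* placeholder one-way mixing step */
{
    x ^= x >> 33;  x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;  x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}

/* MAC = f(client key, precompiled digest, nonce); the object data itself
 * is never read when the code is generated. */
uint64_t packet_mac(uint64_t client_key, uint64_t digest, uint64_t nonce)
{
    return mix64(client_key ^ mix64(digest ^ mix64(nonce)));
}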

Many research experiments have shown that avoiding unnecessary data copies improves performance (e.g., Fbufs [6], Unet [19]). Since we use precompiled packets in the object cache, we can always use unbuffered transmission for delivery. With put operations, the arriving 1K packets are linked, not copied, into the new version of the object (see also Section 2.3).
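
A put therefore reduces to a pointer update. The following sketch, again with invented names, shows the kind of link-in-place step meant here: the receive buffer itself becomes the new 1K chunk of the object version.

#include <stdint.h>
#include <stddef.h>

struct chunk {                             /* one 1K unit of an object version */
    const uint8_t *data;                   /* points at the receive buffer     */
    uint16_t length;
};

struct object_version {
    struct chunk *chunks;                  /* one entry per 1K chunk           */
    size_t n_chunks;
};

/* Link the received buffer in place of chunk `index`; no data is copied. */
int put_packet(struct object_version *v, size_t index,
               const uint8_t *rx_buffer, uint16_t length)
{
    if (index >= v->n_chunks || length > 1024)
        return -1;
    v->chunks[index].data   = rx_buffer;
    v->chunks[index].length = length;
    return 0;
}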

 

Per-Object Address Spaces

The default putget operation is implemented by the hit-server in a single address space. Since we want to enforce security requirements on active objects, their specialized operations execute in per-object address spaces.

Remember that the object granularity for put updates is 1K while hardware pages are 4K. Therefore, once a 4K region of an object with a non-default putget operation is no longer physically contiguous, the corresponding page must be removed from the object's address space. Upon a page fault on this page, the corresponding 1K chunks are physically copied into a fresh 4K page frame, which is then mapped into the object's address space. Object-specific putgets often do not read the entire object data themselves but simply specify to the hit-server core which parts should be transmitted. In these cases, the lazy-copying technique just described avoids copying received put data packets even for non-default objects.
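
A minimal sketch of this lazy-copy step is given below; alloc_page_frame() and map_page() stand in for the real µ-kernel primitives, and all names are illustrative rather than the actual hit-server code.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        4096
#define CHUNK_SIZE       1024
#define CHUNKS_PER_PAGE  (PAGE_SIZE / CHUNK_SIZE)

struct chunk { const uint8_t *data; uint16_t length; };

extern void *alloc_page_frame(void);                       /* assumed helper */
extern void  map_page(int space, uintptr_t vaddr, void *frame);

/* Called on a page fault at `vaddr` inside an active object's address space:
 * gather the no-longer-contiguous 1K chunks into a fresh 4K frame and map it. */
void object_page_fault(int space, uintptr_t vaddr,
                       struct chunk page_chunks[CHUNKS_PER_PAGE])
{
    uint8_t *frame = alloc_page_frame();

    for (int i = 0; i < CHUNKS_PER_PAGE; i++)
        memcpy(frame + i * CHUNK_SIZE,
               page_chunks[i].data, page_chunks[i].length);

    map_page(space, vaddr & ~(uintptr_t)(PAGE_SIZE - 1), frame);
}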

The µ-kernel can be configured to support up to 65,000 address spaces. The space cost per address space is low for small objects, about 22 bytes for a 16 K object. Nevertheless, the maximum number of address spaces is only 6.5% of the maximum number of objects (roughly one million). We do not yet know whether this is sufficient in practice to preallocate and preconstruct an address space for every object that uses non-default get and put operations. Otherwise, the hit-server would have to multiplex address spaces among active objects.

 

Minimizing Memory Conflicts

In the hit-server case, copying data in main memory is not only "in principle avoidable" but is among the most expensive operations. Section 3.1 showed that the memory bus is the time-critical bottleneck. Therefore, processor accesses to main memory have to be minimized. Since the L1 and L2 caches use a write-back strategy, the processor's reads and writes are uncritical as long as they hit in the hardware cache and do not touch the memory bus.

Fortunately, early evaluation techniques, in particular object precompilation, prevent unnecessary memory-to-cache copies. Since we use a precalculated digest, there is no need for the default get operation to read the object data.

The code segments of the µ-kernel and the hit-server core are small enough that their frequently used parts fit completely even into the L1 cache, so the hit-server's memory bus is not burdened with handling instruction misses. For data, the situation is different:

  1. Client descriptors, basically the client's secret key and some status information, are not expected to cause frequent L2-cache misses, since they need only 2 cache lines per client. 400 simultaneously active clients need 10% of the L2 cache.
  2. The hash table and name table serve to identify a requested object. Since they are large, most accesses will cause L2-cache misses. Finding a 100-byte name then requires reading 1 line of the hash table and 4 lines of the name table (with 32-byte lines, a 100-byte name spans 4 lines), provided there is no name-hash conflict.
  3. Object descriptors will frequently miss the L2 cache. Due to the large number of objects, we generally assume that the object-descriptor data is never found in the L2-cache upon a get request. Applications with many hot-spot objects will perform slightly better. Memory accesses are minimized by keeping object descriptors small:
    1. The object root holds pointers to the object-specific operation, the object-page descriptor list and the object's ACL. Two cache lines are accessed per request: one from the ACL and one to find the packet descriptors.
    2. Each object-page descriptor contains pointers to the 4 packets forming the page and the corresponding 4 precalculated digest values. Together with links and a length field (objects can be smaller than a page), this fits into one cache line.
  4. Object data is never read by default get operations. So no L2-cache misses occur in this case. Writing an object using a put operation requires message-authentication codes of the received data to be verified. This costs 1024/32 = 32 cache misses per 1K packet. (Checking the authentication code is omitted if the packet comes from a miss-server, since miss-servers are trusted and the inter-server interconnection is a closed network.)
  5. Packet headers have to be constructed per packet transmission. For our hardware, the packet header and the buffer descriptor required for the Ethernet controller together fit into one cache line. Since the Ethernet controller always transfers data from/to main memory, one L2-cache-line write and one read are required per transmission.
  6. Request data is placed in main memory by the receiving Ethernet controller. Obviously, a request packet has to be read by the hit-server. For requesting an object with a 100-byte name, 4 cache lines have to be transferred.

Therefore, in total, a default get on an nK object requires at least 11 + ⌈9n/4⌉ cache-line transfers between the L2 cache and main memory.
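
This total can be reproduced from the per-item counts above with a small arithmetic sketch (assuming 32-byte cache lines and counting only the listed items: 5 lines for the name lookup, 2 for the object root, one page-descriptor line per 4K page, 2 lines per transmitted packet, and 4 lines of request data):

#include <stdio.h>

/* Cache-line transfers for a default get of an n-KB object:
 * 5 (hash + name table) + 2 (object root) + ceil(n/4) (page descriptors)
 * + 2n (packet header write/read) + 4 (request data) = 11 + ceil(9n/4). */
static unsigned get_line_transfers(unsigned n_kb)
{
    unsigned page_desc_lines = (n_kb + 3) / 4;      /* ceil(n/4) */
    return 5 + 2 + page_desc_lines + 2 * n_kb + 4;
}

int main(void)
{
    for (unsigned n = 1; n <= 16; n *= 2)
        printf("%2u KB object: %u cache-line transfers\n",
               n, get_line_transfers(n));
    return 0;
}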


³ From the above discussion, we know that transmitting a 1024 B packet and a 32 B header requires (1+1) [step 1] + (1+1) [step 5] + (256+8) [step 6] = 271 word transfers. So in the ideal situation, 1 word per 30 ns through the PCI bus and no delay by the memory bus, 271 × 30 ns ≈ 8.13 µs.


