

Vnode Interface Support

Vnode/VFS is a kernel interface that separates generic filesystem operations from specific filesystem implementations [20]. It was conceived to provide applications with transparent access to kernel filesystems, including network filesystem clients such as NFS. The vnode/VFS interface consists of two parts: VFS defines the operations that can be done on a filesystem, while vnode defines the operations that can be done on a file within a filesystem. Table 1 lists the vnode operations originally defined [20] to support NFS and a number of local filesystems, along with later additions (VOP_GETPAGES, VOP_PUTPAGES) introduced into BSD systems with a unified file and VM cache to transfer data directly between the VM cache and the disk.
 


Table 1: Vnode ops (Sandberg et al. [20]).

Vnode operation   Description
VOP_ACCESS        Check access permission
VOP_BMAP          Map block number
VOP_BREAD         Read a block
VOP_BRELSE        Release a block buffer
VOP_CLOSE         Mark file closed
VOP_CREATE        Create a file
VOP_FSYNC         Flush dirty blocks of a file
VOP_GETATTR       Return file attributes
VOP_INACTIVE      Mark vnode inactive
VOP_IOCTL         Do I/O control operation
VOP_LINK          Link to a file
VOP_LOOKUP        Lookup file name
VOP_MKDIR         Create a directory
VOP_OPEN          Mark file open
VOP_RDWR          Read or write a file
VOP_REMOVE        Remove a file
VOP_READLINK      Read symbolic link
VOP_RENAME        Rename a file
VOP_READDIR       Read directory entries
VOP_RMDIR         Remove directory
VOP_STRATEGY      Read/write fs blocks
VOP_SYMLINK       Create symbolic link
VOP_SELECT        Do select
VOP_SETATTR       Set file attributes
VOP_GETPAGES      Read and map pages in VM
VOP_PUTPAGES      Write mapped pages to disk

Existing vnode I/O interfaces are all synchronous. VOP_READ and VOP_WRITE take as an argument a struct uio buffer description and have copy semantics. VOP_GETPAGES and VOP_PUTPAGES are zero-copy interfaces that transfer data directly between the VM cache and the disk. VM pages returned from VOP_GETPAGES need to be explicitly wired in physical memory before they can be used for device I/O. An interface for staging I/O should be designed to return buffers in a locked state. We believe that a vnode interface modeled after the low-level buffer cache interface, with new support for asynchronous operation, naturally fits the requirements of a DAFS server as outlined earlier. Such an asynchronous interface is easier to implement than asynchronous versions of VOP_GETPAGES and VOP_PUTPAGES, while being functionally equivalent to them in FreeBSD's unified VM and buffer cache.
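
For reference, a minimal sketch of a copy-semantics read through the existing synchronous interface is shown below; FreeBSD 4.x-era structure fields are abridged, and vp, buf, len, offset, ioflag, and cred are assumed to be supplied by the caller:

                struct iovec iov;
                struct uio uio;

                iov.iov_base = buf;              /* destination buffer in the kernel */
                iov.iov_len = len;

                uio.uio_iov = &iov;
                uio.uio_iovcnt = 1;
                uio.uio_offset = offset;         /* file offset to read from */
                uio.uio_resid = len;             /* bytes remaining; decremented as data is copied */
                uio.uio_segflg = UIO_SYSSPACE;   /* buffer is in kernel address space */
                uio.uio_rw = UIO_READ;

                error = VOP_READ(vp, &uio, ioflag, cred);   /* blocks until the copy completes */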

Central to this new interface (summarized in Table 2) is a VOP_AREAD call, which issues disk read requests and returns without blocking. VOP_AREAD internally uses a new aread() buffer cache interface (described below), integrated with the kqueue mechanism. It takes as one of its arguments an asynchronous I/O control block (kaiocb) used to track the progress of the request.

                aread(struct vnode *vp, struct kaiocb *cb)
                {
                        derive block I/O request from cb;
                        bp = getblk(vp, block request);         /* returns a locked buffer */
                        if (block not found in the buffer cache) {
                                register kevent using EVFILT_KAIO;
                                register kaio_biodone handler with bp;
                                VOP_STRATEGY(vp, bp);           /* start disk I/O */
                        }
                }
On completion of a request issued by aread(), the data is in the buffer bp, which remains locked, and kaio_biodone() is called to deliver the completion event:
                kaio_biodone(struct buf *bp) 
                {
                        get kaiocb from bp;
                        deliver event to knote in klist of kaiocb;
                }
To unlock buffers and update filesystem state if necessary, VOP_BRELSE is used. Local filesystems would implement the interface of Table 2 in order to be exported efficiently by a DAFS server. In the absence of this or another suitable interface, a local filesystem can still be exported by a DAFS server using existing interfaces, albeit with higher overhead, mainly due to multithreading.
 


Table 2: Vnode Interface to the Buffer Cache.

Vnode operation   Description
VOP_BREAD         Lock all buffers needed for I/O; read from vp.
VOP_AREAD         Lock all buffers needed for I/O; read from vp; don't block.
VOP_BDWRITE       Mark dirty entries; delayed write to vp; update state if requested.
VOP_BWRITE        Synchronous (blocking) write to vp; update state if requested.
VOP_BAWRITE       Mark dirty entries; async write to vp; update state if requested.
VOP_BRELSE        Unlock buffers; update file state if requested.

Network event delivery can be integrated with disk I/O event delivery, as described earlier, through the kevent mechanism. Each time an RDMA descriptor is issued, a kevent is registered using the EVFILT_RDMA filter and recorded in the completion group (CG) structure. Completion group handlers then deliver kqueue events:

                send_event(CG *cq, Transport *vi)
                {
                        deliver event to knote in klist of cq;
                }
The DAFS server is notified of new events by periodically polling the kqueue; alternatively, a common handler can be invoked each time a network or disk event occurs. We illustrate the use of the proposed vnode interface to the buffer cache by breaking down the steps of read and write operations as implemented by a DAFS server. For comparison with existing interfaces, we describe the same steps as implemented by NFS. Without loss of generality, we assume FFS as the underlying filesystem at the server.
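
Before the walkthrough, the following is a minimal sketch of such a polling loop, written as if it ran at user level with the kevent(2) system call (an in-kernel server would instead use the corresponding internal kqueue interfaces); kq, NEVENTS, and the two handler functions are illustrative names, and EVFILT_KAIO and EVFILT_RDMA are the filters proposed in this section:

                struct kevent ev[NEVENTS];
                struct timespec ts = { 0, 0 };           /* poll without blocking */
                int i, n;

                for (;;) {
                        n = kevent(kq, NULL, 0, ev, NEVENTS, &ts);
                        for (i = 0; i < n; i++) {
                                if (ev[i].filter == EVFILT_KAIO)
                                        handle_disk_event(&ev[i]);   /* buffers resident and locked */
                                else if (ev[i].filter == EVFILT_RDMA)
                                        handle_net_event(&ev[i]);    /* RDMA descriptor completed */
                        }
                        /* other server work between polls */
                }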

Read. With DAFS, a client (direct) read request carries the remote memory addresses of the client buffers. The DAFS server issues a VOP_AREAD to read and lock all necessary file blocks in the buffer cache. VOP_AREAD starts disk operations and returns without blocking, after registering with the kqueue. Once the pages are resident and locked and the server has been notified via the kqueue, it issues RDMA Write operations to client memory for all requested file blocks. When the transfers are done, the server issues VOP_BRELSE to unlock the file buffer cache blocks.
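
In the style of the pseudocode above, the read path could look as follows (dafs_read, dafs_req, and the client address field are illustrative names, not part of the proposed interface):

                dafs_read(struct vnode *vp, struct dafs_req *req)
                {
                        build kaiocb cb covering the requested file blocks;
                        VOP_AREAD(vp, cb);       /* lock buffers, start disk I/O, return */
                        ...                      /* kqueue delivers the completion event */
                        issue RDMA Write from the locked buffers to req->client_addr;
                        ...                      /* RDMA transfers complete */
                        VOP_BRELSE to unlock the buffers;
                }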

With NFS, on a client read operation the server issues a VOP_READ to the underlying filesystem with a uio parameter pointing to a gather/scatter list of mbufs that will eventually form the response to the read request RPC. In the FFS implementation of VOP_READ and without applying any optimizations, a loop reads and locks file blocks into the buffer cache using bread(), subsequently copying the data into the mbufs pointed to by uio. For page-aligned, page-sized buffers, page-flipping techniques can be applied to save the copy into the mbufs.
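
A simplified sketch of that loop, using FreeBSD 4.x-era buffer cache calls (boundary conditions, read-ahead, and error handling are omitted, and fs, vp, uio, and the local variables are assumed to be in scope):

                while (uio->uio_resid > 0) {
                        lbn = lblkno(fs, uio->uio_offset);      /* logical block number */
                        on = blkoff(fs, uio->uio_offset);       /* offset within the block */
                        n = min(fs->fs_bsize - on, uio->uio_resid);

                        error = bread(vp, lbn, fs->fs_bsize, NOCRED, &bp);   /* lock block in cache */
                        if (error)
                                break;
                        error = uiomove((char *)bp->b_data + on, n, uio);    /* copy into the mbufs */
                        brelse(bp);                                          /* unlock the buffer */
                        if (error)
                                break;
                }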
 

Write. With DAFS, a client (direct) write request carries only the client memory addresses of the data buffers. The DAFS server uses VOP_AREAD to read and lock all necessary file blocks in the buffer cache. Once the pages are resident and locked, it issues RDMA Read requests to fetch the data from the client buffers directly into the buffer cache blocks. When the transfer is done, the server uses one of VOP_BWRITE, VOP_BDWRITE, or VOP_BAWRITE, depending on whether the request asks for a stable write, to issue the disk write and unlock the buffers. Additionally, a metadata update is performed if requested.
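
In the style of the pseudocode above, the write path could look as follows (again, dafs_write, dafs_req, and the client address field are illustrative names):

                dafs_write(struct vnode *vp, struct dafs_req *req)
                {
                        build kaiocb cb covering the file blocks to be written;
                        VOP_AREAD(vp, cb);       /* lock buffers, start disk reads, return */
                        ...                      /* kqueue delivers the completion event */
                        issue RDMA Read from req->client_addr into the locked buffers;
                        ...                      /* RDMA transfers complete */
                        if (stable write requested)
                                VOP_BWRITE;                     /* synchronous write; unlock */
                        else
                                VOP_BDWRITE or VOP_BAWRITE;     /* delayed or async write; unlock */
                        update metadata if requested;
                }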

With NFS, a client write operation carries the data to be written inline with the RPC request. The NFS server prepares a uio with a gather/scatter list of all the data mbufs and calls VOP_WRITE. Apart from the uio parameter that describes the transfer, an ioflags parameter is passed to signify whether the write to disk should happen synchronously. With NFS version 2, all writes and metadata updates are synchronous; NFS versions 3 and 4 allow asynchronous writes. In the FFS implementation of VOP_WRITE, a loop reads and locks the file blocks to be written into the buffer cache using bread(), copies into them the data described by the uio, and then issues one of bwrite() (synchronous), bdwrite() (delayed), or bawrite() (asynchronous), depending on whether this is a stable write request (see ioflags). Finally, a metadata update is performed if requested.
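
A simplified sketch of that loop, again with FreeBSD 4.x-era calls (block allocation and end-of-file handling are omitted; the choice shown here between bawrite() for full blocks and bdwrite() for partial blocks is common FFS practice, not something mandated by the interface):

                while (uio->uio_resid > 0) {
                        lbn = lblkno(fs, uio->uio_offset);      /* logical block number */
                        on = blkoff(fs, uio->uio_offset);       /* offset within the block */
                        n = min(fs->fs_bsize - on, uio->uio_resid);

                        error = bread(vp, lbn, fs->fs_bsize, NOCRED, &bp);   /* lock block in cache */
                        if (error)
                                break;
                        error = uiomove((char *)bp->b_data + on, n, uio);    /* copy data from the mbufs */
                        if (error) {
                                brelse(bp);
                                break;
                        }

                        if (ioflag & IO_SYNC)
                                bwrite(bp);      /* stable: synchronous write, then unlock */
                        else if (on + n == fs->fs_bsize)
                                bawrite(bp);     /* full block: start asynchronous write */
                        else
                                bdwrite(bp);     /* partial block: delayed write */
                }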

An interesting aspect of implementing file writes with server-issued RDMA Read (instead of client-initiated RPC or RDMA Write) is that the server can pace its reads from client memory so that data arrives no faster than dirty buffers can be written to disk. This bandwidth-matching capability becomes very important in multi-gigabit networks, where network bandwidth is often greater than disk bandwidth.

