Block I/O
Jens Axboe led a session on the block I/O subsystem, which, of course, has been massively reworked in the 2.5 series. His talk concentrated on the issues that remain to be resolved.
One of those issues is ordered writes and barriers. Journaling filesystems need, at times, to be sure that certain operations have completed before others can be started. Without write barriers, the transactional nature of the filesystem is lost. The infrastructure for write barriers is there now; what remains is to push the implementation down into the block drivers. For IDE drives, this will be done with cache flushes before and after the barrier. For SCSI drives, ordered tags can be used. That requires, however, that the SCSI layer use the generic tag code which has been implemented in the block layer; that work is in progress.
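The flush-before-and-after approach described for IDE drives can be illustrated with a toy model. This is only a sketch: `CachedDisk` and `barrier_write` are hypothetical names invented for the illustration, not kernel interfaces, and the "reordering" done by the drive's cache is simulated by sorting.

```python
class CachedDisk:
    """Toy drive with a volatile write cache (hypothetical model, not a kernel API)."""

    def __init__(self):
        self.cache = []    # writes accepted by the drive but not yet durable
        self.platter = []  # durable writes, in the order they reached the media

    def write(self, block):
        self.cache.append(block)

    def flush(self):
        # A drive is free to commit cached writes in any order; sorting
        # stands in for that reordering freedom.
        self.platter.extend(sorted(self.cache))
        self.cache.clear()


def barrier_write(disk, block):
    """Issue `block` as a barrier: flush, write, flush."""
    disk.flush()       # everything queued earlier becomes durable first
    disk.write(block)
    disk.flush()       # the barrier is durable before any later write is accepted


disk = CachedDisk()
disk.write("data-2")
disk.write("data-1")
barrier_write(disk, "commit")   # e.g. a journal commit record
disk.write("data-0")
disk.flush()
order = disk.platter
```

However the drive reorders writes inside a single flush, the commit record can never land before the data written ahead of it, nor after the data written behind it.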
Multipage I/O still raises issues. Many I/O operations generated by the system are large; it is vastly preferable to keep them together so that they can be handled efficiently by the hardware. The problem is that the lower layers sometimes cannot handle large requests: the hardware itself may impose limits, or the block device could be a virtual device (a RAID or LVM device) which must split the request anyway.
One way of handling this problem is to split requests that turn out to be too large. But splitting is an ugly and inefficient process; it is best avoided. A better approach would be to involve the device driver in the construction of block I/O requests; a new interface would allow the requests to be built, page by page, with the driver telling the block layer when the request gets too large.
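The page-by-page construction idea might look roughly like the following. The interface had not been settled at the time of the talk, so everything here (`Driver.can_add_page`, `build_requests`, the `Request` class, and the 4096-byte page size) is an assumption made for the sake of illustration.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class Request:
    """A block I/O request under construction (hypothetical structure)."""

    def __init__(self):
        self.pages = []


class Driver:
    """Hypothetical driver that caps each request at max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes

    def can_add_page(self, request):
        # The driver, not the block layer, decides when a request is full.
        return (len(request.pages) + 1) * PAGE_SIZE <= self.max_bytes


def build_requests(driver, pages):
    """Build requests one page at a time, asking the driver before each add."""
    requests = []
    current = Request()
    for page in pages:
        # A request always takes at least one page, even if the driver's
        # limit is smaller than a page; otherwise no progress could be made.
        if current.pages and not driver.can_add_page(current):
            requests.append(current)
            current = Request()
        current.pages.append(page)
    if current.pages:
        requests.append(current)
    return requests


driver = Driver(max_bytes=3 * PAGE_SIZE)
requests = build_requests(driver, list(range(8)))
sizes = [len(r.pages) for r in requests]
```

With a three-page limit, eight pages end up in requests of three, three, and two pages, and nothing ever has to be split after the fact.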
Even then, though, it seems that splitting may be necessary in some situations. The remaining cases could probably be solved in a simple way, however; the offending request could just be resubmitted one sector at a time. This solution is slow, but it shouldn't be needed that often.
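The sector-at-a-time fallback could be sketched as below. `PickyDevice`, `TooLarge`, and `submit_with_fallback` are hypothetical names, and the way a device would actually reject an oversized request is an assumption.

```python
class TooLarge(Exception):
    """Raised by the toy device when a request exceeds its limit."""


class PickyDevice:
    """Hypothetical device that rejects requests over max_sectors."""

    def __init__(self, max_sectors):
        self.max_sectors = max_sectors
        self.submissions = []  # (start_sector, sector_count) pairs accepted

    def submit(self, start, count):
        if count > self.max_sectors:
            raise TooLarge(f"{count} sectors exceeds limit of {self.max_sectors}")
        self.submissions.append((start, count))


def submit_with_fallback(device, start, count):
    """Try the whole request; if rejected, resubmit one sector at a time."""
    try:
        device.submit(start, count)
    except TooLarge:
        # Slow path: every device can handle a single sector.
        for sector in range(start, start + count):
            device.submit(sector, 1)


device = PickyDevice(max_sectors=4)
submit_with_fallback(device, 0, 4)     # accepted as one request
submit_with_fallback(device, 100, 6)   # rejected, then sent as six single sectors
```

The slow path costs one submission per sector, which is tolerable only because it is expected to be rare.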
Unlike the rest of the block I/O subsystem, I/O scheduling remains essentially unchanged since 2.4. The current elevator code works well in most situations, but one can always try to do better. Jens has been experimenting with a variation of the scheduler which would enforce an upper bound on the latency for any given request. The modified elevator can guarantee that any request will be executed within one second, with a 3% performance penalty. Lowering the deadline to 100ms raises the performance hit to 8%. Most people seemed to think that this penalty was acceptable. Future work could include prioritizing requests and "anticipatory scheduling" - delaying read requests slightly in the hope that they can be clustered with other requests.
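The latency bound can be illustrated with a small simulation. This is not Jens's implementation; it is a minimal sketch of the general idea, assuming a one-way elevator, a logical clock counted in abstract ticks, and a hypothetical `DeadlineElevator` class.

```python
class DeadlineElevator:
    """Toy elevator that serves in sector order unless a request has expired."""

    def __init__(self, deadline):
        self.deadline = deadline  # maximum wait, in abstract ticks
        self.pending = []         # (sector, arrival_time) pairs

    def add(self, sector, now):
        self.pending.append((sector, now))

    def next_request(self, now, head):
        if not self.pending:
            return None
        oldest = min(self.pending, key=lambda r: r[1])
        if now - oldest[1] >= self.deadline:
            # The oldest request has waited too long: serve it regardless
            # of where the head is, accepting the extra seek.
            choice = oldest
        else:
            # Normal elevator order: lowest sector at or beyond the head,
            # wrapping around to the lowest sector overall.
            ahead = [r for r in self.pending if r[0] >= head]
            choice = min(ahead or self.pending, key=lambda r: r[0])
        self.pending.remove(choice)
        return choice[0]


elevator = DeadlineElevator(deadline=3)
elevator.add(900, now=0)          # a distant request that could be starved
served = []
head = 0
for now in range(5):
    elevator.add(10 + now, now)   # a steady stream of requests near the head
    head = elevator.next_request(now, head)
    served.append(head)
```

Here the request for sector 900 is passed over for three ticks while nearby requests keep arriving, then gets served as soon as its deadline expires. The long seek forced at that point is the source of the throughput penalty, which is why a shorter deadline costs more.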
Finally, the task of removing buffer heads from the block I/O subsystem continues.