The following paper was originally published in the Proceedings of the Tenth
USENIX System Administration Conference, Chicago, IL, USA, Sept. 29 - Oct. 4, 1996.

For more information about USENIX Association contact:
1. Phone:   (510) 528-8649
2. FAX:     (510) 548-5738
3. Email:   office@usenix.org
4. WWW URL: https://www.usenix.org


The Brave Little Toaster Meets Usenet

Karl L. Swartz - Network Appliance


ABSTRACT

Usenet volume has been growing exponentially for many years; this growth places ever-increasing demands on the resources of a netnews server, particularly disk space - and file system performance - for the article spool area. Keeping up with this demand became substantially more difficult when it could no longer be satisfied by a single disk, and since netnews is incidental to SLAC's research mission, we wanted to find a solution that could easily scale to meet future growth while requiring minimal system administration effort. In the process of evaluating the various solutions that were proposed, we developed benchmarks to measure performance for this specialized application, and were surprised to find that some of our beliefs and intuition were not supported by the facts.

The alternatives considered by SLAC are described, as are the benchmarks developed to evaluate the alternatives, and the results of those benchmarks. While netnews is the application we examined, our experience will hopefully provide inspiration for others to more carefully evaluate their applications instead of using stock benchmarks that may not correlate well with the intended use. Our results may also break down some biases and encourage the reader to consider alternatives which might otherwise have been ignored.

----------------
[-]This work supported by the United States Department of Energy under contract number DE-AC03-76SF00515, and simultaneously published as SLAC PUB-7254.
----------------

Introduction

This paper discusses work begun while the author was employed by the Stanford Linear Accelerator Center (SLAC).

In 1992, Usenet had outgrown the then-current netnews server at SLAC. A study of growth trends, based on data gathered from our server and other sources, provided forecasting tools to guide the configuration of a server that would be adequate until early 1995 [1]. Despite an unprecedented surge in growth between mid-1993 and early 1995 [2], that server managed to outlive its original design life with minimal tinkering, lasting until mid-1995 before suffering a major collapse due to netnews volume.

Even before that collapse, SLAC had been studying what to do for a next generation server. The laboratory's primary mission is research in high-energy physics and related fields, so the ongoing drain of resources for the care and feeding of an incidental service like Usenet was becoming an irritant, especially in a time of shrinking budgets and layoffs. Four goals were established to guide specification of the new server:

  o Capacity and performance to accommodate projected growth in traffic and readership thru 1997, without reducing newsgroups or expiration times.
  o Simplicity of future growth.
  o Greater reliability.
  o Reduction in administrative labor costs.
The capital equipment budget for this project was hardly munificent, imposing yet another constraint. (This was eased somewhat after it was shown that the initial budget would not even pay for the necessary disks.)

The key to this problem was the second point, simplicity of future growth. Most of the reliability shortcomings of SLAC's earlier netnews servers had come as they neared their design limits and became increasingly susceptible to collapse when a minor surge in traffic overwhelmed their strained resources. By mid-1995, shutting down building power or networking for maintenance over a weekend could result in the netnews server spending several weeks to work through the resulting backlog. Nursing and tuning such a sickly server just so it could do the job was a major consumer of labor at times. For the holiday shutdown, netnews - along with mail and payroll - was deemed a critical system which would be fixed immediately, instead of waiting until after New Year's Day, because of the cost of recovering from a protracted outage.

Critical Server Resources

The core of a netnews server consists of two large databases located on disk: the articles themselves; and the history file, which tracks which articles have been received and when they should be removed. For a full news feed and a given set of expiration times, the size of each is roughly proportional to the number of articles accepted per week. From 1984 until mid-1993 this value grew at a remarkably consistent rate of approximately 67% per year, or doubling every 15 to 16 months [1, 3]. For nearly two years starting mid-1993, growth surged to a 100% annual rate before dropping back to the historic curve and perhaps even lower [2].

This history file is still relatively manageable despite exponential growth - with the generous expiration times used by SLAC, the history file and associated indexes will only require about 763MB at the end of 1997[1].

----------------
[1]SLAC uses 17 days as the expiration period for essentially all newsgroups and for history data. History data had been kept for 30 days, but when disk and memory constraints became severe it was decided that this no longer added much value given the fast propagation of netnews in the net today.
----------------

The main problem is that in older versions of the dbz library used by C News and INN, the dbzagain() routine did not automatically reduce the size of the tag stored in the history.pag file as the history file grew beyond the 2^n bytes initially planned for. SLAC discovered this when investigating why expire was taking over two days to run - it seemed to be trying to keep the entire history file in memory and was thrashing badly. Rebuilding the history dbz index provided a quick fix until updated software could be installed to keep the problem from recurring as the relentless growth continued.

Unfortunately, the article spool area is a far more difficult beast to tame. Using the SLAC server as an example again, 8.9GB will be needed at the start of 1997, growing to 14.9GB by the end of the year. Even the later, larger size wouldn't be too bad if not for the fact that it will consist of nearly 4.7 million files, with over 1.8 million new files being created (and nearly as many deleted) each week.
The average rate is three file creates and deletes per second - greater capacity is needed to handle short-term surges. If that isn't bad enough yet, the problem gets worse if the articles are not all stored in a single file system.

The reason for this is that the structure of news is mapped directly onto the file system. The hierarchical nature of newsgroup names becomes a directory hierarchy, and each article is stored in its own file in the directory corresponding to its newsgroup. Cross-posted articles are implemented with links - preferably hard links, though symbolic links are used if necessary. This is the reason for wanting to have the entire article spool in a single file system, since a hard link requires just another directory entry, while a symbolic link imposes yet another file creation (and eventually deletion) along with hits on two file systems when referenced.

Using multiple file systems also forces the news administrator to invest effort in guessing how to allocate newsgroups to the available file systems in a manner which balances both space and load. It may be difficult to change this allocation later, and a good balance now may not be good in the future if one set of groups grows faster than another. Managing such a setup is an intractable problem.

Large File System Alternatives

SLAC's netnews server was using two file systems for the article spool so we were all too familiar with the problems with that solution. It was fairly clear that a single, large disk would probably be a problem for performance, even if we could get one that was large enough (9GB, the largest readily available disk when the system was being acquired, would work, but not even into 1997) and access it as a single file system (with SunOS, we were limited to a 2GB file system). That would still leave the requirement for simplicity of future growth unaddressed.

To get a single, large file system, Sun's OnLine: DiskSuite [4] appeared to be the answer. Our netnews server was a Sun (running SunOS 4.1.3) and other sites seemed to be successfully using it for news. This product increases the maximum size of a file system on SunOS from 2 gigabytes to 1 terabyte. It allows the creation of large volumes via striping (``RAID 0'') or non-interleaved concatenation of multiple disks. It also offers the option of higher availability and reliability through mirroring (RAID 1) [5] and hot spares.

The choice between organizing the disks as a concatenation or a stripe set was difficult. Striping would seem to be better for performance since it spreads data over all disks. A superficial analysis of concatenation suggests it would fill most of one disk before moving on to the next one. However, the BSD Fast File System creates directories ``in a cylinder group that has a greater than average number of free inodes, and the smallest number of directories in it,'' then tries to place files close to where their directory is located [6]. With the large number of directories in the article spool, one would expect data to be spread amongst disks fairly quickly.

The weakness of RAID 0 is that the size of the stripe set is fixed when the RAID virtual device is created, whereas a concatenation can be expanded as needed. A file system on a stripe set can be expanded by concatenating another stripe set, but that may mean buying more disks than are required for the desired capacity. With the price of disks always dropping (while capacity and performance increase), buying disks well ahead of need is not appealing.
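As an aside before weighing these alternatives further, the cross-posting behavior described above is easy to see in terms of what actually happens on disk. The following sketch uses hypothetical group names, article numbers, and a hypothetical second spool path; the point is simply that a hard link costs only a directory entry, while a spool split across file systems forces a symbolic link, which is itself another file to create now and delete at expire time.

    # A sketch of how a cross-posted article is filed; group names, article
    # numbers, and the second spool path are illustrative only.
    cd /var/spool/news
    mkdir -p comp/sys/sun/admin news/admin/misc

    # The article (here, the hypothetical file "article") arrives in its
    # first newsgroup:
    cat article > comp/sys/sun/admin/12345

    # A second group on the same file system gets a hard link, i.e., just
    # another directory entry pointing at the same inode:
    ln comp/sys/sun/admin/12345 news/admin/misc/6789

    # If news.admin.misc lived on another file system, the hard link would
    # fail ("Cross-device link") and a symbolic link, a separate file that
    # must be created now and deleted later, would be needed instead:
    # ln -s /var/spool/news/comp/sys/sun/admin/12345 /news2/news/admin/misc/6789

Keeping the entire spool in one large file system, however that file system is constructed, avoids the symbolic link case entirely, which is the motivation behind all of the alternatives considered here.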
An alternative to buying a full stripe set's worth of disks ahead of need is to concatenate just one or two disks at a time, but that raises the same performance concerns that motivated the choice of striping in the first place. Why not just start with a concatenation?

    Many system managers claim that holes in an NNTP stream are more
    valuable than the data. [7]

While many might debate the value of most netnews content, there's no doubt that a file system composed of multiple disks in which there is no redundancy - wherein a single drive failure can cause the loss of all file data - is not a step towards the goal of greater reliability. It also adds to administrative costs because a failure becomes a crisis instead of a nuisance which can be resolved at relative leisure.

With OnLine: DiskSuite, mirroring (RAID 1) is the only available solution, possibly with one or more hot spares to even further reduce the urgency of a failure. Mirroring unfortunately requires twice as many disks, preferably spread over twice as many controllers. With a requirement of 14.9GB, and using fast 4GB disk drives, eight drives and two controllers would be needed. Another year's growth would require two more controllers, forcing us to consider a more expensive SPARCserver 20 instead of the SPARCserver 5 we were contemplating, just to get enough SBus slots. The hardware costs were escalating at an alarming rate! The only bright spot was that the ability to choose from amongst several disks when reading articles might mean a mirrored file system would perform better, assuming reads account for a significant percentage of the requests and the overhead of mirroring doesn't overwhelm this advantage.

RAID 5 would certainly have reduced the hardware investment, but DiskSuite didn't have it until the Solstice DiskSuite 4.0 release [8]. This would have required at least Solaris 2.3, and with limited resources, SLAC had not yet taken on the challenge of supporting Solaris 2. Other vendors were also undesirable because we hoped not to invest the effort in migrating our netnews service to a whole new operating system.

Around the time of this design effort, SLAC's High-Performance Computing Team (known informally as the ``Farm Team'') was looking at various large, high-performance file systems for use in a prototype data analysis effort. One of the alternatives being explored was a Network Appliance filer (commonly known as a toaster because of its appliance-like simplicity), and it was suggested as an appealing solution to the netnews problem. This was not the first time this product had been considered for netnews at SLAC - the large file system was very appealing - but the prospect of NFS achieving adequate performance relative to local disk for the many small files involved in processing netnews seemed far-fetched. There weren't any other appealing ideas on the horizon, so with some trepidation, we looked at the Network Appliance product further.

In addition to the large file system, it had several other appealing features. More disk drives could be added as needed, even a single disk of different geometry from the others, without impacting performance.

Even more interesting was the addition of support for very large directories [9] since very high volume newsgroups produce directories with many thousands of files in them. Processing all files in a directory of n files requires Order(n^2) search time in a traditional UNIX file system.
BSD 4.3 reduces this to Order(n) for programs that process files in sequential, directory order [10], but netnews often accesses files in a manner which reduces the effectiveness of this optimization. The BSD solution also does not help file creation, which is also Order(n) (i.e., Order(n^2) to populate a directory of n files). The Network Appliance design only needs to examine n/256 directory entries in the file creation case, and by using hash signatures even fewer string comparisons are required. This is still technically Order(n), but with a much smaller constant multiplier it should be considerably faster. (Processing all files in a directory in non-sequential order is similarly still Order(n^2) but with a smaller multiplier.)

We agreed to at least give an NFS solution from Network Appliance a chance, and they supplied a NetApp 1400 for evaluation.

Benchmark Specification

Standard benchmarks can be useful tools for comparing the performance of products from different vendors for common uses. Their usefulness is diminished when one is confronted with a very specialized application. A netnews server places unique and demanding load patterns on a file system, so standard benchmarks were not considered for more than a moment in evaluating NetApp's filer. Besides, the obvious benchmark would have been SPEC SFS (LADDIS) [11] which only tests NFS. Thus it could not have been run against a local file system, one of the two alternatives we were considering. We therefore set about devising a suitable application benchmark which could be run against both alternatives and which would provide data which could be clearly correlated to an actual netnews server.

The first step was to identify the key activities of a netnews server. Four such activities were identified.

  o Receiving and storing new articles.
  o Sending articles to other sites.
  o Expiring old articles.
  o Serving articles to readers.

Receiving and storing new articles consumes the majority of most netnews servers' time. There is no opportunity for parallelism, even in the presence of multiple news feeds, because the incoming article streams are effectively serialized before checking the history file and possibly updating it and storing the article. The goal of a server able to support the load expected at the end of 1997 means the server must be able to process each article in less than a third of a second.

One of the challenges is that adding a file to a directory is an Order(n) problem in a normal UNIX file system, as discussed above. Each new article requires a file creation, and possibly the addition of links to other directories if the article was cross-posted. Typical user directories rarely have more than a few hundred entries, and adding files to them does not significantly impede other activity on the system. Unfortunately, neither property is true for netnews.

Sending articles to other sites was not initially perceived as being an important activity with regard to the file system, since nearly all of SLAC's outgoing news feeds use NNTP, and INN[2] [12]
tries to send articles as soon as they are accepted, which means they come from buffer cache and not from disk.

----------------
[2]SLAC was still running C News at the time, even though it was fairly clear that INN was better suited to SLAC's needs. With resources scarce, installing the new software had been deferred until the new server was acquired.
----------------

David Lawrence noted that UUNET had encountered problems catching up when downstream NNTP sites went down. Reading the articles back from disk turned out to be a significant bottleneck [13].

Much of the process of expiring old articles happens in parallel with other netnews processing, but if expire doesn't get rid of old articles fast enough, incoming news may be stalled until sufficient space is available. Traditionally, expire's file deletion pattern exhibits the Order(n^2) behavior of pathological cases.[3]

----------------
[3]Netnews is exceedingly good at finding and exploiting pathological cases to greatest disadvantage.
----------------

While a command like rm will process files in the order they appear in a directory, taking advantage of BSD's optimizations, expire generates delete requests within a given directory in the order the files were created. With previous expirations plus article cancellations and other activities creating many holes in a directory, creation order may end up being fairly random. Expire also tends to jump from one directory to another, rather than focusing its efforts on one directory before moving on, which not only may cause disk seeks but also causes BSD to flush the cache which it uses to improve sequential directory accesses. Fortunately, INN includes the fastrm program which avoids these and other shortcomings of older expire implementations. INN's expire merely generates a list of files which it wants to delete, and feeds this list to fastrm to do the deletion.

Finally, serving articles to readers would seem to be a very important part of what a netnews server does. In terms of performance, given a modest reader population, this task is in fact of little consequence. Multiple readers can be served in parallel, and any given reader most likely won't mind if fetching an article takes half a second instead of a quarter of a second. Such delays won't cause a backlog of work to pile up, at least for the server. The odds are that most articles won't ever be read anyway. Consider that at the time of LISA X, SLAC's netnews server is expected to be accepting about 900,000 articles per week.
SLAC has roughly 1,200 employees and if half of them read netnews, each will have to read an average of 1,500 articles per week - 1,500 different articles from those read by anyone else at SLAC - for all of the incoming articles to be read. With a much larger user community, such as at a large Internet Service Provider, all of the articles might be read, but not at a place such as SLAC.

With these guidelines in mind, a benchmark was constructed which would measure the performance of receiving articles and of expiring them. Batching articles was added to the test set later on, after the need became more apparent.

Benchmark Construction

The simplest way to construct a meaningful and repeatable benchmark seemed to be to capture a snapshot of a news feed, then feed it into actual netnews software, timing key parts of the process. While the new server was expected to run INN, C News[4] was chosen for the benchmarks as its many pieces seemed better suited to isolation and individual examination than the monolithic structure of INN. The data structures on disk are identical and are manipulated in similar ways, so for a study of file system performance, results from one should apply to the other.

----------------
[4]The ``Cleanup Release of C News, with patch CR.E,'' from January 1995.
----------------

The benchmark is a simple shell script performing some setup work, then multiple passes consisting of three phases: unbatch, batch, and expire. The critical piece of each phase is invoked with time to collect elapsed time and other statistics.[5]

----------------
[5]Nfsstat -c is invoked before and after the timed piece to also capture NFS statistics. This data has not yet been analyzed as it doesn't directly impact the results of this project. It may provide some interesting tuning information, though.
----------------
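The overall shape of such a driver might look roughly like the sketch below. This is not the actual SLAC script; the directory names and the do_relaynews and do_expire wrappers are hypothetical, while relaynews and batcher are the C News programs discussed in the following paragraphs.

    #!/bin/sh
    # Sketch of the benchmark driver's structure - not the actual script.
    PASSES=14                               # 10 for the abridged runs
    SPOOL=/var/spool/news
    TOGO=$SPOOL/out.going/downstream/togo   # batch list for the fake feed
    BATCHES=/bench/batches                  # captured news batches, by pass

    cd $SPOOL
    pass=1
    while [ $pass -le $PASSES ]; do
        # Unbatch phase: stage this pass's batches in the incoming spool,
        # then time relaynews over them (do_relaynews is a hypothetical
        # wrapper invoking relaynews with the arguments newsrun would use).
        cp $BATCHES/pass$pass/* $SPOOL/in.coming
        time do_relaynews

        # Move aside the batch list generated by this pass's unbatch phase.
        mv $TOGO togo.$pass

        # Batch phase: every third pass, time batcher reading a saved batch
        # list, discarding its output.  (Which saved list is used is
        # described in the text; the just-saved one is assumed here.)
        if [ `expr $pass % 3` -eq 0 ]; then
            time batcher < togo.$pass > /dev/null
        fi

        # Expire phase: the synthesized expire, sketched later.
        time do_expire $pass

        pass=`expr $pass + 1`
    done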
Only the core program in each phase is actually used, since the locking and other overhead of the higher-level scripts would serve no purpose and would obscure the results that are of interest.

The unbatch phase of the benchmark, which measures the receiving and storing of new articles, is fairly straightforward. First, a large number of batches was copied to the incoming spool area. Enough data was used to ensure that all caches were flushed so as not to mask the performance of the underlying file systems. Then, the relaynews program was invoked directly on the batches. The same arguments were used as the newsrun script would use in a live system, except that no stale value was specified since the batches would likely contain very old news by the time the final benchmarks were run.

The batch phase measures sending backlogged articles to other sites. To generate the batches, C News was configured to feed a single downstream site with the following sys file:

    # what we'll accept
    ME:all
    # downstream - everything (almost)
    downstream:all,!junk:f:

In each pass, the batch generated by the unbatch phase is moved aside. Actual batching is only done every third pass in an attempt to reduce the already lengthy test time without losing too many data points, and the batch generated in the previous phase is used in order to simulate a seriously backlogged feed. (Since datasets were chosen to exceed memory size and thus eliminate cache effects, this probably has no real effect.) Again, only the component of the batching process which is directly affected by file system performance was used, in this case the batcher utility with input from the batch file and output to /dev/null.

The expire phase is a little trickier since file deletion is rolled into one program along with scanning and rebuilding the history file, at least in the C News version. Moreover, expire works by looking at article timestamps, which for this benchmark are totally meaningless. Instead of using expire, its actions were synthesized in simplified form.

First, the number of history entries was recorded after each unbatch phase. A simple awk script then uses this information to scan the history file, expiring articles from more than retention previous passes. The list of articles to be deleted is written to one file while the modified history file is written to another. (History entries are kept for the duration of the benchmark.) The new history file is then run thru dbz and moved into place.

To measure the file system component of the expiration process, the file containing the list of article files to be deleted is then fed into either dumbrm or sort|fastrm. Dumbrm is a simple C program which reads pathnames from stdin and deletes them, generating the same unlink() calls as expire itself would. Fastrm is the utility from INN[6], invoked with the same options used by INN's expire.

----------------
[6]INN 1.4sec from December 22, 1993.
----------------
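The synthesized expire phase, the do_expire step in the driver sketch above, might in turn look roughly like this. The file names (expire.awk, to.delete, history.new), the rebuild_dbz wrapper, and the empty FASTRM_ARGS placeholder are all hypothetical; only the overall shape - an awk pass over the history file, a dbz rebuild, and a timed dumbrm or sort|fastrm over the resulting list - follows the description above.

    # Sketch of the synthesized expire phase - not the actual benchmark code.
    pass=$1            # current benchmark pass number
    retention=6        # passes an article survives before being expired

    cd /usr/lib/news

    # expire.awk (hypothetical name) scans the history file, writing the
    # entries to keep to stdout and the pathnames of articles to delete to
    # the file "to.delete", based on pass counts rather than timestamps.
    # (The per-pass history entry counts recorded during unbatching are
    # assumed to be available to the script; that bookkeeping is omitted.)
    awk -f expire.awk pass=$pass retention=$retention history > history.new

    # Rebuild the dbz index for the new history file and move it into place
    # (rebuild_dbz is a hypothetical wrapper around the dbz rebuild).
    rebuild_dbz history.new
    mv history.new history
    mv history.new.pag history.pag
    mv history.new.dir history.dir

    # Timed portion: remove the expired article files, either one unlink()
    # at a time with dumbrm, or sorted and fed to INN's fastrm.
    cd /var/spool/news
    FASTRM_ARGS=""     # stand-in for the options INN's expire passes to fastrm
    time dumbrm < /usr/lib/news/to.delete
    # ...or, for the fastrm variant:
    # time sh -c "sort /usr/lib/news/to.delete | fastrm $FASTRM_ARGS"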
The sort is included in the timing to produce a fairer comparison to dumbrm; it's debatable whether or not this should have been included.

All the News That's Fit to Test

Collecting data to fuel the benchmark was accomplished by creating a fake uucp feed of all articles on SLAC's netnews server and capturing approximately half a week's batches from August, 1995. A total of 2759 batches were collected, containing 300,150 articles in 911 megabytes.

While this was good enough to generate interesting results, a set large enough to represent at least a week was desired. A script was written which reads an existing batch and modifies message ids in Message-ID and Supersedes headers and the target message id of cancel control messages. To do this, a delimiter and a clone number are appended to the host portion of each message id, a combination unlikely to conflict with any real message id. Byte counts were fixed and the clone batch written. With the original data and two clones, over 900,000 articles were available, approximating the expected feed for the first week of October, 1996.

The benchmark was configured to run 14 passes with 592 batches per pass (shortchanging pass 14 by 11 batches), with an expiration step of 6. This simulates a system running expire twice per day, with a three day retention period for all newsgroups. Most sites probably only run expire once per day, but the more frequent expirations encourage rapid fragmentation of the file system and directories, thus simulating a more mature system.

Testbed Configuration

The initial testbed used at SLAC consisted of a SPARCserver 2 and a NetApp 1400. While this setup produced sufficient results to make an unequivocal choice between the alternatives, the author's move to a job at Network Appliance offered the opportunity to run a more thorough set of tests on a wider variety of configurations. These results are the ones presented in this paper.

The primary test host used was an Axil 311, a SPARCserver 20 clone but with a CPU module that appeared to be equivalent to that in a SPARCserver 10/41. This system was equipped with 128MB of memory, a Sun Quad Ethernet interface card, and a Cisco CDDI interface card. Five 4GB Seagate ST15230N (Hawk) Fast SCSI disks were attached to the on-board SCSI bus, with the first used as the system disk (the internal disk was disconnected) and the other four used for /var/spool/news (the article spool file system(s)) or unused when testing against a filer. SunOS 4.1.4 was installed with a large (nearly 2GB) partition left on the system disk for /usr/lib/news (where the history file and news configuration files are stored).
C News binaries and the benchmark itself were also stored on this disk.

Two filers were used, both running the NetApp 3.1.4c Data ONTAP software release. The first was a NetApp 1400 configured identically to the one purchased by SLAC - 128MB, 2MB NVRAM, a single 10 megabit/second Ethernet interface, and seven 4GB Seagate ST15230N Fast SCSI disks. (This model is no longer offered; the current entry-level NetApp F220 is about twice as fast as the NetApp 1400 based on LADDIS results.) To explore the performance of a high-end filer which a large Internet Service Provider might prefer for greater performance and/or disk capacity, a NetApp F540 was also tested. This filer had 256MB, 8MB NVRAM, both 10/100 megabit/second Ethernet and CDDI interfaces, and seven 4GB Seagate ST15150W (Barracuda) Fast/Wide SCSI disks.

There was some debate about the effect of several filer options recommended by Network Appliance for netnews applications, as well as the value of FDDI versus a dedicated Ethernet, so another Axil was borrowed to run some abridged test runs against the filers while the first Axil was running various test runs involving only local disks. This second unit was an Axil 235, apparently a SPARCserver 10 clone with a SPARC 20 CPU module (!), with 64MB, a Cisco CDDI interface card, and a single 4GB Seagate ST15230N Fast SCSI disk that was cloned using dd from the first system after it had been configured but before OnLine: DiskSuite was installed.

Other than one of the filer software configuration options (no_atime_update), default values were used on the filers and on SunOS. In particular, extra inodes were not allocated on either the filers or on SunOS file systems during initialization, nor was the Snapshot feature of the filers [14] disabled. The default inodes value for newfs seemed to be adequate. The filers would need more inodes in practice, but since the number of inodes can be increased on-the-fly there's no need to add more until one has a better idea of how many will be needed. The first impulse would be to disable Snapshots entirely, but having at least one hourly Snapshot might be handy for recovering from the occasional slip of the fingers as root. Since the cost of creating a Snapshot is inconsequential, there's no performance reason to disable them, only the need to recover disk space fairly quickly after expire deletes articles.

The Ethernet consisted of a crossover (hub-to-hub) cable for the Ethernet tests, and an isolated DEC CDDI concentrator for the FDDI tests. Another machine was also attached to the FDDI network to provide a repository for the test batches and for test results. (No access to this machine took place during instrumented portions of the benchmark.)

-------------------------------------------------------------------------------
  +------+------+---------------------------------------+
  | disk | type | hierarchy                             |
  +------+------+---------------------------------------+
  |  1   | mnt  | alt (alt.binaries linked to disk 4)   |
  |  2   | link | rec,soc,talk,de                       |
  |  3   | link | comp,misc,sci,news,bit,gnu,vmsnet     |
  |  4   | dir  | everything else (incl. alt.binaries)  |
  +------+------+---------------------------------------+

Figure 1: Newsgroup allocation for symlinked local disks and method of
attaching hierarchy to main spool area.
-------------------------------------------------------------------------------

The benchmark runs using the four local disks for /var/spool/news were done with three different configurations.
The first was run without OnLine: DiskSuite and used a 2GB partition on each disk[7], with space parceled out by hand via symlinks (and a mount for alt) as detailed in Figure 1.

----------------
[7]2GB is the largest partition supported by SunOS 4.1.4 without using OnLine: DiskSuite.
----------------

For the next test, OnLine: DiskSuite was used to create a 16GB striped partition using an interlace factor of 16KB.[8]

----------------
[8]The default for OnLine: DiskSuite is the size of a cylinder on the first disk, which seemed inappropriate for a modern SCSI disk for which all cylinders might not be the same size. The 16KB value was borrowed from the default in Solstice DiskSuite 4.0.
----------------

The third configuration used two concatenated disks, with the second pair of disks mirroring the first pair, providing an 8GB file system.
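For concreteness, the symlinked arrangement of Figure 1 might be put together roughly as follows. The device names, mount points, and the location chosen for alt.binaries are illustrative rather than the actual testbed commands, and the newfs of each 2GB partition is omitted.

    # Sketch of the symlinked configuration in Figure 1 (illustrative names).
    # Disk 4 holds the spool directory itself ("everything else").
    mount /dev/sd4c /var/spool/news

    # Disk 1 is mounted as the alt hierarchy.
    mkdir /var/spool/news/alt
    mount /dev/sd1c /var/spool/news/alt

    # Disks 2 and 3 are mounted elsewhere and attached via symlinks.
    mkdir /news2 /news3
    mount /dev/sd2c /news2
    mount /dev/sd3c /news3
    for g in rec soc talk de; do
        mkdir /news2/$g
        ln -s /news2/$g /var/spool/news/$g
    done
    for g in comp misc sci news bit gnu vmsnet; do
        mkdir /news3/$g
        ln -s /news3/$g /var/spool/news/$g
    done

    # alt.binaries is pointed back at disk 4 to balance space.
    mkdir /var/spool/news/alt.binaries
    ln -s /var/spool/news/alt.binaries /var/spool/news/alt/binaries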
Toaster or Local Disk?

The results of the benchmark runs were dramatic. Because of the added operational flexibility of the toaster, we would have been happy if its performance was comparable to local disks. In fact, as Figure 2 shows, the 1400 was nearly four times as fast, on average, as the best configuration of local disks on the unbatch tests. The F540 was over five times as fast! Equally surprising was the poor performance of the OnLine: DiskSuite configurations. The mirrored concatenation arrangement was barely 50% faster than the absolute minimum of three articles per second required by the end of 1997, precious little headroom for catching up, much less capacity for future growth.

-------------------------------------------------------------------------------
[Figure 2 (bar chart) omitted: configurations compared are the NetApp F540
(FDDI), NetApp 1400 (Ethernet), symlinked local disks, DiskSuite stripe, and
DiskSuite mirrored concatenation.]

Figure 2: Average number of articles unbatched per second over full benchmark
run for various configurations.
-------------------------------------------------------------------------------

The only result that didn't come as a big surprise was that striping was faster than concatenation (with mirroring), by about 33%. The conjecture that concatenation might cause one disk (or mirrored pair) to be filled before moving on to the next was not borne out, however. During the runs, visual observation of the disk activity lights and monitoring via the iostat utility both indicated that the data was being spread between the two disks even though neither was near capacity.

Looking at the data in more detail, some performance degradation can be seen in Figure 3 as the file system fills and ages. In pass 14, the NetApp F540 and the symlinked local disks both retain 86.5% of their performance from pass 1; the NetApp 1400 lost a bit more (down to 85%) but the difference is probably not significant. Pass 8 is the first pass after expiring articles (only articles more than six passes old are expired, so the expire phase at the end of pass 7 is the first to actually do anything), which presumably has something to do with the surge in filer performance. The local disk test shows a similar though less pronounced effect, with 7.1% better performance in pass 8 than in pass 7, compared to 12.6% for the NetApp 1400 and 14.8% for the NetApp F540. This may be an artifact of the test data, but no examination has been done to determine if there is anything unusual about the articles processed in pass 8.

-------------------------------------------------------------------------------
[Figure 3 (line chart) omitted: articles unbatched per second by benchmark
pass, 1 through 14, for the NetApp F540, NetApp 1400, and symlinked local
disks.]

Figure 3: Articles unbatched per second.
-------------------------------------------------------------------------------

The performance advantage of the filers over local disks is even more dramatic for expiration. As seen in Figure 4, the F540 is over an order of magnitude faster than local disk and the 1400 is a respectable 8.5 times as fast. Data for the OnLine: DiskSuite configurations is not shown because each pass was painfully slow and getting slower. With limited equipment time available, the mirrored concatenation test was stopped after six passes, which were sufficient to show that it was significantly slower than the symlinked arrangement, while the stripe run was interrupted after five complete passes by a power failure.
-------------------------------------------------------------------------------
[Figure 4 (bar chart) omitted: configurations compared are the NetApp F540
(FDDI), NetApp 1400 (Ethernet), and symlinked local disks.]

Figure 4: Average time in minutes required for delete portion of expire.
-------------------------------------------------------------------------------

All of the tests in this set used dumbrm rather than INN's fastrm. This made little difference for the filers, as detailed later in this paper. Presumably expiration on local disks would have benefited greatly from fastrm, but the unbatch numbers had already convincingly shown that local disks were marginal at best for the job, so it was felt that the several days of test time needed to complete a fastrm run would not be productive.

The final set of comparisons are those for batching outgoing news feeds, shown in Figure 5. While the difference is not as dramatic as it is for the other parts of the benchmark, the filers are still faster than local disks. In addition, the local disk results appear to be getting slower as the file system ages, whereas the filers suffer much less performance degradation.

-------------------------------------------------------------------------------
[Figure 5 (line chart) omitted: batch time by benchmark pass (3, 6, 9, and 12)
for the NetApp F540, NetApp 1400, and symlinked local disks.]

Figure 5: Average batch time in minutes.
-------------------------------------------------------------------------------

Light or Dark Toast?

Prior to the final set of filer runs described above, another set of tests was run to evaluate the effects of several alternative filer and networking options. Network Appliance recommended that the following filer options be changed from the defaults for netnews applications:

    options no_atime_update on
    options minra on

The first option causes the filer to not update access times on files. For netnews, there's no apparent value in maintaining file access times, and not updating them saves the expense of writing the updates to disk.

The case for the second option, which causes the filer to refrain from aggressive read ahead (in anticipation of sequential access to an entire file), is less clear. Assuming all user access is via NNTP [15], the only apparent cases in which one would start reading part of an article and not read to the end-of-file would be using NNTP's HEAD command or if the connection is broken during an ARTICLE or BODY command. With contemporary newsreaders using the XOVER command and the NOV database to access most data formerly obtained using HEAD, it's not clear what value turning off read ahead provides. The only plausible justification is that Internet Service Providers who might have large numbers of customers accessing large binary postings, via comparatively slow modem links, might end up wasting filer memory, and perhaps causing thrashing, because too much data is being cached too far in advance of its being needed.

Besides studying these options, we wondered how much benefit would be derived from the lower latency of FDDI, the benefit of fastrm on the filer, and whether or not it would be advantageous to also place the history file on the filer.

A series of benchmark runs were done using the NetApp F540 and the second Axil, described above, to compare these alternatives. Since there were a number of different runs to perform, they were abbreviated to only ten passes.
This allowed three batch samples and three post-expire unbatch samples, which, it was felt, would provide enough data to draw reasonable conclusions without taking an inordinate amount of time.

The first pairing was FDDI versus Ethernet. Using FDDI, the unbatch tests ran in an average of 90.8% of the time needed with Ethernet. Batching was even faster, taking only 88% of the Ethernet time. Expire was marginally slower over FDDI, but the difference is probably statistically insignificant. The remainder of the tests with the NetApp F540 were done using FDDI. (Tests of the NetApp 1400 were done using Ethernet because it did not have an FDDI interface.)

The next set of tests individually compared the two recommended options against the baseline FDDI test. Neither had any significant effect on unbatching, which was unsurprising since both options influence reads, and unbatching does little reading from the article spool. For batching, minra had little effect (it was expected that it would hurt) but not updating access times saved about 5% of the baseline FDDI time. As expected, expire saw no benefit from not updating access times, but seemed to be slightly faster with minra. It's not clear why this option would have any effect on file deletion. Since only no_atime_update was clearly beneficial it was the only option used in the remainder of the tests.

-------------------------------------------------------------------------------
[Figure 6 (line chart) omitted: articles unbatched per second by benchmark
pass, 1 through 10, with the history file on local disk versus on the filer.]

Figure 6: Average number of articles unbatched per second with the history
file on the filer instead of local disk, with articles stored on the filer in
both cases.
-------------------------------------------------------------------------------

The next test showed that expire ran several percent faster using fastrm. However, since we did not have fastrm data for the local disk cases in the primary testbed, the full benchmarks run against the filers were done using the marginally suboptimal (for filers) dumbrm.

The last of these tests produced the most interesting results. With the history file and other contents of /usr/lib/news on the filer along with the article spool, batching was about 2% slower and expire about 3% slower than only having the article spool on the filer. No difference would have been expected for expire since the timed portion of expire does not access anything in /usr/lib/news. Presumably the extra filer activity before this step pushed some data out of cache, slowing expire. The big difference came during the unbatch tests. As illustrated in Figure 6, putting the history file on the filer produced unbatch times which started off taking 22% longer, progressing to as much as 150% more time as the system aged.

Future Research

One of the implementors of the log-structured file system (LFS) in BSD 4.4 [16, 17] suggested that netnews would be a good application for LFS, since a log-structured file system is designed to optimize writing to disk. This is how the critical unbatch part of netnews processing spends much of its time. The success of the Network Appliance filers is consistent with this conjecture since their Write Anywhere File Layout (WAFL) [14], while not a log-structured design, similarly optimizes writes to minimize head seeks.
Using a BSD 4.4 system to compare an LFS-based article spool file system to an FFS-based equivalent would be interesting, though a filer would still be expected to offer better performance due to the large directory support, if nothing else.

The b+ tree data structures used for directories in the Windows NT file system [18], which allow it to perform quick file lookups in large directories, make NTFS seem appealing for netnews. Other aspects of NTFS appear to incur more overhead than would be desired for an application as demanding as netnews. Since Network Appliance's new Multiprotocol Filer software features native support for CIFS, the equivalent of NFS in the Windows networking world, and several netnews implementations are available for Windows NT, another netnews comparison project is likely in the author's future.

Conclusions

    Then let us praise the brave appliance
    In which we place this just reliance [19]

The surprisingly fast performance of the Network Appliance filer in the netnews benchmarks made the decision regarding SLAC's netnews server obvious. More important, though, the results served as a strong reminder to avoid preconceptions. Benchmarks can produce surprising results, which presumably is why many people run them in the first place. Finally, NFS, despite its age and weaknesses, can still do a remarkably good job.

Availability

The benchmark tools may be made available if there is interest. Please contact the author via e-mail at kls@netapp.com for more information.

Acknowledgments

Special thanks to Mom and to my wife, Krissie. Mom was always encouraging even when she had no idea what I was really working on. I'll miss her. Krissie has been very understanding and supportive of my long work hours during what is supposed to be our honeymoon year.

Others who helped in various ways include Mark Barnett, George Berg, Chuck Boeheim, Bob Cook, Renata Dart, Rosemary Dinelli, Walt Disney Co., Guy Harris, Dave Hitz, Moana Kutsche, Randy Melen, Lincoln Myers, Rob Salmon, Arnie Thompson, Andy Watson, Bebo White, and others whose contribution is not diminished by my failure to remember them here. My gratitude goes to all of them.

Thanks, too, to Alexander for his patience. Still no more skunks, but he's on Bath Row anyway.

Author Information

Karl Swartz was Team Leader of the System Administration Team in SLAC Computing Services at the Stanford Linear Accelerator Center when this work was started. He was so impressed by the toaster's performance that he joined Network Appliance as a Technical Marketing Engineer. Prior to SLAC, he worked at the Los Alamos National Laboratory on computer security and nuclear materials accounting, and in Pittsburgh at Formtek, a start-up now owned by Lockheed-Martin, on vector and raster CAD systems. He attended the University of Oregon where he studied computer science and economics.

Between work and a new wife, Karl hasn't been on the racetrack in far too long, but he does find time to moderate a newsgroup (sci.aeronautics.airliners) and to enjoy good food and good beer and trips to the beach with his wife, Krissie, and their Golden Retriever, Alexander. Krissie would be very upset if either of them castrated or slaughtered cattle. E-mail Karl at kls@chicago.com or kls@netapp.com.

References

1. Karl L. Swartz, ``Forecasting Disk Resource Requirements for a Usenet Server,'' Proceedings of the 7th USENIX Large Installation System Administration Conference (LISA VII), pp. 101-108, Monterey, California, November 1993.
Also published as SLAC-PUB-6353.

2. Karl L. Swartz, ``Usenet Growth Graphs,'' https://www.chicago.com/~kls/news-growth.html.

3. Rick Adams, Usenet post, c. September 1993.

4. OnLine: DiskSuite Reference Manual, Sun Microsystems, Mountain View, California, 1991.

5. D. Patterson, G. Gibson, and R. Katz, ``A Case for Redundant Arrays of Inexpensive Disks (RAID),'' ACM SIGMOD 88, pp. 109-116, Chicago, June 1988.

6. Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry, ``A Fast File System for UNIX,'' in 4.4BSD System Manager's Manual, O'Reilly & Associates, Sebastopol, California, April 1994.

7. V. Jacobson, ``Compressing TCP/IP Headers for Low-Speed Serial Links,'' RFC 1144, February 1990. Footnote 29.

8. Solstice DiskSuite 4.0 Administration Guide, Sun Microsystems, Mountain View, California, March 1995.

9. Byron Rakitzis and Andy Watson, Accelerated Performance for Large Directories, Technical Report 3006, Network Appliance, Mountain View, California, February 1996.

10. Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, Reading, Massachusetts, 1989.

11. Mark Wittle and Bruce E. Keith, ``LADDIS: The Next Generation in NFS File Server Benchmarking,'' Proceedings of the 1993 Summer USENIX Technical Conference, pp. 111-128, Cincinnati, Ohio, June 1993.

12. Rich Salz, ``InterNetNews: Usenet transport for Internet sites,'' Proceedings of the 1992 Summer USENIX Technical Conference, pp. 93-98, San Antonio, Texas, June 1992.

13. David Lawrence, private conversation, January 1996.

14. Dave Hitz, James Lau, and Michael Malcolm, ``File System Design for an NFS File Server Appliance,'' Proceedings of the 1994 Winter USENIX Technical Conference, pp. 235-245, San Francisco, January 1994. Also published as Network Appliance Technical Report 3002.

15. Brian Kantor and Phil Lapsley, ``Network News Transfer Protocol,'' RFC 977, February 1986.

16. Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin, ``An Implementation of a Log-Structured File System for UNIX,'' Proceedings of the 1993 Winter USENIX Technical Conference, pp. 307-326, San Diego, California, January 1993.

17. Margo Seltzer, Keith A. Smith, Hari Balakrishnan, Jacqueline Chang, Sara McMains, and Venkata Padmanabhan, ``File System Logging versus Clustering: A Performance Comparison,'' Proceedings of the 1995 Winter USENIX Technical Conference, pp. 249-264, New Orleans, Louisiana, January 1995.

18. Helen Custer, Inside the Windows NT File System, Microsoft Press, Redmond, Washington, 1994.

19. Thomas M. Disch, The Brave Little Toaster, Doubleday & Company, Garden City, New York, 1986. The full-length animated movie of the same title, based on this novella, inspired the title of this paper.