the ABCs of TPCs and NT scalability, II

by Neil Gunther
<ngunther@ricochet.net>

Neil Gunther is founder and principal consultant for Performance Dynamics Company in Mountain View, CA. Dr. Gunther has worked in the Silicon Valley for 18 years. He is a member of IEEE, ACM, and CMG.

In the special ;login: issue on Windows NT [1], I promised to delve more into my concerns about the comparisons of UNIX and NT scalability that were presented at the USENIX-NT Workshop last August. In this second article, I want to start with the data presented in Figure 1, which purported to show the superiority of NT over UNIX scalability on the common basis of the TPC-C benchmark workload.

Figure 1. Microsoft version of NT vs. UNIX

Before doing so, however, I have to assume that most readers are not familiar with the TPC approach to database benchmarking. Unfortunately, there is not enough space to go into great detail about this complex measurement process, so I can provide only the briefest of sketches. The interested reader can find specifics at <www.tpc.org>.

TPC Road Rules

Unlike many computer benchmarks (e.g., Dhrystone, Linpack, SPEC), TPC benchmarks do not exist as code that you purchase or download. Rather, TPC provides a (downloadable) benchmarking specification document. Anyone wishing to run the benchmark is free to implement the specification in any way he or she sees fit. You are not free, however, to interpret the TPC rules as you please. In order to report an official TPC result, you must write a corresponding full disclosure report that itemizes how you met each one of the clauses in the TPC specification. In addition, the benchmark runs that produced the result you wish to report must be witnessed and reviewed by an official TPC auditor at runtime. Your disclosure report is also reviewed by members of the TPC council. Any discrepancies that cannot be satisfactorily explained may lead to the result being withdrawn. In other words, TPC benchmarks are a serious and expensive undertaking that comes with a high degree of credibility. Any attempt to cut corners is likely to be spotted and dealt with accordingly.

The TPC Performance Race

Currently, there are two TPC benchmarks: TPC-C (for benchmarking online database transaction processing: AKA OLTP systems) and TPC-D (for benchmarking decision support systems: AKA DSS). The TPC-A and TPC-B benchmarks have been retired for two major reasons. First, these workloads corresponded to a relatively simple debit/credit banking transaction. Second, removal stops ongoing attempts to exploit any loopholes in those benchmark designs. Moreover, both the TPC-A and TPC-B were directed solely at OLTP performance. TPC-C is a more complex OLTP benchmark that uses a heterogeneous mix of five transactions accessing a database that models inventory control in a distributed warehouse. TPC-D is the first TPC benchmark to be directed at multi-user, large-scale, query-intensive systems.

Rather than get bogged down in technical details, I've chosen to highlight the difference between TPC-C and TPC-D using the following whimsical analogy with automobile sporting events.

TPC-C Indianapolis 500

TPC-C is the Indy 500 of database benchmarking. In the real Indy event, 35 vehicles race around a 2.5-mile circuit and the first car over the finish line on the 200th lap is declared the winner. The sporting focus is on the performance of individual vehicles as measured by their top speeds in miles per hour.

In the TPC-C benchmark, the database transactions are analogous to the Indy race cars, but the performance focus is shifted away from the cars and onto the racetrack itself. For example, a wet track is slower than a dry one. The performance of the track could be measured by the number of cars per minute the track can support over the 200 loops of the Indy 500 race. It is a measure of the raceway's carrying capacity. Technically, this would be accomplished by counting the number of cars that cross the same place (e.g., the starting line) every five minutes (roughly the time it takes a car to make one loop of the raceway) and averaging those counts over the duration of the entire race. Under TPC road rules, any car taking longer than five minutes would not be counted as part of the track's capacity. In the TPC version of the Indy 500, there is another rule that all cars must make at least one simultaneous pit stop (corresponding to a database checkpoint) and then continue again.

In practice, when the green flag falls, all the cars take some time to maneuver into position and get up to top speed. In the TPC-C benchmark, this corresponds to the ramp-up period necessary to get the database cache warmed up and the system operating in steady state. This ramp-up period is not included in the performance results. In the real TPC-C benchmark, transactions committed every half minute or so are counted and used to determine the average throughput measured as transactions per minute (or tpmC) over the entire benchmark run. Any transaction that does not commit within the two-second response-time limit is not counted.
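The counting rule just described can be sketched in a few lines of Python. This is a minimal illustration of the simplified description in this article, not of the TPC-C specification itself; the `Txn` record, the function name `tpm`, and all of the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    commit_time_s: float    # seconds since the start of the run
    response_time_s: float  # seconds from submission to commit


def tpm(txns, ramp_up_s, run_end_s, max_response_s):
    """Average transactions per minute over the steady-state window.

    Transactions committed during ramp-up, or slower than the
    response-time limit, are left out of the count.
    """
    window_min = (run_end_s - ramp_up_s) / 60.0
    if window_min <= 0:
        raise ValueError("measurement window must be positive")
    counted = sum(
        1
        for t in txns
        if ramp_up_s <= t.commit_time_s < run_end_s
        and t.response_time_s <= max_response_s
    )
    return counted / window_min


# Hypothetical run: 60 s of ramp-up, then a 120 s measurement window.
sample = [
    Txn(10.0, 1.0),   # committed during ramp-up: not counted
    Txn(70.0, 1.0),   # counted
    Txn(100.0, 3.0),  # too slow: not counted
    Txn(150.0, 0.5),  # counted
    Txn(179.0, 2.0),  # exactly at the limit: counted
]
print(tpm(sample, ramp_up_s=60.0, run_end_s=180.0, max_response_s=2.0))  # 1.5
```

Three transactions qualify over a two-minute window, giving 1.5 transactions per minute.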

That Transparency Thing

Furthermore, suppose you wanted to assess the Indy track capability on a worldwide basis (e.g., tracks in the US, Australia, Canada, and Britain). This would be a way to compare Indy racing with other kinds of races (e.g., NASCAR racing). The worldwide Indy performance would be given as the sum of the performance of each Indy raceway.

If you raced only US cars on the US track, Australian cars on the Australian track, and so on, you would be unintentionally optimizing the measurement. The TPC-C version of measuring this worldwide Indy performance does not permit such an optimization. Instead, you must also run some US cars on the Australian raceway, some Australian cars on the British track, and every other permutation in between. Moreover, which car runs on which track must be determined by drawing track-car pairs out of a hat. In other words, you are not allowed to bias the results by knowing beforehand which car will race on which track. The selection process is then said to be unbiased or transparent.

Similarly, in the real TPC-C benchmark, you can have four servers with four separate database instances, but TPC-C does not permit you to confine transactions to each database separately and then add the separate throughputs together to give the total capacity. Transactions must be distributed in such a way that any transaction can access any of the four databases without knowing ahead of time which database it will run against. This adds realism to the benchmark. But transparency can also introduce some performance degradation due to the longer code paths needed to distribute the transactions.
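To make the contrast concrete, here is a toy Python sketch of the two routing schemes. It assumes four databases and a uniform random draw (the "out of a hat" rule from the analogy); the actual mix of local and remote accesses is fixed by the TPC-C specification and is not reproduced here. All function names are hypothetical.

```python
import random


def route_partitioned(home_dbs):
    """Non-transparent setup: every transaction stays on its home database."""
    return list(home_dbs)


def route_transparent(home_dbs, n_databases, rng):
    """Transparent setup: the target database is drawn at random, so no
    transaction knows ahead of time where it will run."""
    return [rng.randrange(n_databases) for _ in home_dbs]


def remote_fraction(home_dbs, targets):
    """Fraction of transactions that run on a database other than their home."""
    remote = sum(1 for home, target in zip(home_dbs, targets) if home != target)
    return remote / len(home_dbs)


rng = random.Random(1998)                # fixed seed so the run is repeatable
homes = [i % 4 for i in range(10_000)]   # home database of each transaction

print(remote_fraction(homes, route_partitioned(homes)))          # 0.0
print(remote_fraction(homes, route_transparent(homes, 4, rng)))  # near 3/4 for a uniform draw
```

In the partitioned case no transaction ever crosses servers, which is why simply adding the throughputs looks so good. Under the uniform draw assumed here, roughly three in four transactions land on a database other than their home, and each of those pays for the longer distributed code path.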

Clearly, it would be much simpler to ignore this transparency requirement and just add up the throughputs of more and more independent servers. That is an easy (but unrealistic) way to generate a big throughput number without any distribution overhead. That's precisely what Microsoft did; but because it violates TPC-C road rules, they could not report it as a bona fide TPC-C result. It would never have gotten past the TPC auditor. Gray claimed this was just a "technicality." [2] Now you can decide. On top of this failure, they didn't use TPC-C transactions either, contrary to the statement in [3]. What did they use? We'll never know because, not running a TPC benchmark, they were not subject to the disclosure rule. Gray used the term "debit-credit transaction," which suggests some kind of banking transaction, but we don't know that. Instead of saying Microsoft did 1 billion transactions per day, I'd prefer to call it 1 billion diddleysquats per day just to remind myself that the entire Microsoft claim is beFUDdled.

TPC-D Monster Tractor-Pull

In contrast to the TPC-C Indy 500 race, TPC-D is more like a monster tractor pull. In the TPC-D version of the tractor pull, there are 17 vehicles of different weights that the tractor must tow across the arena to complete the competition. For each tow, the elapsed time to get across the arena is measured and used to construct an overall towing capacity for the tractor. There's no constraint on how long it takes to pull all 17 vehicles because only the elapsed time for each pull is measured. In the real TPC-D benchmark, the key performance metric is the time taken to execute each of the 17 queries. Gray did not discuss TPC-D results for SQLServer because there aren't any for SQLServer. You can check for yourself at <http://www.tpc.org/execsum_TPCD.html>. But there are many TPC-D results on UNIX.
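As a rough sketch of how 17 per-query elapsed times might be folded into one figure, the Python below uses a geometric mean. The article does not say how the times are combined, so the choice of a geometric mean is an assumption of this sketch (it keeps one very long query from dominating the summary). The timings are made up for illustration.

```python
import math


def geometric_mean(times_s):
    """Geometric mean of a list of positive elapsed times, in seconds."""
    if not times_s or any(t <= 0 for t in times_s):
        raise ValueError("need at least one strictly positive elapsed time")
    return math.exp(sum(math.log(t) for t in times_s) / len(times_s))


# Hypothetical elapsed times (seconds) for the 17 queries of one run.
elapsed_s = [12.0, 340.0, 55.0, 8.5, 120.0, 17.0, 600.0, 45.0, 230.0,
             31.0, 9.0, 75.0, 410.0, 22.0, 150.0, 64.0, 980.0]
assert len(elapsed_s) == 17

print(f"longest single query: {max(elapsed_s):.1f} s")
print(f"geometric mean:       {geometric_mean(elapsed_s):.1f} s")
```

A smaller summary time means a stronger tractor. Note that nothing in the calculation caps how long any single pull may take.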

Sensible Scalability Comparisons

Figure 2 shows a comparison of TPC-C results across a wide variety of results published in 1997. The most notable difference from Figure 1 is that there are no curves. That is because these are all different platforms running various flavors of UNIX, different RDBMSs, on different hardware. Using curves (as in Figure 1) would erroneously suggest that certain data belong to the same family, when they do not. Recall what I said about the performance analyst's cardinal rule [1]: only change one thing at a time!

There are four CPU categories shown in Figure 2: uniprocessor, two-way, four-way, and six-way multiprocessors. In each CPU category, the UNIX results are grouped to the left while the NT results are grouped to the right. I've selected official TPC-C UNIX and NT results for all of 1997 to give some reasonable definition to my requirement [1] that the data be in some sense contemporaneous. The selected servers have between one and six processors to conform to the range where NT actually tries to compete with UNIX servers.

Things look a little less impressive for NT than in Gray's benchmarketing presentation [2]. First, note that there is considerable variance within each UNIX group. This is to be expected because (unlike NT) there is no single UNIX, and the data in Figure 2 include Oracle and Sybase running on various UNIX platforms. Typically, CPUs with larger second-level caches produce higher throughput because they can accommodate a larger RDBMS footprint.

Figure 2. 1997 UNIXs vs. 1997 NTs

Second, there is far less variation within each NT group. This is to be expected when there is only one RDBMS (viz., SQLServer) tuned to run on relatively few Intel-based architectures. Note also that for the six-way configurations, the best UNIX result (HP/Sybase) has more than twice the throughput performance of the NT system, and the next best UNIX result (Sun/Sybase) is more than 30% better than NT. This demolishes Gray's point based on Figure 1 that one needs to go to a more expensive 12-way UNIX system just to match a six-way NT in throughput. How did I arrive at a different conclusion? I didn't bias the data by handpicking aged Solaris/Sybase TPC-C results for making NT comparisons.

Table 1 summarizes the various platform combinations that have been reported. Sequent has announced a parallel query result on a four-node NUMA-Q 2000 cluster with dual-quad CPUs (32 total CPUs). This is not a TPC-D result, however. Also, Compaq has an official TPC-C result with Oracle on NT (third column in Figure 2). There are no SQLServer results on UNIX that I know of.

Table 1: Database and platform combinations

RDBMS \ OS NT Solaris UNIX

SQLServer **?

Sybase ***

Others (a) **

Price-Performance Comparisons

We can use the disclosed price of the TPC benchmark platform expressed as $/tpmC to make the comparisons shown in Figure 3. When it comes to price-performance, Microsoft does indeed have the drop on UNIX, especially at the low end. But it's not so dramatic for larger CPU configurations. In case you're wondering, the expensive outlier in the two-way class is a Fujitsu UNIX box.
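The $/tpmC figure is simple arithmetic: the disclosed system price divided by the reported throughput. A minimal Python sketch follows; the system names, prices, and throughputs are invented for illustration and are not taken from any TPC disclosure.

```python
def price_performance(total_price_usd, tpmc):
    """Dollars per tpmC: disclosed system price divided by throughput."""
    if tpmc <= 0:
        raise ValueError("throughput must be positive")
    return total_price_usd / tpmc


# Hypothetical results: (disclosed price in USD, throughput in tpmC).
systems = {
    "hypothetical 4-way NT box": (400_000, 8_000),
    "hypothetical 4-way UNIX box": (900_000, 10_000),
}

# List the cheapest throughput first.
for name, (price, tpmc) in sorted(
    systems.items(), key=lambda item: price_performance(*item[1])
):
    print(f"{name}: {price_performance(price, tpmc):.2f} $/tpmC")
```

With these made-up numbers the UNIX box delivers more throughput, yet the NT box wins on $/tpmC (50 versus 90). That is the pattern described above: a lower price can outweigh a lower tpmC.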

Open system hardware is generally cheaper than mainframes, so how can Wintel pricing beat UNIX so convincingly? One way of looking at this is to recognize that history is simply repeating itself. Over the last 20 years, UNIX workstations and multiprocessors have eroded the profit margins that were sacred to selling mainframe big iron. This occurred because UNIX boxes were cheaper to build and became more ubiquitous than centralized mainframes. At the outset, they could not compete with mainframe performance, but gradually, that changed as UNIX systems scaled up.

Figure 3. Price-Performance Comparisons

Over the last ten years, PCs have become more ubiquitous than UNIX servers. They represent real commodity computers. At the outset, they could not compete with UNIX workstation or multiprocessor performance, but gradually, that is changing as PC-based systems scale up. In other words, the PC shall do unto UNIX servers what UNIX servers hath done to mainframes.

Next time, I'll consider the factors that determine hardware scalability.

Acknowledgments

I am grateful to Kim Shanley (TPC CEO), Francois Raab (TPC auditor), and Mike Brey (Oracle) for various technical discussions.

Notes

[1] N.J. Gunther, "NT to the Max. . .(NoT)," ;login: November 1997, pp. 9-11.

[2] J. Gray, "Windows NT to the Max." Original presentation slides are online.

[3] B. Dewey, Keynote address. Summarized by Brian Dewey, ;login: November 1997, pp. 32-33.

First posted: 4th February 1998 efc
Last changed: 4th February 1998 efc