On Mindcraft's April 1999 Benchmark

"First they ignore you, then they laugh at you, then they fight you, then you win." -- Gandhi (?)

Executive summary: The Mindcraft benchmark proved to be a wake-up call, and the Linux community responded effectively. Several problems which caused Apache to run slowly on Linux were found and resolved.

As of mid-1999, Linux/Apache performance was identical to NT/IIS performance under light load, and respectable under heavy load.
As of May 2000, performance on Mindcraft-like benchmarks may finally be substantially improved on 4 CPU tests.
As of February 2001, performance on the SAP database benchmark on a 4 CPU machine is dramatically better with 2.4 compared to 2.2.
In January 2002, Ingo Molnar's O(1) scheduler was added to the 2.5.2-pre10 kernel. This should help Linux scale better to many processors and many running threads. (The new scheduler will probably not be included by default in Linux distributions until 2003, but it is already available as a patch to the 2.4.17 kernel for the few people who might need it.)
By early 2003, the NGPT and NPTL projects have succeeded in bringing high-performance POSIX-compliant threading to Linux. The first commercial distribution to include NPTL will likely be Red Hat 8.1 in April 2003.
Graphs showing remarkable progress in performance of 8-CPU SMP Linux systems are online at www-106.ibm.com/developerworks/linux/library/l-kperf.
In May 2003, Microsoft released file serving benchmarks showing Windows 2003 outperforming Linux. Here we go again!

I've split the summary of recent benchmark results (including the Mindcraft benchmarks) off into a separate page, "NT vs. Linux Server Benchmark Graphs", at the request of readers who liked the data but weren't interested in the history of the Linux kernel hacking community's response to the benchmarks.

Introduction
Previous Benchmarks
Kernel issue #1: TCP bug
Kernel issue #2: Wake-One vs. the Thundering Herd Updated 3 Oct 2000
Kernel issue #3: SMP Bottlenecks in 2.2 Kernel
Other Apache users getting help solving performance problems
Kernel issue #4: Interrupt Bottlenecks Updated 2 March 2000
Kernel issue #5: Mysterious network slowdown
Kernel issue #6: 2.2.x/NT TCP slowdown
Kernel issue #7: Scheduler Updated 27 Jan 2000
Kernel issue #8: SMP Bottlenecks in 2.4 kernel Updated 19 May 2000
Kernel issue #9: csum_partial_copy_generic Updated 17 Sept 2000
Kernel issue #10: getname(), poll() optimizations
Kernel issue #11: Reducing lock contention, poll overhead in 2.4 4 June 2000
Kernel issue #12: Poor disk seek behavior in 2.2, new elevator code in 2.4 Updated 4 Oct 2000
Kernel issue #13: Fast Forwarding / Hardware Flow Control 18 Sept 2000
Kernel tuning issue: hitting TIME_WAIT 31 March 2000
Suggestions for future benchmarks
Related reading

Introduction

In March 1999, Microsoft commissioned Mindcraft to carry out a comparison between NT and Linux; Mindcraft's results showed NT to be 2 to 3 times faster than Linux. This provoked responses from Apacheweek.org, Eric Lee Green, Jeremy Allison, and Linux Weekly News, among others. Microsoft posted a rebuttal to the responses (evidently Microsoft takes this very seriously).

The responses generally claimed that Mindcraft had not configured Linux properly, and gave specific examples. Both Mindcraft and the Linux community agree that good tuning information for Linux is hard to find.

Why the Outcry?

The Linux community responded to Mindcraft's announcements with hostility, at least partly because of Mindcraft's attitude. Mindcraft stated "We posted notices on various Linux and Apache newsgroups and received no relevant responses." and concluded that the Linux 2.2 kernel wasn't as well supported as NT.

Mindcraft did not seem to take the time to become familiar with all the appropriate forums for discussion, and apparently did not respond to requests for further information (see section III of Eric Green's response). Others have had better success; in particular, three key kernel improvements all came about in the course of normal support activities on Usenet, the linux-kernel mailing list, and the Apache bug tracking database. I believe the cases illustrated below indicate that free 2.2.x kernel support is better than Mindcraft concluded.

Also, Mindcraft's April 13th press release neglected to mention that Microsoft sponsored the Mindcraft benchmarks, that the tests were carried out at a Microsoft lab, and that Mindcraft had access to the highest level of NT support imaginable.

Finally, Mindcraft did not try to purchase a support contract for Linux from e.g. LinuxCare or Red Hat, both of whom were offering commercial support at the time of Mindcraft's tests.

Mindcraft proposed repeating the benchmarks at an independent testing lab to address concerns that their testing was biased, but they have not yet addressed concerns that their conclusions about Linux kernel support are biased.

Truth or FUD?

Mindcraft probably did tune NT well and Linux poorly -- but rather than assume this fully accounts for Linux's poor showing, let's look for other things that could have contributed. I'm going to focus on the web tests, since that's what I'm familiar with.

Previous Benchmarks

Although Apache was designed for flexibility and correctness rather than raw performance, it has done quite well in benchmarks in the past. In fact, Ziff-Davis's January 1999 benchmarks showed that "Linux with Apache beats NT 4.0 with IIS, hands down". (Also, Apache currently beats Microsoft IIS at the single processor SPECWeb96 benchmark, but this is with special caching software.)

Yet Mindcraft found that Apache's performance falls off dramatically when there are more than 160 clients (see graph). Is this a contradiction?

Not really. The benchmarks done by Jef Poskanzer, the author of the high-performance server 'thttpd', showed that Apache 1.3.0 (among other servers) has trouble above 125 connections on Solaris 2.6. The number of clients served by Apache in the Ziff-Davis tests above was 40 or fewer, below the knee found by Poskanzer. By contrast, in the Mindcraft tests (and in the IIS SPECWeb96 results), the server was asked to handle over 150 clients, above the point where Poskanzer saw the dropoff.

Also, the January Ziff-Davis benchmarks used a server with only 64 megabytes of RAM, not enough to hold both the server code and the 60 megabyte WebBench 2.0 document tree used by both Mindcraft and Ziff-Davis, whereas Mindcraft used 960 megabytes of RAM.

So it's not surprising that the Jan '99 Ziff-Davis and April '99 Mindcraft tests of Apache got different results.

Does it matter?

These benchmarks are done on static pages, using very little of Apache's dynamic page generation power. Christopher Lansdown points out that the performance levels reached in the test correspond to sites that receive over a million hits per day on static pages. It's not clear that the results of such a test have much relevance to typical big sites, which tend to use a lot of dynamically generated pages.

Another objection to these benchmarks is that they don't accurately reflect the real world of many slow connections. A realistic benchmark for a heavily-trafficked Web server would involve 500 or 1000 clients all restricted to 28.8 or 56 Kbps, not 100 or so clients connected via 100baseT.

A benchmark that aims to deal with both of these concerns is the new SPECWeb99 benchmark. When it becomes available, it looks like it will set the standard for how web servers should be benchmarked.

Nevertheless, Linus seems to feel that until more realistic benchmarks (like SPECWeb99) become available, benchmarks like the one Mindcraft ran are an understandable if dirty compromise.

Kernel issue #1: TCP bug

Why does Apache's performance fall off above 160 active connections in the tests above? It appears the steep falloff may have been due to a TCP stack problem reported by ariel@sgi.com and later by Karthik Prabhakar:

ariel@sgi.com (Ariel Faigon) reported on 3 May 1999 (updates added):
"A couple of items you may find interesting.
1. For a long time the web performance team at SGI has noted that among the three web servers we have been benchmarking: Apache, Netscape (both enterprise and fasttrack), and Zeus, Apache is (by far) the slowest. In fact an SGI employee (Mike Abbott) has done some optimizations which made Apache run 3 (!) times faster on SPECWeb 96 on IRIX. It is our intention to make these patches public soon. [They are now online at oss.sgi.com/projects/apache.]
2. When we tried to test our Apache patches on IRIX the expected 3x speedup was easy to achieve. However when we ported our changes to Linux (2.2.5), we were surprised to find that we don't even get into the scalability game. A 200ms delay in connection establishment in the TCP/IP stack in Linux 2.x was preventing Apache to respond to anything more than 5 connections per second. We have been in touch with David Miller on this and sent him a patch by Feng Zhou which eliminates this bottleneck. This patch ... has made it into the [2.2.7] kernel [after some modifications by David Miller]. So now we are back into optimizing Apache. ..." [The patch affected the files tcp.h and tcp_ipv4.c, e.g. it notes that Nagle shouldn't be used for the final FIN packet.]
  (6 May 1999): "As for our changes to Apache. They are much more significant and make Apache run 3 times faster on SPECWeb 96. I just talked to the author and made sure we are releasing them to the Apache group when we're ready. We just don't want to be too hasty in this. We want to make it right, clean and accepted by the Apache guys. The 'patch' here is pretty big. ... It includes:
  - A page cache
  - Performance tunables adjusted to the max
  - Changed some critical algorithms with faster ones (this is long, I hope to have more details when we release).
  - Eliminated system calls where they weren't needed (so Apache is less dependent on the underlying OS)"
Karthik Prabhakar reports on a problem with Apache 1.3.4 and 1.3.6 on Linux kernel 2.2.5:
"I've seen load fall off well below 160 clients (for eg., 3 specweb clients with 30 processes each). I can't explain it yet, especially the fact that the performance stays low even after the test concludes. This behavior seems limited to apache."
He has reported this as a bug to the Apache bug tracking system; see Apache bug #4268.
"The mystery continues. I got round to trying out 1.3.6 again this evening, this time on 2.2.7. I did _not_ see the performance drop off. Just to verify, I rechecked on the stock 2.2.5 kernel, and the drop off is there.
So _something_ has been fixed between 2.2.5 and 2.2.7 that has made this problem go away. I'll keep plugging away as I get spare time to see if I can get the problem to occur. ...
Compiling 1.3.6 in a reasonable way, along with a few minor tweaks in linux 2.2.7 gives me about 2-4 times the peak performance of the default 1.3.4 on 2.2.5. I simply compiled [Apache] with pretty much all modules disabled.... I'm using the highperformance-conf.dist config file from the distribution."
See also Karthik's post on linux-kernel and its followups.
This sounds rather like the behavior Mindcraft reported ("After the restart, Apache performance climbed back to within 30% of its peak from a low of about 6% of the peak performance").
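
The "5 connections per second" figure in the SGI report is what a 200ms stall gives if connection setups are handled strictly one after another. A minimal sketch of that arithmetic follows; the function name is illustrative, and the strictly serial model is an assumption rather than something the report spells out.

```python
def max_serial_connections_per_second(delay_seconds: float) -> float:
    """Upper bound on connection rate when every setup waits `delay_seconds`
    and setups are handled one at a time (assumed serial model)."""
    if delay_seconds <= 0:
        raise ValueError("delay must be positive")
    return 1.0 / delay_seconds


if __name__ == "__main__":
    # 200 ms per connection setup, as described in the report above.
    print(max_serial_connections_per_second(0.200))
```

Under this model, no amount of Apache tuning raises the ceiling; only removing the delay in the TCP stack does, which is consistent with the jump Karthik saw between 2.2.5 and 2.2.7.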

Kernel issue #2: Wake-One vs. the Thundering Herd

(Note: According to the Linux Scalability Project's paper on the thundering herd problem, a "task exclusive" wake-one patch is now integrated into the 2.3 kernel; however, according to Andrea, as of 2.4.0-test10, it still wakes up processes in the same order they were put to sleep, which is not optimal from a caching point of view. The reverse order would be better.

See also Nov 2000 measurements by Andrew Morton (andrewm@uow.edu.au); post 1, post 2, and Linus' reply.)

Phillip Ezolt, 5 May 1999, in linux-kernel ( "Overscheduling DOES happen with high web server load."):
"When running a SPECWeb96 strobe run on Alpha/linux, I found that when the CPU is pegged, 18% of the time is spent in the scheduler."
(Russinovich talked about something like this in his critique of Linux.)
This post started a very lively thread in linux-kernel (now in its second week). It looks like the scheduler (and possibly Apache) is in for some changes.
Rik van Riel, 6 May 1999, in linuxperf (Re: [linuxperf] Possible fix for Mindcraft Apache problem):
... The main bug with the web benchmark remains. The way Linux and Apache 'cooperate', there's a lot of trouble with the 'thundering herd' problem. That is, when a signal comes in, all processes are woken up and the scheduler has to select one from the dozens of new runnable processes.... The real solution is to go from wake-all semantics to a wake-one style so we won't have the enormous runqueues the guy at DEC [Phillip Ezolt] experienced. The good news is that it's a simple patch that can probably be fixed within a few days...
Tony Gale, 6 May 1999, in linuxperf ( Re: [linuxperf] Possible fix for Mindcraft Apache problem):
Apache uses file locking to serialise access to the accept call. This can be very expensive on some systems. I haven't found the time to run the numbers on Linux yet for the 10 or so different server models that can be employed to see which is the most efficient. Check Stephens UNPv1 2nd Edition Chapter 27 for details.
Andrea Arcangeli, May 12th, 1999, in linux-kernel ( [patch] wake_one for accept(2) [was Re: Overscheduling DOES happen with high web server load.] and 2.2.8_andrea1.bz2):
I released a new andrea-patch against 2.2.8. This new one has my new wake-one on accept(2) strightforward code (but to get the improvement you must make sure that your apache tasks are sleeping in accept(2), a strace -p `pidof apache` should tell you that).
The patch is linked to from here.
David Miller's reply to the above:
... on every new TCP connection, there will be 2 spurious and unnecessary wakeups, and these originate in the write_space socket callback because as we free up the SYN frames we wakeup listening socket sleepers. I've been working today on solving this very issue.
Ingo Molnar, May 13th, 1999, in linux-kernel ( Re: [RFT] 2.2.8_andrea1 wake-one [Re: Overscheduling DOES happen with high web server load.]):
note that pre-2.3.1 already has a wake-one implementation for accept() ... and more coming up.
Phillip Ezolt (ezolt@perf.zko.dec.com), May 14th, 1999, in linux-kernel ( Great News!! Was: [RFT] 2.2.8_andrea1 wake-one ):
I've been doing some more SPECWeb96 tests, and with Andrea's patch to 2.2.8 (ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.8_andrea1.bz)
**On identical hardware, I get web-performance nearly identical to Tru64!** ...
Tru64 ~4ms
2.2.5 ~100ms
2.2.8 ~9ms
2.2.8_a ~4ms
... Time spent in schedule has decreased, as shown by this Iprobe data: ...
The number of SPECWeb96 MaxOps per second have jumped has well.
**Please, put the wakeone patch into the 2.2.X kernel if it isn't already. **
Larry Sendlosky tried this patch, and says:
Your 2.2.8 patch really helps apache performance on a single cpu system, but there is really no performance improvement on a 2 cpu SMP system.
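
To make the difference between wake-all and wake-one concrete, here is a small counting model of the behavior Rik van Riel describes. It is only a sketch: it assumes every worker is idle and blocked in accept() when each connection arrives, and the function name and parameters are made up for illustration.

```python
def count_wakeups(num_workers: int, num_connections: int, wake_one: bool):
    """Return (total_wakeups, wasted_wakeups) for idle workers blocked in accept().

    Simplifying assumption: all workers are asleep again before the next
    connection arrives, so each arrival sees the full set of sleepers.
    """
    total = 0
    wasted = 0
    for _ in range(num_connections):
        woken = 1 if wake_one else num_workers
        total += woken
        # Only one woken worker can actually take the connection;
        # the rest were scheduled for nothing and go back to sleep.
        wasted += woken - 1
    return total, wasted


if __name__ == "__main__":
    print("wake-all:", count_wakeups(30, 100, wake_one=False))
    print("wake-one:", count_wakeups(30, 100, wake_one=True))
```

With 30 sleeping workers, wake-all schedules 30 processes per connection and 29 of them find nothing to do, which is the kind of runqueue inflation Phillip Ezolt measured as time spent in the scheduler.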

The latest version of the wake-one patch is listed below.

Kernel issue #3: SMP Bottlenecks in 2.2 Kernel

Juergen Schmidt, May 19th, 1999, in linux-kernel ( Bad apache perfomance wtih linux SMP), asked what could make Apache do poorly under SMP.
Andi Kleen replied:
One culprit is most likely that the data copy for TCP sending runs completely serialized. This can be fixed by replacing the
```
skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
```
in tcp.c:tcp_do_sendmsg with
```
unlock_kernel(); 
skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
lock_kernel(); 
```
The patch does not violate any locking requirements in the kernel...
[To fix your connection refused errors,] try:
```
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
```
Overall it should be clear that the current Linux kernel doesn't scale to CPUs for system load (user load is fine). I blame the Linux vendors for advertising it, although it is not true. ... Work to fix all these problems is underway [2.3 will be fixed first, then the changes will be backported to 2.2].
[Note: Andi's TCP unlocking fix appears to be in 2.2.9-ac3.]
Andrea Arcangeli responded describing his own version of this fix ( ftp://ftp.suse.com/pub/people/andrea/kernel/2.3.3_andrea2.bz2 ) as less cluttered:
If you look at my patch (the second one, in the first one I missed the reaquire_kernel_lock done before returning from schedule, woops :) then you'll see my approch to address the unlock-during-uaccess. My patch don't change tcp/ip ext2 etc... but it touch only uaccess.h and usercopy.c. I don't like to put unlock_kernel all over the place.
Juergen Schmidt, 26 May 1999, on linux-kernel and new-httpd, ( Linux/Apache and SMP - my fault ), retracted his earlier problem report:
I reported "disastrous" performance for Linux and Apache on a SMP system.
To doublecheck, I've downloaded a clean kernel source (2.2.8 and 2.2.9) and had to realize, that those do *not* show the reported penalty when running on SMP systems.
My error was to use the installed kernel sources (which I patched from 2.2.5 to 2.2.8 - after seeing the first very bad results). But those sources already had been modified before the machine came to me. Should have thrown them away in the first place :-( ...
Please take my excuses for this confusion.
Others have reported modest performance gains (20% or so) with Andrea's SMP fix, but only when serving largish files (100 kilobytes).
Juergen has now finished his testing. Unfortunately, he neglected to compile Apache with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT, which (according to Andrea) significantly hurt Apache performance.
If Juergen missed that, it means it's too hard to figure out. To make it easier to get good performance in the future, we need the wake-one patch added to a stable kernel (say, 2.2.10), and we need Apache's configuration script to notice that Apache is being compiled for a 2.2.10 or later kernel, and automatically select SINGLE_LISTEN_UNSERIALIZED_ACCEPT.
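
The version check such a configuration script would need is simple, but it has to compare versions numerically rather than as strings. The following is a hypothetical sketch, not Apache's actual configure logic; the 2.2.10 threshold is the "say, 2.2.10" guess from the paragraph above, and the function names are invented.

```python
import re

# Hypothetical threshold: the first stable kernel assumed to carry wake-one.
WAKE_ONE_MIN_VERSION = (2, 2, 10)


def parse_kernel_version(release: str) -> tuple:
    """Turn a release string such as '2.2.10' or '2.3.3-ac1' into (2, 2, 10)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def extra_cflags(release: str) -> list:
    """Compiler flags to add for the given kernel release (illustrative only)."""
    if parse_kernel_version(release) >= WAKE_ONE_MIN_VERSION:
        return ["-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT"]
    return []


if __name__ == "__main__":
    for release in ("2.2.5", "2.2.9", "2.2.10", "2.3.3-ac1"):
        print(release, extra_cflags(release))
```

Comparing tuples of integers matters here: as plain strings, "2.2.9" sorts after "2.2.10". On a running system the release string could come from `platform.release()` or `uname -r`.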

Other Apache users getting help solving performance problems

Mike Whitaker (mike@altrion.org), 22 May 1999, in linuxperf ( High load under Apache1.3.3/mod_perl 1.16/Linux 2.2.7 SMP ), described an interesting performance problem:
Our typical webserver is a dual PII450 with 1G, and split httpd's, typically 300 static to serve the pages and proxy to 80-100 dynamic to serve the mod_perl adverts. Unneeded modules are diabled and hostname lookups turned off, as any sensible person would.
There's typically between one and three mod_perl hits/page on top of the usual dozen or so inline images...
The kernel (2.2.7) has MAX_TASKS upped to 4090, and the unlock_kernel/lock_kernel around csum_and_copy_from_user() in tcp_do_sendmsg that Andi Kleen suggested.
Performance is .. interesting. Load on the machine fluctuates between 10 and 120, while the user CPU goes from 15% (80% idle) to 180% (0% idle, machine *crawling*), about once every minute and a half. vmstat shows the number of processes in a run state to range from 0 (when load is low) to 30-40, and the static servers manage a mighty 60-70 peak hits/sec. Without the dynamic httpd's everything *flies*...
After being advised to try a kernel with wake-one support, he wrote back:
We're up with 2.3.3 plus Andi Kleen's tcp_do_sendmsg patch plus Apache sleeping in accept() on one production server, and comparing it against a 2.2.7 plus tcp_do_sendmsg patch plus Apache sleeping in flock(). Identical systems (dual PII450, 1G, two disk controllers).
As far as I can *tell*, the wake-one patch is definitely doing its stuff: the 2.2.7 machine still has cycles of load into three figures, and the 2.3.3 machine hasn't actully managed a load of 1 yet.
UNFORTUNATELY, observation suggests that the 2.3.3 machine/Apache combination is dropping/ignoring about one connection in ten, maybe more. (Network error: connection reset by peer.)
His next update, on May 25th, reads:
More progress from the bleeding edge:
(Reminder: the config here is split static/mod_perl httpd's, with a pretty CPU-intensive mod_perl script serving ads as an SSI as the probable bottleneck)
Linux kernel 2.2.9 plus the 2.2.9_andrea3 (wake-one) patch seems to do the trick: can handle hits at a speed which suggests it's pushing the adverser to close to its observed maximum. (As I said in a previous note, avoid 2.2.8 like the plague: it trashes HDs - see threads on linux-kernel for details.)
However... When it *does* get overstressed, BOY does it get overstressed. Once the idle CPU drops to zero (i.e. its spending most of its time processing advert requests, everything goes unpleasantly pearshaped, with a load of 400+, and the number of httpd's on both types of server *well* above MaxClients (in fact, suspiciously close to MaxClients + MinSpareServers). Spikes in demand can cause this, and once you get into this state, getting out again under the load of prgressively more backlogging requests is not easy: in fact from experience the only way is to which the machine out of the (hopefully short TTL) DNS round-robin while it dies down.
The potentially counterintuitive step at this point is to *REDUCE* MaxClients, and hope that the tcp Listen queue will handle a load surge. Experience suggests this does in fact work.
(Aside: this is a perfect case for using something like Eddieware's load balancing DNS).
Eric Hicks, 26 May 1999, in linux-kernel ( Apache/kernel problem? ):
... I'm having some big problems in which it appears that a single PII 400Mhz or a single AMD 400 will outrun a dual PII 450 at http requests from Apache. ...
Data for HTTP Server Tests: 100 1MByte mpeg files stored on local disks. Results:
- AMD 400Mghz K6, 128MB, Linux 2.0.36; handles 1000 simultaneous clients @ 57.6Kbits/sec.
- PII 400Mghz, 512MB, Linux 2.0.36; handles 1000 simultaneous clients @ 57.6Kbits/sec.
- Dual PII/450Mghz, 512MB, Linux 2.0.36 and 2.2.8; handles far less than 300 simultaneous @57.6Kbits/sec [and even then, clients were seeing 5 second connection delays, and 'reset by peer' and 'connection time out' errors]
I advised him to try 2.2.9_andrea3; he said he'd try it and report back.

Kernel issue #4: Interrupt Bottlenecks

According to Zach, the Mindcraft benchmark's use of four Fast Ethernet cards and a quad SMP system exposes a bottleneck in Linux's interrupt processing; the kernel spent a lot of time in synchronize_bh(). (A single Gigabit Ethernet card would stress this bottleneck much less.) According to Mingo, TCP throughput scales much better with number of CPUs in 2.3.9 than it did in 2.2.10, although he hasn't tried it with multiple Ethernets yet.

See also comments on reducing interrupts under heavy load by Steve Underwood and Steven Guo.

See also Linus's "State of Linux" talk at Usenix '99 where he talks about the Mindcraft benchmark and SMP scalability.

Softnet is coming! Kernel 2.3.43 adds the new softnet networking changes. Softnet changes the interface to the networking cards, so every single driver needs updating, but in return network performance should scale much better on large SMP systems. (For more info, see Alexey's readme.softnet, his softnet-howto, or his Feb 15 post about how to convert old drivers.)

The Feb '00 thread Gigabit Ethernet Bottlenecks (especially its second week) has lots of interesting tidbits about what interrupt (and other) bottlenecks remain, and how they are being addressed in the 2.3 kernel.

Ingo Molnar's post of 27 Feb 2000 describes interrupt-handling improvements in the IA32 code in great detail. These will be moving into the core kernel in 2.5, it seems.

Kernel issue #5: Mysterious network slowdown

This one is a bug, not a scalability issue.
Several 2.2 users have reported that sometimes networking slows down to 1 to 10% of normal, with high ping times, and that cycling the interface fixes the problem temporarily.

Øystein Svendsen reported on 29 June 1999:
After upgrading to the 2.2 series we have from time to time experienced severe slow-downs on the TCP performance... The performance goes back to normal when I take down the interface and reinsert the eepro100 module into the kernel. After I've done that, the performance is fine for a couple of days or maybe weeks.
David Stahl reported on 29 June 1999:
I've got 3 machines running 2.2.10 [with multiple] 3COM 3C905/905b PCI [cards]... After approximately 2 days of uptime, I will start to see ping times on the local lan jump to 7-20 seconds. As others have noticed, there is no loss -- just some damn high latency. ... It seems to be dependant upon the network load -- lighter loads lead to longer periods between problems. The problem ALSO is gradual -- it'll start at 4 second pings, then 7 second pings about 20 minutes later, than 30 minutes later it's up to 12-20 seconds.
Another eepro100 report.
A tulip report. Less repeatable.
David Stahl wrote on 13 July 1999:
What DID fix the problem was a private reply from someone elese (sorry about the credit, but i'm not in the mood to sieve 10k emails right now), to try the alpha version of the latest 3c59x.c driver from Donald Becker (http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html).
3c59x.c:v0.99L 5/28/99 is the version that fixed it, from ftp://cesdis.gsfc.nasa.gov/pub/linux/drivers/test/3c59x.c
On 23 Sep 1999, Alexey posted a one-line patch that clears up a similar mysterious slowdown. 2.2.13 and Red Hat 6.1 already have this patch applied. On three Red Hat 6.0 systems I know of with Masq support compiled in, connected to cable modems, this patch fixed a bug which caused very high pings after even short bursts of heavy TCP transfers to distant hosts.
Rickard Cedergren and Michael Brown reported about October 21st on linux-kernel that although Alexey's patch greatly reduced the problem, it is not totally gone.
Tony Hoyle is also seeing occasional long delays with 2.2.13.
Jeremy Fitzhardinge reported another big delay; the replies say it's likely caused by a particular Tulip driver.

Kernel issue #6: 2.2.x/NT TCP slowdown

Petru Paler, July 10 1999, in linux-kernel ( [BUG] TCP connections between Linux and NT ) reported that any kind of TCP connection between Linux (2.2.10) and an NT Server 4 (Service Pack 5) slows down to a crawl. The problem was much milder (6kbytes/sec) with 2.0.37. He included a log of a slow connection made with tcpdump, which helped Andi Kleen see that NT was taking a long time to ACK a data packet, which was causing Linux to throttle back.
Solved: false alarm! It wasn't Linux' fault at all. Turns out NT needed to be told to not use full duplex mode on the ethernet card.

Kernel issue #7: Scheduler

Phil Ezolt, 22 Jan 2000, in linux-kernel ( Re: Interesting analysis of linux kernel threading by IBM):

When I run SPECWeb96 tests here, I see both a large number of running process and a huge number of context switches. ... Here's a sample of the vmstat data:
procs            memory            swap    io      system      cpu
 r b w swpd    free   buff   cache si so bi   bo    in    cs us sy id
...
24 0 0 2320 2066936 590088 1061464  0  0  0    0  8680  7402  3 96  1
24 0 0 2320 2065752 590664 1061464  0  0  0 1095 11344 10920  3 95  1
Notice. 24 running process and ~7000 context switches.
That is a lot of overhead. Every second, 7000*24 goodnesses are calculated. Not the (20*3) that a desktop system sees. This is a scalability issue. A better scheduler means better scalability.
Don't tell me benchmark data is useless. Unless you can give me data using a real system and where it's faults are, benchmark data is all we have.
SPECWeb96 pushes Linux until it bleeds. I'm telling you where it bleeds. You can fix it or bury your head in the sand. It might not be what your system is seeing today, but it will be in the future.
Would you rather fix it now or wait until someone else how thrown down the performance gauntelet?
...
Here's a juicy tidbit. During my runs, I see 98% contention on the [2.2.14] kernel lock, and it is accessed ALOT. I don't know how 2.3.40 compares, because I don't have big memory support for it. Hopefully, Andrea will be kind enough give me a patch, and then I can see if things have improved.

[Phil's data is for the web server undergoing the SPECWeb96 test, which is an ES40 4 CPU alpha EV6 running Redhat 6.0 w/kernel v2.2.14 and Apache-v1.3.9 w/SGI performance patches; the interfaces receiving the load are two ACENic gigabit ethernet cards.]
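
Phil's "7000*24" figure is the number of context switches per second multiplied by the number of runnable processes the scheduler has to evaluate on each switch. A quick sketch of that arithmetic follows; reading his "(20*3)" desktop figure as 20 switches per second with 3 runnable processes is an assumption, and the function name is illustrative.

```python
def goodness_evaluations_per_second(context_switches_per_second: int,
                                    runnable_processes: int) -> int:
    """Cost model from the quote: one goodness calculation per runnable
    process on every context switch."""
    return context_switches_per_second * runnable_processes


if __name__ == "__main__":
    server = goodness_evaluations_per_second(7000, 24)  # SPECWeb96 run above
    desktop = goodness_evaluations_per_second(20, 3)    # assumed desktop load
    print(server, desktop, server // desktop)
```

The cost grows with both the switch rate and the run queue length, which is why a scheduler whose per-switch work does not depend on the number of runnable processes scales better under this kind of load.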

Kernel issue #8: SMP Bottlenecks in 2.4 kernel

Manfred Spraul, April 21, 2000, in linux-kernel ( [PATCH] f_op->poll() without lock_kernel()):

kumon@flab.fujitsu.co.jp noticed that select() caused a high contention for the kernel lock, so here is a patch that removes lock_kernel() from poll(). [tested] with 2.3.99-pre5.

There was some discussion about whether this was wise at this late date, but Linus and David Miller were enthusiastic. Looks like one more bottleneck bites the dust.

On 26 April 2000, kumon@flab.fujitsu.co.jp posted benchmark results in Linux-Kernel with and without the lock_kernel() in poll(). The followups included a kernel patch to improve checksum performance and a patch for Apache 1.3 to force it to align its buffers to 32-word boundaries. The latter patch, by Dean Gaudet, earned praise from Linus, who relayed rumors that this can speed up SPECWeb results by 3%. This was an interesting thread.

See also LWN's coverage, and the paragraph below, in which Kumon presents some benchmark results and another patch.

Kernel issue #9: csum_partial_copy_generic

kumon@flab.fujitsu.co.jp, 19 May 2000, in linux-kernel ( [PATCH] Fast csum_partial_copy_generic and more ) reports a 3% reduction in total CPU time compared to 2.3.99-pre8 on i686 by optimizing the cache behavior of csum_partial_copy_generic. The workload was ZD's WebBench. He adds

The benchmark we used has almost same setting as the MINDCRAFT ones, but the apache setting is [changed] slightly not to use symlink checking.
We used maximum of 24 independent clients and number of apache processes is 16. A four-way XEON procesor system is used, and the performance is twice and more than a single CPU performance.

Note that in ZD's benchmarks with 2.2.6, a 4 CPU system only achieved a 1.5x speedup over a single CPU. Kumon is reporting a > 2x speedup. This appears to be about the same speedup NT 4.0sp3 achieved with 4 CPUs at that number of clients (24). It's encouraging to hear that things may have improved in the 11 months since the 2.2.6 tests. When I asked him about this, Kumon said

Major improvement is between pre3 and pre5, poll optimization. Until pre4 (I forget exact version), kernel-lock prevents performance improvement.
If you can retrieve l-k mails around Apr 20-25, the following mails will help you understand the background.
subject: namei() query
subject: [PATCH] f_op->poll() without lock_kernel()
subject: lockless poll() (was Re: namei() query)
subject: "movb" for spin-unlock (was Re: namei() query)

On 4 Sept 2000, kumon posted again, noting that his change still hadn't made it into the kernel.
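
For readers unfamiliar with what a routine like csum_partial_copy_generic does, the idea is to compute the TCP/IP checksum in the same pass that copies the data, so each byte is touched only once. The sketch below shows that idea using the 16-bit one's-complement sum described in RFC 1071. It is not the kernel routine, which is architecture-specific and differs in detail, and it says nothing about the cache optimization in Kumon's patch; the function name is made up.

```python
def csum_and_copy(src: bytes, dst: bytearray) -> int:
    """Copy src into dst and return the folded 16-bit one's-complement sum
    of src (big-endian words, not inverted), all in a single pass."""
    n = len(src)
    if len(dst) < n:
        raise ValueError("destination buffer too small")
    total = 0
    # Walk the data two bytes at a time, copying and summing together.
    for i in range(0, n - 1, 2):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        total += (src[i] << 8) | src[i + 1]
    if n % 2:
        # Odd trailing byte is treated as the high byte of a padded word.
        dst[n - 1] = src[n - 1]
        total += src[n - 1] << 8
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


if __name__ == "__main__":
    data = bytes.fromhex("0001f203f4f5f6f7")  # example bytes from RFC 1071
    out = bytearray(len(data))
    print(hex(csum_and_copy(data, out)), bytes(out) == data)
```

Because this loop runs over every byte a web server sends, even a small per-byte saving shows up in total CPU time, which is why a cache-behavior fix here could plausibly yield the 3% Kumon reports.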

Kernel issue #10: getname(), poll() optimizations

On 22 May 2000, Manfred Spraul posted a patch on linux-kernel which optimized kmalloc(), getname(), and select() a bit, speeding up apache by about 1.5% on 2.3.99-pre8.

Kernel issue #11: Reducing lock contention, poll overhead in 2.4

On 30 May 2000, Alexander Viro posted a patch that got rid of a big lock in close_flip() and _fput(), and asked for testing. kumon ran a benchmark, and reported:

I measured viro's ac6-D patch with WebBench on 4cpu Xeon system. I applied to 2.4.0-test1 not ac6. The patch reduced 50% of stext_lock time and 4% of the total OS time. ... Some part of kmalloc/kfree overhead is come from do_select, and it is easily eliminated using small array on a stack.

kumon then posted a patch that avoids kmalloc/kfree in select() and poll() when # of fd's involved is under 64.

Kernel issue #12: Poor disk seek behavior in 2.2, new elevator code in 2.4

On 20 July 2000, Robert Cohen (robert@coorong.anu.edu.au) posted a report on linux-kernel listing netatalk (AppleTalk file sharing) benchmarks comparing 2.0, 2.2, and several versions of 2.4.0-pre. The elevator code in 2.4 seems to help (some versions of 2.4 can handle 5 benchmark clients instead of 2) but ...

The more recent test4 and test5pre2 don't fair quite so well. They handle 2 clients on a 128 Meg server fine, so they're doing better than 2.2 but they choke and go seek bound with 4 clients. So something has definitely taken a turn for the worse since test1-ac22.

Here's an update. The *only* 2.4 kernel versions that could handle 5 clients were 2.4.0-test1-ac22-riel and 2.4.0-test1-ac22-class 5+; everything before and after (up to 2.4.0-test5pre4) can only handle 2.

On 26 Sept 2000, Robert Cohen posted an update which included a simple program to demonstrate the problem, which appears to be in the elevator code. Jens Axboe (axboe@suse.de) responded that he and Andrea had a patch almost ready for 2.4.0-test9-pre5 that fixes this problem.

On 4 Oct 2000, Robert Cohen posted an update with benchmark results for many kernels, showing that the problem still exists in 2.4.0-test9.

Kernel issue #13: Fast Forwarding / Hardware Flow Control

On 18 Sept 2000, Jamal (hadi@cyberus.ca) posted a note on linux-kernel describing proposed changes to the 2.4 kernel's network driver interface; the changes add hardware flow control and several other refinements. He says:

Robert Olson and I decided after the OLS that we were going to try to hit the 100Mbps(148.8Kpps) routing peak by year end. I am afraid the bar has been raised. Robert is already hitting with 2.4.0-test7 ~148Kpps with a ASUS CUBX motherboard carrying PIII 700 MHZ coppermine with about 65% CPU utilization. With a single PII based Dell machine i was able to get a consistent value of 110Kpps.
So the new goal is to go to about 500Kpps ;-> (maybe not by year end, but surely by that next random Linux hacker conference)
A sample modified tulip driver (hacked by Alexey for 2.2 and mod'ed by Robert and myself over a period of time) is supplied as an example on how to use the feedback values. ...
I believe we could have done better with the mindcraft tests with these changes in 2.2 (and HW FC turned on).
[update] BTW, I am informed that Linux people were _not_ allowed to change the hardware for those tests, so I dont think they could have used these changes if they were available back then.

Kernel tuning issue: hitting TIME_WAIT

On 30 March 2000, Takashi Richard Horikawa posted a report on linux-kernel listing SPECWeb96 results for both the 2.2.14 and 2.3.41 kernels. Performance between a 2.2.14 client and a 2.2.14 server was poor because the range of local ports in use was so small that a port number was often still in TIME_WAIT when it was needed again for a new connection. The moral of the story may be to tune the clients and servers to use as large a port range as possible, e.g. with

echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

to avoid bumping into this situation when trying to simulate large numbers of clients with a small number of client machines.

On 2 April 2000, Mr. Horikawa confirmed that increasing the local port range with the above command solved the problem.
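
A rough sketch of the arithmetic behind this advice, under stated assumptions: if each closed connection keeps its local port in TIME_WAIT for a fixed period, then one client machine talking to one server address and port can open at most (number of local ports) / (TIME_WAIT seconds) new connections per second. The 60-second TIME_WAIT period and the 1024-4999 "narrow" range below are assumptions chosen for illustration, not figures from Mr. Horikawa's report, and `max_conn_rate` is a hypothetical helper name.

```shell
#!/bin/sh
# Rough upper bound on new connections per second from ONE client address
# to ONE server address:port, assuming every closed connection parks its
# local port in TIME_WAIT for a fixed number of seconds.
# usage: max_conn_rate LOW_PORT HIGH_PORT TIME_WAIT_SECONDS
max_conn_rate() {
    ports=$(( $2 - $1 + 1 ))      # size of the local port range
    echo $(( ports / $3 ))        # integer division, rounds down
}

# Assumed values: 1024-4999 as a narrow range, 60 s spent in TIME_WAIT.
echo "narrow range: $(max_conn_rate 1024 4999 60) conn/s"
echo "wide range:   $(max_conn_rate 1024 65535 60) conn/s"

# Show the range currently in effect, if this system exposes it.
f=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$f" ]; then
    echo "current range: $(cat "$f")"
fi
```

Under these assumptions the narrow range tops out at roughly 66 new connections per second per client machine, while the wide range allows roughly 1075, which is why a handful of client machines simulating many users can run out of ports long before the server is saturated.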

Suggestions for future benchmarks

Become familiar with linux-kernel and the Apache mailing lists as well as the Linux newsgroups on Usenet (try DejaNews power searches in forums matching '*linux*').
Post your proposed configuration and see whether people agree with it. Also, be open about your benchmark; post intermediate results, and see if anyone has suggestions for improvements. You should probably expect to spend a week or so mulling over ideas with these mailing lists during the course of your tests.

If possible, use a modern benchmark like SPECWeb99 rather than the simple ones used by Mindcraft.

It might be interesting to inject latency into the path between the server and the clients to more realistically model the situation on the Internet.

Benchmark both single and multiple CPUs, and single and multiple Ethernet interfaces, if possible. Be aware that the networking performance of version 2.2.x of the Linux kernel does not scale well as you add more CPUs and Ethernet cards. This applies mostly to static pages and cached dynamic pages; noncached dynamic pages usually take a fair bit of CPU time, and should scale very well when you add CPUs. If possible, use a cache to save commonly generated pages; this will bring the dynamic page speeds closer to the static page speeds.

When testing dynamic content: Don't use the old model of running a separate process for each request; nobody running a big web site uses that interface anymore, as it's too slow. Always use a modern dynamic content generation interface (e.g. mod_perl for Apache).
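
The latency-injection suggestion above could be tried on later Linux kernels with the netem queueing discipline driven by the `tc` command. This is only a sketch: netem postdates the kernels discussed here, the interface name `eth0` and the 50ms delay are placeholders, and `netem_cmd` is a hypothetical helper. By default the script only prints the commands; it changes nothing unless `APPLY=1` is set and it is run as root.

```shell
#!/bin/sh
# Build (and optionally run) tc/netem commands that add a fixed delay to
# every packet leaving one interface. Running them needs root and a kernel
# that provides the netem queueing discipline.
# usage: netem_cmd add|del IFACE [DELAY]
netem_cmd() {
    case "$1" in
        add) echo "tc qdisc add dev $2 root netem delay $3" ;;
        del) echo "tc qdisc del dev $2 root netem" ;;
        *)   echo "usage: netem_cmd add|del IFACE [DELAY]" >&2; return 1 ;;
    esac
}

# eth0 and 50ms are placeholder values, not recommendations.
IFACE=${IFACE:-eth0}
DELAY=${DELAY:-50ms}

if [ "${APPLY:-0}" = 1 ]; then
    # Really install the delay on $IFACE.
    $(netem_cmd add "$IFACE" "$DELAY")
else
    # Default: only print what would be run, plus the command to undo it.
    netem_cmd add "$IFACE" "$DELAY"
    netem_cmd del "$IFACE"
fi
```

Applying the delay on the client machines rather than on the server keeps the extra queueing work off the system being measured.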

Configuring Linux

Tuning problems probably accounted for less than a 20% performance decrease in Mindcraft's test, so as of 3 October 1999, most people will be happy with a stock 2.2.13 kernel or whatever comes with Red Hat 6.1. The 2.4 kernel, when it's available, will help with SMP performance.

Here are some notes if you want to see what people going for the utmost were trying in June 1999:

As of June 1, Linux kernel 2.2.9 plus 2.2.9_andrea3 have been mentioned as performing well on a dual-processor task (see above). (2.2.9_andrea3 seems to include both a wake-one scheduler fix and an SMP unlock_kernel fix.) (andrea3 only works on x86, I hear, so people with Alphas or PPCs will have to apply some other wake-one and tcp copy kernel_unlock patch.)
Jan Gruber writes: "the 2.2.9_andrea3-patch doesn't compile with SMP Support disabled. Andrea told me to use ftp://ftp.suse.com/pub/people/andrea/kernel-patches/2.2.9_andrea-VM4.gz instead."
On 7 June, Andrea Arcangeli asked:
If you are going to do bench I would like if you would bench also the patch below.
ftp://e-mind.com/pub/andrea/kernel-patches/2.2.9_andrea-perf1.gz
On 11 Oct 1999, Andrea Arcangeli posted his list of pending 2.2.x patches, waiting to go into 2.2.13 or so. This includes several that might help performance of SMP systems and systems undergoing heavy I/O. You might consider trying these if you run into bottlenecks.
The truly daring may wish to try using the kernel-mode http server, khttpd, as a front-end for Apache. It accelerates static web page fetches greatly. It's at version 0.1, so use caution.
linux-kernel ( week 1, week 2 ) is currently (8 June 1999) discussing benchmarking Apache. Linus Torvalds is in principle bullish on using khttpd or something like it, and points out that NT is doing the same kind of thing.

Configuring Apache

The usual optimizations should be applied: all unused modules should be left out when compiling, host name lookup should be disabled, and symbolic links should be followed (see http://www.apache.org/docs/misc/perf-tuning.html).
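
As a minimal sketch of the run-time part of these optimizations, the script below writes a small config fragment that turns off host name lookups and lets Apache follow symbolic links without extra checks. The `/www/htdocs` document root, the output file name, and the helper `write_perf_conf` are placeholders; `AllowOverride None` is an extra setting assumed here on the strength of the perf-tuning page, not something Mindcraft's configuration is known to have used. Leaving out unused modules is a compile-time step and is not shown.

```shell
#!/bin/sh
# Write an Apache config fragment with the run-time settings named above.
# usage: write_perf_conf OUTFILE
write_perf_conf() {
    cat > "$1" <<'EOF'
# Skip reverse DNS lookups on every request.
HostnameLookups Off
<Directory /www/htdocs>
    # Follow symlinks without extra per-path checks.
    Options FollowSymLinks
    # Do not look for .htaccess files in every directory.
    AllowOverride None
</Directory>
EOF
}

# The output location is a placeholder; adjust it to your own layout.
OUT=${OUT:-${TMPDIR:-/tmp}/perf.conf}
write_perf_conf "$OUT"
cat "$OUT"
```

The generated fragment would then be merged into, or included from, the main Apache config file.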

Apache should be compiled to block in accept, e.g.

 env CFLAGS='-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT' ./configure

The top_fuel patch (http://www.arctic.org/~dgaudet/apache/1.3/top_fuel.patch) may be worth applying. PC Week used top_fuel in their recent benchmarks. (See also interesting comments by Dean Gaudet in linux-kernel and new-httpd.) Supposedly, applying top_fuel.patch and using mod_mmap_static on a set of documents can reduce the number of syscalls per request from 18 to 9.
For static file benchmarks, try compiling mod_mmap_static into Apache (see http://www.apache.org/docs/mod/mod_mmap_static.html) and configuring Apache to memory-map the static documents, e.g. by creating a config file like this:
```
find /www/htdocs -type f -print | sed -e 's/.*/mmapfile &/' > mmap.conf
```
and including mmap.conf in your Apache config file.
Several people have mentioned that using Squid as a front-end to Apache would greatly accelerate static web page fetches.
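
To see what the mmap.conf step above produces before pointing it at a real document root, the same find/sed pipeline can be wrapped in a function and run on a throwaway directory. This is a sketch only: `make_mmap_conf` and the demo file names are hypothetical, and the pipeline assumes file names without embedded newlines.

```shell
#!/bin/sh
# Generate mod_mmap_static directives for every regular file under a
# document root, one "mmapfile" line per file.
# usage: make_mmap_conf DOCROOT OUTFILE
make_mmap_conf() {
    find "$1" -type f -print | sed -e 's/.*/mmapfile &/' > "$2"
}

# Demo on a scratch tree so nothing under a real document root is touched.
demo=$(mktemp -d)
mkdir -p "$demo/htdocs/img"
echo hello > "$demo/htdocs/index.html"
echo logo  > "$demo/htdocs/img/logo.gif"

make_mmap_conf "$demo/htdocs" "$demo/mmap.conf"
sort "$demo/mmap.conf"      # one "mmapfile <path>" line per file
rm -rf "$demo"
```

Because the list is a snapshot of the files that exist when the script runs, it would need to be regenerated, and Apache restarted, whenever documents are added or changed.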