Executive summary:
The Mindcraft benchmark proved to be a wake-up call, and the Linux
community responded effectively.
Several problems which caused Apache to run slowly on
Linux were found and resolved.
I've split the summary of recent benchmark results (including the Mindcraft benchmarks) off into a separate page, "NT vs. Linux Server Benchmark Graphs", at the request of readers who liked the data but weren't interested in the history of the Linux kernel hacking community's response to the benchmarks.
The responses generally claimed that Mindcraft had not configured Linux properly, and gave specific examples. Both Mindcraft and the Linux community agree that good tuning information for Linux is hard to find.
Mindcraft did not seem to take the time to become familiar with all the appropriate forums for discussion, and apparently did not respond to requests for further information (see section III of Eric Green's response). Others have had better success; in particular, three key kernel improvements all came about in the course of normal support activities on Usenet, the linux-kernel mailing list, and the Apache bug tracking database. I believe the cases illustrated below indicate that free 2.2.x kernel support is better than Mindcraft concluded.
Also, Mindcraft's April 13th press release neglected to mention that Microsoft sponsored the Mindcraft benchmarks, that the tests were carried out at a Microsoft lab, and that Mindcraft had access to the highest level of NT support imaginable.
Finally, Mindcraft did not try to purchase a support contract for Linux from e.g. LinuxCare or Red Hat, both of whom were offering commercial support at the time of Mindcraft's tests.
Mindcraft proposed repeating the benchmarks at an independent testing lab to address concerns that their testing was biased, but they have not yet addressed concerns that their conclusions about Linux kernel support are biased.
Yet Mindcraft found that Apache's performance falls off dramatically when there are more than 160 clients (see graph). Is this a contradiction?
Not really. The benchmarks done by Jef Poskanzer, the author of the high-performance server 'thttpd', showed that Apache 1.3.0 (among other servers) has trouble above 125 connections on Solaris 2.6. The number of clients served by Apache in the Ziff-Davis tests above was 40 or fewer, below the knee found by Poskanzer. By contrast, in the Mindcraft tests (and in the IIS SPECWeb96 results), the server was asked to handle over 150 clients, above the point where Poskanzer saw the dropoff.
Also, the January Ziff-Davis benchmarks used a server with only 64 megabytes of RAM, not enough to hold both the server code and the 60 megabyte WebBench 2.0 document tree used by both Mindcraft and Ziff-Davis, whereas Mindcraft used 960 megabytes of RAM.
So it's not surprising that the Jan '99 Ziff-Davis and April '99 Mindcraft tests of Apache got different results.
Another objection to these benchmarks is that they don't accurately reflect the real world of many slow connections. A realistic benchmark for a heavily-trafficked Web server would involve 500 or 1000 clients all restricted to 28.8 or 56 Kbps, not 100 or so clients connected via 100baseT.
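To see why slow clients change the picture, here is a rough back-of-the-envelope sketch. It assumes a 10,000-byte response and a rate of 100 requests per second, both chosen purely for illustration (neither figure comes from the benchmarks), and it ignores latency and protocol overhead. The function names are hypothetical.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Time to push size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / link_bits_per_second


def open_connections(requests_per_second: float, seconds_each: float) -> float:
    """Little's law: average open connections = arrival rate x time open."""
    return requests_per_second * seconds_each


PAGE_BYTES = 10_000      # assumed response size, for illustration only
REQUEST_RATE = 100.0     # assumed requests per second, for illustration only

modem_seconds = transfer_seconds(PAGE_BYTES, 28_800)       # 28.8 Kbps client
lan_seconds = transfer_seconds(PAGE_BYTES, 100_000_000)    # 100baseT client

if __name__ == "__main__":
    print("modem:", modem_seconds, "s per response,",
          open_connections(REQUEST_RATE, modem_seconds), "connections open")
    print("LAN:  ", lan_seconds, "s per response,",
          open_connections(REQUEST_RATE, lan_seconds), "connections open")
```

Under these assumptions the same request rate keeps hundreds of connections open with modem clients and almost none with LAN clients, which is the difference the objection is pointing at.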
A benchmark that aims to deal with both of these concerns is the new SPECWeb99 benchmark. When it becomes available, it looks like it will set the standard for how web servers should be benchmarked.
Nevertheless, Linus seems to feel that until more realistic benchmarks (like SPECWeb99) become available, benchmarks like the one Mindcraft ran are an understandable if dirty compromise.
Kernel issue #1: TCP bug
Why does Apache's performance fall off above 160 active connections in the tests above?
It appears the steep falloff may have been due to a TCP stack problem
reported by ariel@sgi.com and later by Karthik Prabhakar:
"A couple of items you may find interesting.
- For a long time the web performance team at SGI has noted that among the three web servers we have been benchmarking: Apache, Netscape (both enterprise and fasttrack), and Zeus, Apache is (by far) the slowest. In fact an SGI employee (Mike Abbott) has done some optimizations which made Apache run 3 (!) times faster on SPECWeb 96 on IRIX. It is our intention to make these patches public soon. [They are now online at oss.sgi.com/projects/apache.]
- When we tried to test our Apache patches on IRIX the expected 3x speedup was easy to achieve. However when we ported our changes to Linux (2.2.5), we were surprised to find that we don't even get into the scalability game. A 200ms delay in connection establishment in the TCP/IP stack in Linux 2.x was preventing Apache to respond to anything more than 5 connections per second. We have been in touch with David Miller on this and sent him a patch by Feng Zhou which eliminates this bottleneck. This patch ... has made it into the [2.2.7] kernel [after some modifications by David Miller]. So now we are back into optimizing Apache. ..." [The patch affected the files tcp.h and tcp_ipv4.c, e.g. it notes that Nagle shouldn't be used for the final FIN packet.]
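The arithmetic behind "200ms delay" and "5 connections per second" can be sketched as follows. This assumes connection setups are handled strictly one after another, each paying the full delay; the quote does not spell out the mechanism, so treat that as an assumption. The function name is hypothetical.

```python
def max_sequential_rate(delay_seconds: float) -> float:
    """Upper bound on connections per second when each new connection
    must wait out a fixed delay before the next one can proceed."""
    if delay_seconds <= 0:
        raise ValueError("delay must be positive")
    return 1.0 / delay_seconds


# The 200 ms figure is from the SGI report quoted above.
rate_with_delay = max_sequential_rate(0.200)

if __name__ == "__main__":
    print("connections per second with a 200 ms delay:", rate_with_delay)
```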
(6 May 1999): "As for our changes to Apache. They are much more significant and make Apache run 3 times faster on SPECWeb 96. I just talked to the author and made sure we are releasing them to the Apache group when we're ready. We just don't want to be too hasty in this. We want to make it right, clean and accepted by the Apache guys. The 'patch' here is pretty big. ... It includes:
- A page cache
- Performance tunables adjusted to the max
- Changed some critical algorithms with faster ones (this is long, I hope to have more details when we release).
- Eliminated system calls where they weren't needed (so Apache is less dependent on the underlying OS)"
"I've seen load fall off well below 160 clients (for eg., 3 specweb clients with 30 processes each). I can't explain it yet, especially the fact that the performance stays low even after the test concludes. This behavior seems limited to apache."
See also Karthik's post on linux-kernel and its followups. He has reported this as a bug to the Apache bug tracking system; see Apache bug #4268.
"The mystery continues. I got round to trying out 1.3.6 again this evening, this time on 2.2.7. I did _not_ see the performance drop off. Just to verify, I rechecked on the stock 2.2.5 kernel, and the drop off is there.
So _something_ has been fixed between 2.2.5 and 2.2.7 that has made this problem go away. I'll keep plugging away as I get spare time to see if I can get the problem to occur. ...
Compiling 1.3.6 in a reasonable way, along with a few minor tweaks in linux 2.2.7 gives me about 2-4 times the peak performance of the default 1.3.4 on 2.2.5. I simply compiled [Apache] with pretty much all modules disabled.... I'm using the highperformance-conf.dist config file from the distribution."
This sounds rather like the behavior Mindcraft reported ("After the restart, Apache performance climbed back to within 30% of its peak from a low of about 6% of the peak performance").
(See also Nov 2000 measurements by Andrew Morton (andrewm@uow.edu.au); post 1, post 2, and Linus' reply.)
"When running a SPECWeb96 strobe run on Alpha/linux, I found that when the CPU is pegged, 18% of the time is spent in the scheduler."
(Russinovich talked about something like this in his critique of Linux.)
This post started a very lively thread in linux-kernel (now on its second week). Looks like the scheduler (and possibly Apache) are in for some changes.
... The main bug with the web benchmark remains. The way Linux and Apache 'cooperate', there's a lot of trouble with the 'thundering herd' problem. That is, when a signal comes in, all processes are woken up and the scheduler has to select one from the dozens of new runnable processes.... The real solution is to go from wake-all semantics to a wake-one style so we won't have the enormous runqueues the guy at DEC [Phillip Ezolt] experienced. The good news is that it's a simple patch that can probably be fixed within a few days...
Apache uses file locking to serialise access to the accept call. This can be very expensive on some systems. I haven't found the time to run the numbers on Linux yet for the 10 or so different server models that can be employed to see which is the most efficient. Check Stephens UNPv1 2nd Edition Chapter 27 for details.
I released a new andrea-patch against 2.2.8. This new one has my new wake-one on accept(2) strightforward code (but to get the improvement you must make sure that your apache tasks are sleeping in accept(2), a strace -p `pidof apache` should tell you that).
The patch is linked to from here.
... on every new TCP connection, there will be 2 spurious and unnecessary wakeups, and these originate in the write_space socket callback because as we free up the SYN frames we wakeup listening socket sleepers. I've been working today on solving this very issue.
note that pre-2.3.1 already has a wake-one implementation for accept() ... and more coming up.
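The difference between wake-all and wake-one semantics discussed above can be illustrated with a toy counting model. This is not kernel code: it simply assumes a fixed pool of processes sleeping on the listening socket, that exactly one of them can take each new connection, and that the rest go straight back to sleep. All names are hypothetical.

```python
def count_wakeups(num_sleepers: int, num_connections: int, wake_one: bool) -> dict:
    """Count process wakeups for a series of incoming connections.

    wake_one=False models wake-all: every sleeper is made runnable per event.
    wake_one=True models wake-one: only a single sleeper is made runnable.
    """
    total = 0
    wasted = 0
    for _ in range(num_connections):
        woken = 1 if wake_one else num_sleepers
        total += woken
        wasted += woken - 1  # only one process can accept the connection
    return {"total": total, "wasted": wasted}


if __name__ == "__main__":
    # 50 sleeping servers and 100 connections are made-up illustrative numbers.
    print("wake-all:", count_wakeups(50, 100, wake_one=False))
    print("wake-one:", count_wakeups(50, 100, wake_one=True))
```

In this model the wasted wakeups grow with the number of sleeping servers under wake-all and stay at zero under wake-one, which is the run-queue growth the quotes above describe.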
I've been doing some more SPECWeb96 tests, and with Andrea's patch to 2.2.8 (ftp://ftp.suse.com/pub/people/andrea/kernel/2.2.8_andrea1.bz)
**On identical hardware, I get web-performance nearly identical to Tru64!** ...
Tru64 ~4ms
2.2.5 ~100ms
2.2.8 ~9ms
2.2.8_a ~4ms
... Time spent in schedule has decreased, as shown by this Iprobe data: ...
The number of SPECWeb96 MaxOps per second have jumped has well.
**Please, put the wakeone patch into the 2.2.X kernel if it isn't already.**
Larry Sendlosky tried this patch, and says:
Your 2.2.8 patch really helps apache performance on a single cpu system, but there is really no performance improvement on a 2 cpu SMP system.
See also:
Andi Kleen replied:
One culprit is most likely that the data copy for TCP sending runs completely serialized. This can be fixed by replacing the
skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
in tcp.c:tcp_do_sendmsg with
unlock_kernel();
skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
lock_kernel();
The patch does not violate any locking requirements in the kernel...
[Note: Andi's TCP unlocking fix appears to be in 2.2.9-ac3.]
[To fix your connection refused errors,] try:
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max
Overall it should be clear that the current Linux kernel doesn't scale to CPUs for system load (user load is fine). I blame the Linux vendors for advertising it, although it is not true. ... Work to fix all these problems is underway [2.3 will be fixed first, then the changes will be backported to 2.2].
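Before raising the limits it can help to check what they currently are. The following is a small read-only sketch; the helper name is hypothetical, and it returns None when a file is missing or unreadable, since not every kernel version exposes both of the /proc entries named above.

```python
from pathlib import Path
from typing import Optional


def read_limit(path: str) -> Optional[int]:
    """Return the first integer stored in a /proc/sys-style file,
    or None if the file is missing, unreadable, or not numeric."""
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    fields = text.split()
    if not fields:
        return None
    try:
        return int(fields[0])
    except ValueError:
        return None


if __name__ == "__main__":
    for name in ("/proc/sys/fs/file-max", "/proc/sys/fs/inode-max"):
        print(name, "=", read_limit(name))
```

The sketch deliberately does not write to /proc; changing the values still requires root and the echo commands quoted above.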
Andrea Arcangeli responded describing his own version of this fix ( ftp://ftp.suse.com/pub/people/andrea/kernel/2.3.3_andrea2.bz2 ) as less cluttered:
If you look at my patch (the second one, in the first one I missed the reaquire_kernel_lock done before returning from schedule, woops :) then you'll see my approch to address the unlock-during-uaccess. My patch don't change tcp/ip ext2 etc... but it touch only uaccess.h and usercopy.c. I don't like to put unlock_kernel all over the place.
Juergen Schmidt, 26 May 1999, on linux-kernel and new-httpd (Linux/Apache and SMP - my fault), retracted his earlier problem report:
I reported "disastrous" performance for Linux and Apache on a SMP system.
To doublecheck, I've downloaded a clean kernel source (2.2.8 and 2.2.9) and had to realize, that those do *not* show the reported penalty when running on SMP systems.
My error was to use the installed kernel sources (which I patched from 2.2.5 to 2.2.8 - after seeing the first very bad results). But those sources already had been modified before the machine came to me. Should have thrown them away in the first place :-( ...
Please take my excuses for this confusion.
Others have reported modest performance gains (20% or so) with Andrea's SMP fix, but only when serving largish files (100 kilobytes).
Juergen has now finished his testing. Unfortunately, he neglected to compile Apache with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT, which (according to Andrea) significantly hurt Apache performance.
If Juergen missed that, it means it's too hard to figure out. To make it easier to get good performance in the future, we need the wake-one patch added to a stable kernel (say, 2.2.10), and we need Apache's configuration script to notice that the system is being compiled for 2.2.10 or later, and automatically select SINGLE_LISTEN_UNSERIALIZED_ACCEPT.
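The kind of check being asked for might look like the sketch below. It is only an illustration: Apache's real configuration script is a shell script, the function names here are hypothetical, and the 2.2.10 threshold is just the "say, 2.2.10" guess from the paragraph above. A plain version comparison would also misjudge early 2.3 kernels that predate the wake-one code.

```python
import re
from typing import List, Optional, Tuple

FLAG = "-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT"


def parse_release(release: str) -> Optional[Tuple[int, int, int]]:
    """Extract (major, minor, patch) from a string like '2.2.9-ac3'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)), int(match.group(3)))


def accept_flags(release: str,
                 threshold: Tuple[int, int, int] = (2, 2, 10)) -> List[str]:
    """Return extra CFLAGS if the kernel is at or past the assumed threshold."""
    version = parse_release(release)
    if version is not None and version >= threshold:
        return [FLAG]
    return []


if __name__ == "__main__":
    import platform
    print(platform.release(), "->", accept_flags(platform.release()))
```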
Our typical webserver is a dual PII450 with 1G, and split httpd's, typically 300 static to serve the pages and proxy to 80-100 dynamic to serve the mod_perl adverts. Unneeded modules are diabled and hostname lookups turned off, as any sensible person would.
There's typically between one and three mod_perl hits/page on top of the usual dozen or so inline images...
The kernel (2.2.7) has MAX_TASKS upped to 4090, and the unlock_kernel/lock_kernel around csum_and_copy_from_user() in tcp_do_sendmsg that Andi Kleen suggested.
Performance is .. interesting. Load on the machine fluctuates between 10 and 120, while the user CPU goes from 15% (80% idle) to 180% (0% idle, machine *crawling*), about once every minute and a half. vmstat shows the number of processes in a run state to range from 0 (when load is low) to 30-40, and the static servers manage a mighty 60-70 peak hits/sec. Without the dynamic httpd's everything *flies*...
After being advised to try a kernel with wake-one support, he wrote back:
We're up with 2.3.3 plus Andi Kleen's tcp_do_sendmsg patch plus Apache sleeping in accept() on one production server, and comparing it against a 2.2.7 plus tcp_do_sendmsg patch plus Apache sleeping in flock(). Identical systems (dual PII450, 1G, two disk controllers).
As far as I can *tell*, the wake-one patch is definitely doing its stuff: the 2.2.7 machine still has cycles of load into three figures, and the 2.3.3 machine hasn't actully managed a load of 1 yet.
UNFORTUNATELY, observation suggests that the 2.3.3 machine/Apache combination is dropping/ignoring about one connection in ten, maybe more. (Network error: connection reset by peer.)
His next update, on May 25th, reads:
More progress from the bleeding edge:
(Reminder: the config here is split static/mod_perl httpd's, with a pretty CPU-intensive mod_perl script serving ads as an SSI as the probable bottleneck)
Linux kernel 2.2.9 plus the 2.2.9_andrea3 (wake-one) patch seems to do the trick: can handle hits at a speed which suggests it's pushing the adverser to close to its observed maximum. (As I said in a previous note, avoid 2.2.8 like the plague: it trashes HDs - see threads on linux-kernel for details.)
However... When it *does* get overstressed, BOY does it get overstressed. Once the idle CPU drops to zero (i.e. its spending most of its time processing advert requests, everything goes unpleasantly pearshaped, with a load of 400+, and the number of httpd's on both types of server *well* above MaxClients (in fact, suspiciously close to MaxClients + MinSpareServers). Spikes in demand can cause this, and once you get into this state, getting out again under the load of prgressively more backlogging requests is not easy: in fact from experience the only way is to which the machine out of the (hopefully short TTL) DNS round-robin while it dies down.
The potentially counterintuitive step at this point is to *REDUCE* MaxClients, and hope that the tcp Listen queue will handle a load surge. Experience suggests this does in fact work.
(Aside: this is a perfect case for using something like Eddieware's load balancing DNS).
... I'm having some big problems in which it appears that a single PII 400Mhz or a single AMD 400 will outrun a dual PII 450 at http requests from Apache. ...
Data for HTTP Server Tests: 100 1MByte mpeg files stored on local disks. Results:
- AMD 400Mghz K6, 128MB, Linux 2.0.36; handles 1000 simultaneous clients @ 57.6Kbits/sec.
- PII 400Mghz, 512MB, Linux 2.0.36; handles 1000 simultaneous clients @ 57.6Kbits/sec.
- Dual PII/450Mghz, 512MB, Linux 2.0.36 and 2.2.8; handles far less than 300 simultaneous @57.6Kbits/sec [and even then, clients were seeing 5 second connection delays, and 'reset by peer' and 'connection time out' errors]
I advised him to try 2.2.9_andrea3; he said he'd try it and report back.
See also comments on reducing interrupts under heavy load by Steve Underwood and Steven Guo.
See also Linus's "State of Linux" talk at Usenix '99 where he talks about the Mindcraft benchmark and SMP scalability.
See also SCT's Jan 2000 comments about progress in scalability.
Softnet is coming! Kernel 2.3.43 adds the new softnet networking changes. Softnet changes the interface to the networking cards, so every single driver needs updating, but in return network performance should scale much better on large SMP systems. (For more info, see Alexey's readme.softnet, his softnet-howto, or his Feb 15 post about how to convert old drivers.)
The Feb '00 thread Gigabit Ethernet Bottlenecks (especially its second week) has lots of interesting tidbits about what interrupt (and other) bottlenecks remain, and how they are being addressed in the 2.3 kernel.
Ingo Molnar's post of 27 Feb 2000 describes interrupt-handling improvements in the IA32 code in great detail. These will be moving into the core kernel in 2.5, it seems.
After upgrading to the 2.2 series we have from time to time experienced severe slow-downs on the TCP performance... The performance goes back to normal when I take down the interface and reinsert the eepro100 module into the kernel. After I've done that, the performance is fine for a couple of days or maybe weeks.
I've got 3 machines running 2.2.10 [with multiple] 3COM 3C905/905b PCI [cards]... After approximately 2 days of uptime, I will start to see ping times on the local lan jump to 7-20 seconds. As others have noticed, there is no loss -- just some damn high latency. ... It seems to be dependant upon the network load -- lighter loads lead to longer periods between problems. The problem ALSO is gradual -- it'll start at 4 second pings, then 7 second pings about 20 minutes later, than 30 minutes later it's up to 12-20 seconds.
What DID fix the problem was a private reply from someone elese (sorry about the credit, but i'm not in the mood to sieve 10k emails right now), to try the alpha version of the latest 3c59x.c driver from Donald Becker (http://cesdis.gsfc.nasa.gov/linux/drivers/vortex.html).
3c59x.c:v0.99L 5/28/99 is the version that fixed it, from ftp://cesdis.gsfc.nasa.gov/pub/linux/drivers/test/3c59x.c
Kernel issue #7: Scheduler
Phil Ezolt, 22 Jan 2000, in linux-kernel (
Re: Interesting analysis of linux kernel threading by IBM):
When I run SPECWeb96 tests here, I see both a large number of running process and a huge number of context switches. ... Here's a sample of the vmstat data:
[Phil's data is for the web server undergoing the SPECWeb96 test, which is an ES40 4 CPU alpha EV6 running Redhat 6.0 w/kernel v2.2.14 and Apache-v1.3.9 w/SGI performance patches; the interfaces receiving the load are two ACENic gigabit ethernet cards.]
   procs                      memory    swap         io      system         cpu
 r  b  w   swpd    free    buff   cache  si  so   bi    bo    in    cs  us  sy  id
...
24  0  0   2320 2066936  590088 1061464   0   0    0     0  8680  7402   3  96   1
24  0  0   2320 2065752  590664 1061464   0   0    0  1095 11344 10920   3  95   1
Notice. 24 running process and ~7000 context switches.
That is a lot of overhead. Every second, 7000*24 goodnesses are calculated. Not the (20*3) that a desktop system sees. This is a scalability issue. A better scheduler means better scalability.
Don't tell me benchmark data is useless. Unless you can give me data using a real system and where it's faults are, benchmark data is all we have.
SPECWeb96 pushes Linux until it bleeds. I'm telling you where it bleeds. You can fix it or bury your head in the sand. It might not be what your system is seeing today, but it will be in the future.
Would you rather fix it now or wait until someone else how thrown down the performance gauntelet?
...
Here's a juicy tidbit. During my runs, I see 98% contention on the [2.2.14] kernel lock, and it is accessed ALOT. I don't know how 2.3.40 compares, because I don't have big memory support for it. Hopefully, Andrea will be kind enough give me a patch, and then I can see if things have improved.
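Phil's scheduler arithmetic is easy to reproduce. The sketch below assumes, as his "7000*24" implies, that each context switch evaluates the goodness of every runnable process once. Reading his "(20*3)" as 20 context switches per second with 3 runnable processes is an assumption; he does not say which number is which. The function name is hypothetical.

```python
def goodness_per_second(context_switches_per_second: int, runnable: int) -> int:
    """Goodness evaluations per second if every context switch scans
    the whole run queue once."""
    return context_switches_per_second * runnable


server = goodness_per_second(7000, 24)   # figures from the vmstat sample
desktop = goodness_per_second(20, 3)     # assumed reading of "(20*3)"

if __name__ == "__main__":
    print("server: ", server)
    print("desktop:", desktop)
    print("ratio:  ", server // desktop)
```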
Kernel issue #8: SMP Bottlenecks in 2.4 kernel
Manfred Spraul, April 21, 2000, in linux-kernel (
[PATCH] f_op->poll() without lock_kernel()):
kumon@flab.fujitsu.co.jp noticed that select() caused a high contention for the kernel lock, so here is a patch that removes lock_kernel() from poll(). [tested] with 2.3.99-pre5.
There was some discussion about whether this was wise at this late date, but Linus and David Miller were enthusiastic. Looks like one more bottleneck bites the dust.
On 26 April 2000, kumon@flab.fujitsu.co.jp posted benchmark results in Linux-Kernel with and without the lock_kernel() in poll(). The followups included a kernel patch to improve checksum performance and a patch for Apache 1.3 to force it to align its buffers to 32-word boundaries. The latter patch, by Dean Gaudet, earned praise from Linus, who relayed rumors that this can speed up SPECWeb results by 3%. This was an interesting thread.
See also LWN's coverage, and the paragraph below, in which Kumon presents some benchmark results and another patch.
Kernel issue #9: csum_partial_copy_generic
kumon@flab.fujitsu.co.jp, 19 May 2000, in linux-kernel (
[PATCH] Fast csum_partial_copy_generic and more
) reports a 3% reduction in total CPU time compared to 2.3.99-pre8
on i686 by optimizing the cache behavior of csum_partial_copy_generic.
The workload was ZD's WebBench.
He adds
The benchmark we used has almost same setting as the MINDCRAFT ones, but the apache setting is [changed] slightly not to use symlink checking.
Note that in ZD's benchmarks with 2.2.6, a 4 CPU system only achieved a 1.5x speedup over a single CPU. Kumon is reporting a > 2x speedup. This appears to be about the same speedup NT 4.0sp3 achieved with 4 CPUs at that number of clients (24). It's encouraging to hear that things may have improved in the 11 months since the 2.2.6 tests. When I asked him about this, Kumon said
We used maximum of 24 independent clients and number of apache processes is 16. A four-way XEON procesor system is used, and the performance is twice and more than a single CPU performance.
Major improvement is between pre3 and pre5, poll optimization. Until pre4 (I forget exact version), kernel-lock prevents performance improvement.
If you can retrieve l-k mails around Apr 20-25, the following mails will help you understand the background.
subject: namei() query
subject: [PATCH] f_op->poll() without lock_kernel()
subject: lockless poll() (was Re: namei() query)
subject: "movb" for spin-unlock (was Re: namei() query)
On 4 Sept 2000, kumon posted again,
noting that his change still hadn't made it into the kernel.
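For readers unfamiliar with what a combined copy-and-checksum routine does, here is a plain, unoptimized sketch of the idea: copy a buffer and accumulate the 16-bit one's-complement Internet checksum sum in the same pass. It says nothing about the cache optimizations in Kumon's patch. As a simplification it folds the sum to 16 bits before returning, and the function name is hypothetical.

```python
from typing import Tuple


def copy_and_checksum(src: bytes) -> Tuple[bytes, int]:
    """Copy src and return (copy, folded 16-bit one's-complement sum).

    The transmitted checksum would be the bitwise complement of the sum.
    """
    dst = bytearray(len(src))
    total = 0
    # Walk the buffer one big-endian 16-bit word at a time.
    for i in range(0, len(src) - 1, 2):
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        total += (src[i] << 8) | src[i + 1]
    if len(src) % 2:
        # An odd trailing byte is padded with a zero low byte.
        dst[-1] = src[-1]
        total += src[-1] << 8
    # Fold the carries back in until the sum fits in 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return bytes(dst), total


if __name__ == "__main__":
    data = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
    copied, folded = copy_and_checksum(data)
    print(hex(folded), hex(~folded & 0xFFFF))
```

Doing both jobs in one loop means the data is read from memory only once, which is why the cache behavior of this one routine can show up in whole-system CPU time.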
Kernel issue #10: getname(), poll() optimizations
On 22 May 2000, Manfred Spraul
posted a patch on linux-kernel
which optimized kmalloc(), getname(), and select() a bit, speeding up apache
by about 1.5% on 2.3.99-pre8.
Kernel issue #11: Reducing lock contention, poll overhead in 2.4
On 30 May 2000,
Alexander Viro posted a patch
that got rid of a big lock in close_flip() and _fput(), and asked for testing.
kumon ran a benchmark, and reported:
I measured viro's ac6-D patch with WebBench on 4cpu Xeon system. I applied to 2.4.0-test1 not ac6. The patch reduced 50% of stext_lock time and 4% of the total OS time. ... Some part of kmalloc/kfree overhead is come from do_select, and it is easily eliminated using small array on a stack.
kumon then posted a patch that avoids kmalloc/kfree in select() and poll() when # of fd's involved is under 64.
The more recent test4 and test5pre2 don't fair quite so well. They handle 2 clients on a 128 Meg server fine, so they're doing better than 2.2 but they choke and go seek bound with 4 clients. So something has definitely taken a turn for the worse since test1-ac22.
Here's an update. The *only* 2.4 kernel versions that could handle 5 clients were 2.4.0-test1-ac22-riel and 2.4.0-test1-ac22-class 5+; everything before and after (up to 2.4.0-test5pre4) can only handle 2.
On 26 Sept 2000, Robert Cohen posted an update which included a simple program to demonstrate the problem, which appears to be in the elevator code. Jens Axboe (axboe@suse.de) responded that he and Andrea had a patch almost ready for 2.4.0-test9-pre5 that fixes this problem.
On 4 Oct 2000, Robert Cohen posted
an update
with benchmark results for many kernels, showing that the problem still exists in
2.4.0-test9.
Kernel issue #13: Fast Forwarding / Hardware Flow Control
On 18 Sept 2000, Jamal (hadi@cyberus.ca)
posted a note in Linux-kernel
describing proposed changes to the 2.4 kernel's network driver interface;
the changes add hardware flow control and several other refinements. He says
Robert Olson and I decided after the OLS that we were going to try to hit the 100Mbps(148.8Kpps) routing peak by year end. I am afraid the bar has been raised. Robert is already hitting with 2.4.0-test7 ~148Kpps with a ASUS CUBX motherboard carrying PIII 700 MHZ coppermine with about 65% CPU utilization. With a single PII based Dell machine i was able to get a consistent value of 110Kpps.So the new goal is to go to about 500Kpps ;-> (maybe not by year end, but surely by that next random Linux hacker conference)
A sample modified tulip driver (hacked by Alexey for 2.2 and mod'ed by Robert and myself over a period of time) is supplied as an example on how to use the feedback values. ...
I believe we could have done better with the mindcraft tests with these changes in 2.2 (and HW FC turned on).
[update] BTW, I am informed that Linux people were _not_ allowed to change the hardware for those tests, so I dont think they could have used these changes if they were available back then.
echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range
to avoid bumping into this situation when trying to simulate large numbers of clients with a small number of client machines.
On 2 April 2000, Mr. Horikawa confirmed that increasing the local port range with the above command solved the problem.
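The reason the port range matters: every simultaneous connection from one client address to the same server address and port needs its own local port, so the size of the range caps how many clients a single load-generating machine can simulate. A small sketch of that arithmetic follows; the function name is hypothetical, and the narrower "1024 4999" range in the usage lines is only an illustrative comparison value, not something taken from the reports above.

```python
def port_count(range_text: str) -> int:
    """Number of local ports in an ip_local_port_range-style string
    such as '1024 65535' (both ends inclusive)."""
    low, high = (int(field) for field in range_text.split())
    if low > high:
        raise ValueError("low end of range exceeds high end")
    return high - low + 1


widened = port_count("1024 65535")   # the range set by the echo command
narrow = port_count("1024 4999")     # illustrative narrower range (assumed)

if __name__ == "__main__":
    print("ports with widened range:", widened)
    print("ports with narrow range: ", narrow)
```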
Suggestions for future benchmarks
Become familiar with linux-kernel
and the Apache mailing lists as well as the Linux newsgroups on Usenet (try
DejaNews power searches
in forums matching '*linux*').
Post your proposed configuration and see whether people agree with it.
Also, be open about your benchmark; post intermediate results, and
see if anyone has suggestions for improvements. You should probably
expect to spend a week or so mulling over ideas with these mailing lists
during the course of your tests.
If possible, use a modern benchmark like SPECWeb99 rather than the simple ones used by Mindcraft.
It might be interesting to inject latency into the path between the server and the clients to more realistically model the situation on the Internet.
Benchmark both single and multiple CPUs, and single and multiple Ethernet interfaces, if possible. Be aware that the networking performance of version 2.2.x of the Linux kernel does not scale well as you add more CPUs and Ethernet cards. This applies mostly to static pages and cached dynamic pages; noncached dynamic pages usually take a fair bit of CPU time, and should scale very well when you add CPUs. If possible, use a cache to save commonly generated pages; this will bring the dynamic page speeds closer to the static page speeds.
When testing dynamic content: Don't use the old model of running a separate process for each request; nobody running a big web site uses that interface anymore, as it's too slow. Always use a modern dynamic content generation interface (e.g. mod_perl for Apache).
Here are some notes if you want to see what people going for the utmost were trying in June:
If you are going to do bench I would like if you would bench also the patch below.
ftp://e-mind.com/pub/andrea/kernel-patches/2.2.9_andrea-perf1.gz
env CFLAGS='-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT' ./configure
find /www/htdocs -type f -print | sed -e 's/.*/mmapfile &/' > mmap.conf
and including mmap.conf in your Apache config file.
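If find and sed aren't handy, the same mmap.conf can be produced with a short script. This is a sketch of an equivalent, not a replacement for the one-liner: the function names are hypothetical, it skips symbolic links the way find -type f does, and the order of the lines may differ from find's.

```python
import os
from typing import List


def mmap_lines(doc_root: str) -> List[str]:
    """Return one 'mmapfile <path>' line per regular file under doc_root."""
    lines = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Match 'find -type f': regular files only, no symlinks.
            if os.path.isfile(path) and not os.path.islink(path):
                lines.append("mmapfile " + path)
    return lines


def write_mmap_conf(doc_root: str, conf_path: str) -> int:
    """Write the directive lines to conf_path; return how many were written."""
    lines = mmap_lines(doc_root)
    with open(conf_path, "w") as conf:
        for line in lines:
            conf.write(line + "\n")
    return len(lines)
```

Calling write_mmap_conf("/www/htdocs", "mmap.conf") would mirror the command above; the resulting file still has to be included from the Apache config by hand.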