I would never have guessed that my laptop would beat them all.
Does this mean that Linux x32 is the fastest OS in the world, and that 32-bit is faster than 64-bit, and that clockspeed means everything, and dual cores are better than quad and single cores?
I want to find out which system on the market is fastest at linear and parallel processing, and I really hope it will be M5.
| System ID | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
| System Name | IBM System p5 505 | IBM System x 3650 | Lenovo ThinkPad T61p | Siipi Blackhawk | IBM System p 570 | IBM System x 346 | Siipi Falcon | Siipi Osprey |
| System Specifications | POWER5 1.9GHz | 2x Intel Xeon Quad Core 2.66GHz | Intel Dual Core 2.66GHz | Intel Quad Core 2.4GHz | 4x POWER6 4.7GHz | 2x Intel Xeon 2.4GHz | Intel Pentium 4 3.2GHz | Intel Dual Core 1.8GHz |
| Operating System | AIX 5L 5.3 | Windows 2003 Enterprise R2 x64 | Windows XP SP2 x32 | Windows XP SP2 x32 / Linux Ubuntu 7.10 x32 | AIX 6.1 | Windows 2003 Enterprise x32 | Windows 2003 Enterprise x32 | Windows XP SP2 x32 |
| Speedtest 1.0 32-bit | - | 25.125s | 26.640s (14.312s+12.328s) | 27.781s (Windows); 48.010s (Linux, default); 21.590s (Linux, optimized 16.660s+4.930s) | - | 58.406s | 45.312s | 38.296s |
| Speedtest 1.0 64-bit | 199.391s (default); 51.450s (optimized) | 122.937s | - | - | coming in 2008-03 | - | - | - |
Maybe I should explain a bit further why these systems were used in the test:
M1: the cheapest System p you can get; we wanted a test server for evaluating the production server (possibly a JS22 or 570)
M2: our currently fastest System x server
M3: my office laptop
M4: my home gaming PC
M5: a vision of how our first production System p could look
M6: our old production server
M7: a low cost server (a single high GHz core with HyperThreading, i.e. a semi-dual core)
M8: a low cost gaming PC (low GHz dual core)
Speedtest 1.0 source code (all-bit, all platforms) is here:
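#include <stdio.h>
#include <time.h>

// Minimal class: the constructor and destructor each perform one store.
class testclass
{
public:
    int x;
    testclass(void) {
        x=1;
    }
    ~testclass(void) {
        x=0;
    }
};

int main(int argc, char **argv)
{
    double n=0;
    long i=0;
    long t1=0;   // start of the current phase, in clock() ticks
    long t2=0;   // end of the current phase
    long t1t=0;  // start of the whole run
    long t2t=0;  // end of the whole run
    t1t=clock();
    printf("Speedtest 1.0 (c) 2008 Siipi\n");
    // Phase 1: add 0.00001 until n reaches 100000.0 (roughly 10 billion additions).
    // Where long is 32 bits, i overflows during this loop, so the printed i is not 10 billion.
    printf("Counting 10 billion floating points...\n");
    t1=clock();
    while(n<100000.0)
    {
        n+=0.00001;
        i++;
    }
    t2=clock();
    printf("Done. i=%ld, n=%f, time=%fs.\n",i,n,
        (double)(t2-t1)/CLOCKS_PER_SEC);
    // Phase 2: allocate and free one heap object per iteration.
    // The loop bound is 100000000 (100 million), although the message says 1 billion.
    printf("Creating and deleting 1 billion class objects...\n");
    t1=clock();
    i=0;
    while(i<100000000)
    {
        testclass *a=new testclass();
        delete(a);
        i++;
    }
    t2=clock();
    printf("Done. i=%ld, time=%fs.\n",i,
        (double)(t2-t1)/CLOCKS_PER_SEC);
    t2t=clock();
    printf("Total time=%fs.\n",(double)(t2t-t1t)/CLOCKS_PER_SEC);
    return(0);
}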
To compile and run Speedtest 1.0 under AIX/Linux/MacOSX/Sun/Cray/IRIX/AmigaOS, enter the following commands:
g++ st.cpp
time ./a.out
To compile and run Speedtest 1.0 under Windows, enter the following commands:
(Launch Visual Studio 2005 C++, select: File/New/Project From Existing Code/Console Application)
(Select Release instead of Debug)
(Choose Build/Build Solution)
\programs\rktools\ntimer release\st.exe (ntimer.exe comes with the Windows 2003 Resource Kit Tools)
You don't necessarily need to run Speedtest 1.0 with time/ntimer; I only used them to double-check that my own time measurement was working correctly.
The program output will look like this:
Speedtest 1.0 (c) 2008 Siipi
Counting 10 billion floating points...
Done. i=1410063201, n=100000.000003, time=15.078000s.
Creating and deleting 1 billion class objects...
Done. i=100000000, time=12.703000s.
Total time=27.781000s.
Unfortunately I don't have access to every system that exists, but here are some precompiled binaries:
The source code can also be downloaded, to make things a bit easier on systems which have no web browser installed: ftp://ftp.siipi.com/st.cpp.
I would be glad to know if someone's server/workstation can run this in under 21s.
I managed to get the POWER5 code to run twice as fast with the following compiler options:
g++ -O3 -fomit-frame-pointer -fstrict-aliasing -mcpu=power5 st.cpp
The Linux code also ran more than twice as fast with these options:
g++ -O3 -mtune=pentium4 st.cpp
Furthermore, the AIX programs might also run faster when compiled with IBM VisualAge AIX 6.0 C++ (especially with the XL C/C++ add-on) or Intel C++. That said, Linux GNU C++ beats Microsoft Visual Studio 2005 C++ in optimized machine code speed, which also surprised me a great deal.
T, Monday, 07 January 2008 19:39:13 EET
Http Status Code: 404
Reason: File not found or unable to read file

Mika, Monday, 07 January 2008 19:43:39 EET
Fixed the downloads; the Notes table cells needed a cut/paste again...

issei, Friday, 25 January 2008 23:26:30 EET
How do you expect to find the relative performance of a dual core (or even quad core) system with one process that isn't even threaded? Your results aren't surprising at all, since raw clock frequency obviously surpasses number of cores in single-threaded cases.

Mika, Thursday, 31 January 2008 01:20:54 EET
Yeah, that is the common counterargument against speed tests. My speed test obviously depends mostly on raw CPU clock speed, but the fact is that many programs work that way. Of course there are recommendations to use multi-threading when writing programs, but a large number of programs are still single-threaded and therefore run fastest on a high-clock CPU. However, the reality is not that black and white, since the OS can also run one single-threaded program on one core and another one on another core. AIX 6.1 can even run single-threaded programs on virtual cores, not limited or sliced by physical CPU borders.

Sebastian, Sunday, 08 June 2008 00:23:56 EET
This test doesn't measure anything apart from the speed and size of the processor caches, so you will not learn anything from it.
On most superscalar processors the first loop runs entirely within the execution pipeline, while the second does not thrash the processor cache in any way, so it does not mean anything either.
Sorry, but to measure something that is performance related, you will have to use more sophisticated testing to get an idea of the overall system performance, since real world applications do not depend only on a few hundred kbytes of processor cache.