Very quick and dirty benchmark to test new Linux server performance:
[root@centos run]# cat zbench.linux
#!/bin/ksh
# wget http://www.bitmover.com/lmbench/lmbench3.tar.gz
# ps -e -o "pid,ppid,user,group,cpu,time,etime,pcpu,vsz,thcount,args" | sort -nrk 8 | head
# Quick and dirty bc benchmark
(
time ksh loopbench 1000 ./command1
echo "===================================================================================="
for i in {1..20}; do cat lmbench-3.0-a9.tar >> big; done
time ksh loopbench 100 ./command2
rm big
echo "===================================================================================="
# Stream - Memory bandwidth benchmark
stream/stream
echo "===================================================================================="
# LMbench - low level benchmark
lmbench-3.0-a9/bin/i686-pc-linux-gnu/mhz
lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream
lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream -P 10
lmbench-3.0-a9/bin/i686-pc-linux-gnu/lat_mem_rd 512
echo "===================================================================================="
cd javabench; ksh -x run.sh
) 2>&1 | tee zbench.`hostname`
[root@centos run]#
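The script assumes a few things already sit in the run directory: a STREAM binary under ./stream, a compiled lmbench tree and the javabench classes. For STREAM, a build roughly like this is usually enough (the source URL is the upstream default from the University of Virginia; the compiler flags are just an assumption, tune to taste):
mkdir -p stream && cd stream
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c   # official STREAM source
gcc -O2 stream.c -o stream                                 # plain single-threaded build
cd ..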
[root@centos run]# cat loopbench
#!/bin/ksh
date
x=1
while [ $x -le $1 ]
do
(
$2 $3 $4 $5 $6 $7 $8 $9 &
)
let x=$x+1
done
echo Waiting...
wait
date
[root@centos run]#
[root@centos run]# cat command1
bc > /dev/null <<!
sqrt(1234^1234)
!
[root@centos run]#
[root@centos run]# cat command2
gzip -c big > /dev/null
[root@centos run]#
[root@centos run]# cat javabench/run.sh
time java -server -Xmx512M ackermann 13
time java -server fibo 45
time java -server hash2 3000
time java -server -Xmx512M hash 3000000
time java -server -Xmx512m heapsort 10000000
time java -server matrix 100000
time java -server methcall 1000000000
time java -server nestedloop 45
time java -server objinst 100000000
time java -server random 300000000
time java -server sieve 100000
time java -server -Xmx450m strcat 10000000
[root@centos run]#
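run.sh expects the benchmark classes to be compiled already; assuming the .java sources live in the same javabench directory alongside run.sh, preparing them is just:
cd javabench
javac *.java    # assumption: one source file per benchmark, no external dependencies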
Sample run from a VM on HP N40L with 2 CPU cores:
[root@centos run]# cat zbench.centos.vm_on_n40l-2c
Tue May 29 15:57:57 EST 2012
Waiting...
Tue May 29 16:05:40 EST 2012
real 7m43.09s
user 0m0.13s
sys 0m1.23s
====================================================================================
Tue May 29 16:05:53 EST 2012
Waiting...
Tue May 29 16:06:42 EST 2012
real 0m49.47s
user 0m0.02s
sys 0m0.15s
====================================================================================
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 17870 microseconds.
(= 17870 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 1678.2010 0.0195 0.0191 0.0199
Scale: 1643.5473 0.0231 0.0195 0.0482
Add: 1895.8864 2.2737 0.0253 5.2402
Triad: 1777.4551 2.2593 0.0270 5.0704
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
====================================================================================
1482 MHz, 0.6748 nanosec clock
STREAM copy latency: 4.30 nanoseconds
STREAM copy bandwidth: 3718.93 MB/sec
STREAM scale latency: 4.27 nanoseconds
STREAM scale bandwidth: 3750.79 MB/sec
STREAM add latency: 9.55 nanoseconds
STREAM add bandwidth: 2513.00 MB/sec
STREAM triad latency: 8.86 nanoseconds
STREAM triad bandwidth: 2708.24 MB/sec
STREAM copy latency: 33.81 nanoseconds
STREAM copy bandwidth: 4733.02 MB/sec
STREAM scale latency: 33.75 nanoseconds
STREAM scale bandwidth: 4741.35 MB/sec
STREAM add latency: 55.61 nanoseconds
STREAM add bandwidth: 4316.11 MB/sec
STREAM triad latency: 51.38 nanoseconds
STREAM triad bandwidth: 4671.42 MB/sec
"stride=128
0.00049 2.053
0.00098 2.054
0.00195 2.054
0.00293 2.053
0.00391 2.052
0.00586 2.039
0.00781 2.038
0.01172 2.038
0.01562 2.039
0.02344 2.036
0.03125 2.038
0.04688 2.040
0.06250 2.042
0.09375 10.448
0.12500 10.452
0.18750 10.456
0.25000 10.470
0.37500 10.464
0.50000 10.476
0.75000 18.414
1.00000 31.156
1.50000 43.502
2.00000 49.361
3.00000 49.480
4.00000 49.587
6.00000 50.297
8.00000 50.249
12.00000 50.312
16.00000 51.070
24.00000 50.377
32.00000 50.364
48.00000 49.446
64.00000 49.434
96.00000 50.209
128.00000 50.186
192.00000 49.416
256.00000 49.358
384.00000 49.380
512.00000 49.316
====================================================================================
+ bin/java -server -Xmx512M ackermann 13
real 0m1.27s
user 0m0.75s
sys 0m0.08s
+ bin/java -server fibo 45
1836311903
real 0m16.26s
user 0m15.98s
sys 0m0.04s
+ bin/java -server hash2 3000
1 9999 3000 29997000
real 0m6.78s
user 0m6.67s
sys 0m0.05s
+ bin/java -server -Xmx512M hash 3000000
299999
real 0m18.71s
user 0m21.32s
sys 0m1.63s
+ bin/java -server -Xmx512m heapsort 10000000
0.9999928555
real 0m13.83s
user 0m13.14s
sys 0m0.23s
+ bin/java -server matrix 100000
270165 1061760 1453695 1856025
real 0m14.90s
user 0m14.59s
sys 0m0.03s
+ bin/java -server methcall 1000000000
true
false
real 0m2.80s
user 0m2.64s
sys 0m0.05s
+ bin/java -server nestedloop 45
-286168967
real 0m20.02s
user 0m19.45s
sys 0m0.04s
+ bin/java -server objinst 100000000
false
true
false
true
false
true
true
false
false
false
true
true
true
real 0m2.60s
user 0m2.32s
sys 0m0.21s
+ bin/java -server random 300000000
92.485425240
real 0m18.52s
user 0m22.21s
sys 0m8.99s
+ bin/java -server sieve 100000
Count: 1028
real 0m13.41s
user 0m13.05s
sys 0m0.03s
+ bin/java -server -Xmx450m strcat 10000000
60000000
real 0m1.95s
user 0m1.20s
sys 0m0.71s
The same harness runs on other UNIX platforms as well - sample run from a Sun M5000 with 32 CPU cores:
[root@centos run]# cat zbench.vfecos031.m5000-32c
Wednesday, 3 March 2010 2:06:17 PM EST
Waiting...
Wednesday, 3 March 2010 2:06:31 PM EST
====================================================================================
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7607 microseconds.
(= 7607 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 2851.0255 0.0113 0.0112 0.0114
Scale: 2812.7274 0.0114 0.0114 0.0114
Add: 2915.6217 0.0165 0.0165 0.0165
Triad: 2951.3104 0.0163 0.0163 0.0164
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
=== Warmup 0
PID 0 begins to allocate (0 secs since last report)
PID 0 checks time quanta (0 secs since last report)
PID 0 begins to init array (0 secs since last report)
PID 0 waits on lock (0 secs since last report)
PID 0 starts main loop (0 secs since last report)
PID: 0 Copy: 6122.056921308876MB/sec Scale: 3389.4736478254817MB/sec Add: 6422.2609611035305MB/sec Triad: 5877.310779371098MB/sec
PID 0 exits (0 secs since last report)
=== Warmup 1
PID 1 begins to allocate (0 secs since last report)
PID 1 checks time quanta (0 secs since last report)
PID 1 begins to init array (0 secs since last report)
PID 1 waits on lock (0 secs since last report)
PID 1 starts main loop (0 secs since last report)
PID: 1 Copy: 8034.13325352139MB/sec Scale: 7171.66593641464MB/sec Add: 9314.964883724713MB/sec Triad: 8413.668242323325MB/sec
PID 1 exits (0 secs since last report)
=== Warmup 2
PID 2 begins to allocate (0 secs since last report)
PID 2 checks time quanta (0 secs since last report)
PID 2 begins to init array (0 secs since last report)
PID 2 waits on lock (0 secs since last report)
PID 2 starts main loop (0 secs since last report)
PID: 2 Copy: 7972.0970691415305MB/sec Scale: 7009.859652401344MB/sec Add: 8935.22132819245MB/sec Triad: 8572.960130940737MB/sec
PID 2 exits (0 secs since last report)
=== Warmup 3
PID 3 begins to allocate (0 secs since last report)
PID 3 checks time quanta (0 secs since last report)
PID 3 begins to init array (0 secs since last report)
PID 3 waits on lock (0 secs since last report)
PID 3 starts main loop (0 secs since last report)
PID: 3 Copy: 8117.707266064842MB/sec Scale: 6927.911341415957MB/sec Add: 8692.506164743978MB/sec Triad: 8738.40000813827MB/sec
PID 3 exits (0 secs since last report)
=== main waits 0 secs, seen 0/1 reach barrier
PID 0 begins to allocate (0 secs since last report)
PID 0 checks time quanta (0 secs since last report)
PID 0 begins to init array (0 secs since last report)
PID 0 waits on lock (0 secs since last report)
=== Go!
PID 0 starts main loop (0 secs since last report)
PID: 0 Copy: 8022.053429710773MB/sec Scale: 7797.282360443241MB/sec Add: 8867.532703276056MB/sec Triad: 8941.8872751499MB/sec
PID 0 exits (0 secs since last report)
=== Caught the last thread.
Average cpu bandwidth: Copy: 8022MB/sec/cpu Scale: 7797MB/sec/cpu Add: 8867MB/sec/cpu Triad: 8941MB/sec/cpu
Total system bandwidth: Copy: 8022MB/sec Scale: 7797MB/sec Add: 8867MB/sec Triad: 8941MB/sec
====================================================================================
2403 MHz, 0.4161 nanosec clock
STREAM copy latency: 5.01 nanoseconds
STREAM copy bandwidth: 3195.53 MB/sec
STREAM scale latency: 4.89 nanoseconds
STREAM scale bandwidth: 3273.32 MB/sec
STREAM add latency: 8.26 nanoseconds
STREAM add bandwidth: 2907.33 MB/sec
STREAM triad latency: 8.71 nanoseconds
STREAM triad bandwidth: 2754.50 MB/sec
STREAM copy latency: 13.39 nanoseconds
STREAM copy bandwidth: 11949.94 MB/sec
STREAM scale latency: 14.75 nanoseconds
STREAM scale bandwidth: 10844.01 MB/sec
STREAM add latency: 20.55 nanoseconds
STREAM add bandwidth: 11680.69 MB/sec
STREAM triad latency: 20.98 nanoseconds
STREAM triad bandwidth: 11442.06 MB/sec
"stride=128
0.00049 1.665
0.00098 1.665
0.00195 1.665
0.00293 1.665
0.00391 1.665
0.00586 1.665
0.00781 1.665
0.01172 1.665
0.01562 1.665
0.02344 1.665
0.03125 1.665
0.04688 1.665
0.06250 1.665
0.09375 14.523
0.12500 14.548
0.18750 14.493
0.25000 14.493
0.37500 14.559
0.50000 14.559
0.75000 14.556
1.00000 14.556
1.50000 14.556
2.00000 14.649
3.00000 14.556
4.00000 14.740
6.00000 189.885
8.00000 191.609
12.00000 189.892
16.00000 192.148
24.00000 190.483
32.00000 192.763
48.00000 190.542
64.00000 192.617
96.00000 192.669
128.00000 190.633
192.00000 192.765
256.00000 190.715 <- Expensive machine, slow memory!!!
384.00000 190.695
512.00000 190.719
====================================================================================
ackermann
real 0m0.19s
user 0m0.07s
sys 0m0.18s
fibo
real 0m0.19s
user 0m0.07s
sys 0m0.18s
hash2
real 0m0.19s
user 0m0.07s
sys 0m0.18s
hash
real 0m0.18s
user 0m0.07s
sys 0m0.15s
heapsort
real 0m0.18s
user 0m0.07s
sys 0m0.17s
matrix
real 0m0.21s
user 0m0.07s
sys 0m0.16s
methcall
real 0m0.19s
user 0m0.07s
sys 0m0.16s
nestedloop
real 0m0.19s
user 0m0.07s
sys 0m0.17s
objinst
real 0m0.20s
user 0m0.07s
sys 0m0.17s
random
real 0m0.20s
user 0m0.07s
sys 0m0.16s
sieve
real 0m0.20s
user 0m0.07s
sys 0m0.16s
strcat
real 0m0.19s
user 0m0.07s
sys 0m0.17s
Update 16/4/2015 - Simple CPU benchmark
dd if=/dev/zero bs=1M count=1024 | md5sum
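dd reports its throughput on stderr once md5sum has consumed the gigabyte, and since reading /dev/zero is essentially free, that MB/s figure is effectively a single-core hashing rate. To get a wall-clock number instead, the same pipeline can simply be wrapped in time:
time dd if=/dev/zero bs=1M count=1024 2>/dev/null | md5sum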
UnixBench is trivial to download and compile:
wget http://byte-unixbench.googlecode.com/files/UnixBench5.1.3.tgz
tar zxf ./UnixBench5.1.3.tgz
cd ./UnixBench
./Run
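By default Run goes through the whole suite twice - one copy of each test, then one copy per CPU. If only a subset is needed, the 5.1.3 Run script also accepts copy counts and individual test names (flags as I remember them, double-check Run's usage output):
./Run -c 1 -c 4 dhry2reg whetstone-double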
The tests take a while to finish. The output looks like this:
------------------------------------------------------------------------
Benchmark Run: Mon Oct 15 2012 23:55:22 - 00:23:16
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 12015218.4 lps (10.0 s, 7 samples)
Double-Precision Whetstone 2214.8 MWIPS (10.1 s, 7 samples)
Execl Throughput 896.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 58968.3 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 14578.6 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 422068.2 KBps (30.0 s, 2 samples)
Pipe Throughput 70993.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 16001.5 lps (10.0 s, 7 samples)
Process Creation 1861.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 2525.5 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 737.8 lpm (60.1 s, 2 samples)
System Call Overhead 432496.2 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 12015218.4 1029.6
Double-Precision Whetstone 55.0 2214.8 402.7
Execl Throughput 43.0 896.9 208.6
File Copy 1024 bufsize 2000 maxblocks 3960.0 58968.3 148.9
File Copy 256 bufsize 500 maxblocks 1655.0 14578.6 88.1
File Copy 4096 bufsize 8000 maxblocks 5800.0 422068.2 727.7
Pipe Throughput 12440.0 70993.3 57.1
Pipe-based Context Switching 4000.0 16001.5 40.0
Process Creation 126.0 1861.8 147.8
Shell Scripts (1 concurrent) 42.4 2525.5 595.6
Shell Scripts (8 concurrent) 6.0 737.8 1229.7
System Call Overhead 15000.0 432496.2 288.3
========
System Benchmarks Index Score 249.7
------------------------------------------------------------------------
Benchmark Run: Tue Oct 16 2012 00:23:16 - 00:51:20
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 42619039.2 lps (10.0 s, 7 samples)
Double-Precision Whetstone 8274.0 MWIPS (10.4 s, 7 samples)
Execl Throughput 3398.5 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 68332.4 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 21462.9 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 718205.6 KBps (30.0 s, 2 samples)
Pipe Throughput 149713.5 lps (10.0 s, 7 samples)
Pipe-based Context Switching 61968.3 lps (10.0 s, 7 samples)
Process Creation 5321.7 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5957.1 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 812.6 lpm (60.1 s, 2 samples)
System Call Overhead 1557391.5 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 42619039.2 3652.0
Double-Precision Whetstone 55.0 8274.0 1504.4
Execl Throughput 43.0 3398.5 790.4
File Copy 1024 bufsize 2000 maxblocks 3960.0 68332.4 172.6
File Copy 256 bufsize 500 maxblocks 1655.0 21462.9 129.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 718205.6 1238.3
Pipe Throughput 12440.0 149713.5 120.3
Pipe-based Context Switching 4000.0 61968.3 154.9
Process Creation 126.0 5321.7 422.4
Shell Scripts (1 concurrent) 42.4 5957.1 1405.0
Shell Scripts (8 concurrent) 6.0 812.6 1354.3
System Call Overhead 15000.0 1557391.5 1038.3
========
System Benchmarks Index Score 592.5
This means the VM in question scores 249.7 for a single task and 592.5 for parallel processing.
Another important metric is network speed:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
CPU model : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Number of cores : 1
CPU frequency : 2600.000 MHz
Total amount of ram : 1869 MB
Total amount of swap : 1983 MB
System uptime : 34 min
Download speed from CacheFly: 13.6MB/s
Download speed from Coloat, Atlanta GA: 8.09MB/s
Download speed from Softlayer, Dallas, TX: 7.38MB/s
Download speed from Linode, Tokyo, JP: 1.17MB/s
Download speed from i3d.net, Rotterdam, NL: 3.61MB/s
Download speed from Leaseweb, Haarlem, NL: 176KB/s
Download speed from Softlayer, Singapore: 18.0MB/s
Download speed from Softlayer, Seattle, WA: 9.12MB/s
Download speed from Softlayer, San Jose, CA: 9.77MB/s
Download speed from Softlayer, Washington, DC: 8.37MB/s
I/O speed : 439 MB/s
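The same script also does a quick dd-based disk test, which is where the I/O speed line at the end comes from. If piping a downloaded script straight into bash feels uncomfortable, it can be fetched and inspected first - same result, two steps:
wget -q freevps.us/downloads/bench.sh -O bench.sh
bash bench.sh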