Sunday, 10 November 2013

Simple Linux benchmark



A very quick and dirty benchmark for testing the performance of a new Linux server:


[root@centos run]# cat zbench.linux
#!/bin/ksh
# wget http://www.bitmover.com/lmbench/lmbench3.tar.gz
# ps -e -o "pid,ppid,user,group,cpu,time,etime,pcpu,vsz,thcount,args" | sort -nrk 8 | head
# Quick and dirty bc benchmark
(
time ksh loopbench 1000 ./command1
echo "===================================================================================="
for i in {1..20}; do cat lmbench-3.0-a9.tar >> big; done
time ksh loopbench 100 ./command2
rm big
echo "===================================================================================="
# Stream - Memory bandwidth benchmark
stream/stream
echo "===================================================================================="
# LMbench - low level benchmark
lmbench-3.0-a9/bin/i686-pc-linux-gnu/mhz
lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream
lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream -P 10
lmbench-3.0-a9/bin/i686-pc-linux-gnu/lat_mem_rd 512
echo "===================================================================================="
cd javabench; ksh -x run.sh
) 2>&1 | tee zbench.`hostname`
[root@centos run]#




[root@centos run]# cat loopbench
#!/bin/ksh
date
x=1
while [ $x -le $1 ]
do
(
$2 $3 $4 $5 $6 $7 $8 $9 &
)
let x=$x+1
done
echo Waiting...
wait
date
[root@centos run]#



[root@centos run]# cat command1
bc > /dev/null <<!
sqrt(1234^1234)
!
[root@centos run]#


[root@centos run]# cat command2
gzip -c big > /dev/null
[root@centos run]#
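
The loopbench script boils down to one pattern: start N copies of a command in the background, then wait for all of them. A minimal sketch of that pattern as a POSIX shell function follows. The name run_parallel and the whole-second timing via date +%s are assumptions made for illustration; they are not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper (not in the original scripts): start COUNT background
# copies of a command, wait for all of them, print the elapsed whole seconds.
run_parallel() {
    count=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                  # launch one copy in the background
        i=$((i + 1))
    done
    wait                        # block until every background copy has exited
    end=$(date +%s)
    echo "$((end - start))"
}

# Example: four concurrent one-second sleeps
run_parallel 4 sleep 1
```

Because the copies run concurrently, the four sleeps overlap and the elapsed time stays close to one second.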



[root@centos run]# cat javabench/run.sh
time java -server  -Xmx512M ackermann 13
time java -server  fibo 45
time java -server  hash2 3000
time java -server -Xmx512M  hash 3000000
time java -server -Xmx512m  heapsort 10000000
time java -server  matrix 100000
time java -server  methcall 1000000000
time java -server  nestedloop 45
time java -server  objinst 100000000
time java -server  random 300000000
time java -server  sieve 100000;
time java -server -Xmx450m strcat 10000000
[root@centos run]#





Sample run from a VM on HP N40L with 2 CPU cores:


[root@centos run]# cat zbench.centos.vm_on_n40l-2c
Tue May 29 15:57:57 EST 2012
Waiting...
Tue May 29 16:05:40 EST 2012


real    7m43.09s
user    0m0.13s
sys     0m1.23s
====================================================================================
Tue May 29 16:05:53 EST 2012
Waiting...
Tue May 29 16:06:42 EST 2012


real    0m49.47s
user    0m0.02s
sys     0m0.15s
====================================================================================
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 17870 microseconds.
  (= 17870 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        1678.2010       0.0195       0.0191       0.0199
Scale:       1643.5473       0.0231       0.0195       0.0482
Add:         1895.8864       2.2737       0.0253       5.2402
Triad:       1777.4551       2.2593       0.0270       5.0704
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
====================================================================================
1482 MHz, 0.6748 nanosec clock
STREAM copy latency: 4.30 nanoseconds
STREAM copy bandwidth: 3718.93 MB/sec
STREAM scale latency: 4.27 nanoseconds
STREAM scale bandwidth: 3750.79 MB/sec
STREAM add latency: 9.55 nanoseconds
STREAM add bandwidth: 2513.00 MB/sec
STREAM triad latency: 8.86 nanoseconds
STREAM triad bandwidth: 2708.24 MB/sec
STREAM copy latency: 33.81 nanoseconds
STREAM copy bandwidth: 4733.02 MB/sec
STREAM scale latency: 33.75 nanoseconds
STREAM scale bandwidth: 4741.35 MB/sec
STREAM add latency: 55.61 nanoseconds
STREAM add bandwidth: 4316.11 MB/sec
STREAM triad latency: 51.38 nanoseconds
STREAM triad bandwidth: 4671.42 MB/sec
"stride=128
0.00049 2.053
0.00098 2.054
0.00195 2.054
0.00293 2.053
0.00391 2.052
0.00586 2.039
0.00781 2.038
0.01172 2.038
0.01562 2.039
0.02344 2.036
0.03125 2.038
0.04688 2.040
0.06250 2.042
0.09375 10.448
0.12500 10.452
0.18750 10.456
0.25000 10.470
0.37500 10.464
0.50000 10.476
0.75000 18.414
1.00000 31.156
1.50000 43.502
2.00000 49.361
3.00000 49.480
4.00000 49.587
6.00000 50.297
8.00000 50.249
12.00000 50.312
16.00000 51.070
24.00000 50.377
32.00000 50.364
48.00000 49.446
64.00000 49.434
96.00000 50.209
128.00000 50.186
192.00000 49.416
256.00000 49.358
384.00000 49.380
512.00000 49.316
====================================================================================
+ bin/java -server -Xmx512M ackermann 13


real    0m1.27s
user    0m0.75s
sys     0m0.08s
+ bin/java -server fibo 45
1836311903


real    0m16.26s
user    0m15.98s
sys     0m0.04s
+ bin/java -server hash2 3000
1 9999 3000 29997000


real    0m6.78s
user    0m6.67s
sys     0m0.05s
+ bin/java -server -Xmx512M hash 3000000
299999


real    0m18.71s
user    0m21.32s
sys     0m1.63s
+ bin/java -server -Xmx512m heapsort 10000000
0.9999928555


real    0m13.83s
user    0m13.14s
sys     0m0.23s
+ bin/java -server matrix 100000
270165 1061760 1453695 1856025


real    0m14.90s
user    0m14.59s
sys     0m0.03s
+ bin/java -server methcall 1000000000
true
false


real    0m2.80s
user    0m2.64s
sys     0m0.05s
+ bin/java -server nestedloop 45
-286168967


real    0m20.02s
user    0m19.45s
sys     0m0.04s
+ bin/java -server objinst 100000000
false
true
false
true
false


true
true
false
false
false
true
true
true


real    0m2.60s
user    0m2.32s
sys     0m0.21s
+ bin/java -server random 300000000
92.485425240


real    0m18.52s
user    0m22.21s
sys     0m8.99s
+ bin/java -server sieve 100000
Count: 1028


real    0m13.41s
user    0m13.05s
sys     0m0.03s
+ bin/java -server -Xmx450m strcat 10000000
60000000


real    0m1.95s
user    0m1.20s
sys     0m0.71s
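
To compare runs between hosts it can help to turn the "real" lines of a log like the one above into plain seconds. A small sketch, assuming the NmS.SSs format printed by the shell's time keyword; the function name real_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read a benchmark log on stdin and print each
# "real" time (for example 7m43.09s) as seconds with two decimals.
real_seconds() {
    awk '$1 == "real" {
        split($2, t, "m")       # t[1] = minutes, t[2] = seconds plus trailing "s"
        sub(/s$/, "", t[2])     # drop the trailing "s"
        printf "%.2f\n", t[1] * 60 + t[2]
    }'
}

# Example with two lines taken from the sample run above
printf '%s\n' 'real    7m43.09s' 'real    0m49.47s' | real_seconds
```
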
The same scripts can run on other UNIX platforms. Sample run from a Sun M5000 with 32 CPU cores:
[root@centos run]# cat bnech.vfecos031.m5000-32c
Wednesday,  3 March 2010  2:06:17 PM EST
Waiting...
Wednesday,  3 March 2010  2:06:31 PM EST
====================================================================================
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7607 microseconds.
  (= 7607 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        2851.0255       0.0113       0.0112       0.0114
Scale:       2812.7274       0.0114       0.0114       0.0114
Add:         2915.6217       0.0165       0.0165       0.0165
Triad:       2951.3104       0.0163       0.0163       0.0164
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
=== Warmup 0
PID 0 begins to allocate (0 secs since last report)
PID 0 checks time quanta (0 secs since last report)
PID 0 begins to init array (0 secs since last report)
PID 0 waits on lock (0 secs since last report)
PID 0 starts main loop (0 secs since last report)
PID: 0 Copy: 6122.056921308876MB/sec  Scale: 3389.4736478254817MB/sec  Add: 6422.2609611035305MB/sec  Triad: 5877.310779371098MB/sec
PID 0 exits (0 secs since last report)
=== Warmup 1
PID 1 begins to allocate (0 secs since last report)
PID 1 checks time quanta (0 secs since last report)
PID 1 begins to init array (0 secs since last report)
PID 1 waits on lock (0 secs since last report)
PID 1 starts main loop (0 secs since last report)
PID: 1 Copy: 8034.13325352139MB/sec  Scale: 7171.66593641464MB/sec  Add: 9314.964883724713MB/sec  Triad: 8413.668242323325MB/sec
PID 1 exits (0 secs since last report)
=== Warmup 2
PID 2 begins to allocate (0 secs since last report)
PID 2 checks time quanta (0 secs since last report)
PID 2 begins to init array (0 secs since last report)
PID 2 waits on lock (0 secs since last report)
PID 2 starts main loop (0 secs since last report)
PID: 2 Copy: 7972.0970691415305MB/sec  Scale: 7009.859652401344MB/sec  Add: 8935.22132819245MB/sec  Triad: 8572.960130940737MB/sec
PID 2 exits (0 secs since last report)
=== Warmup 3
PID 3 begins to allocate (0 secs since last report)
PID 3 checks time quanta (0 secs since last report)
PID 3 begins to init array (0 secs since last report)
PID 3 waits on lock (0 secs since last report)
PID 3 starts main loop (0 secs since last report)
PID: 3 Copy: 8117.707266064842MB/sec  Scale: 6927.911341415957MB/sec  Add: 8692.506164743978MB/sec  Triad: 8738.40000813827MB/sec
PID 3 exits (0 secs since last report)
=== main waits 0 secs, seen 0/1 reach barrier
PID 0 begins to allocate (0 secs since last report)
PID 0 checks time quanta (0 secs since last report)
PID 0 begins to init array (0 secs since last report)
PID 0 waits on lock (0 secs since last report)
=== Go!
PID 0 starts main loop (0 secs since last report)
PID: 0 Copy: 8022.053429710773MB/sec  Scale: 7797.282360443241MB/sec  Add: 8867.532703276056MB/sec  Triad: 8941.8872751499MB/sec
PID 0 exits (0 secs since last report)
=== Caught the last thread.
Average cpu bandwidth:  Copy: 8022MB/sec/cpu Scale: 7797MB/sec/cpu Add: 8867MB/sec/cpu Triad: 8941MB/sec/cpu
Total system bandwidth: Copy: 8022MB/sec  Scale: 7797MB/sec  Add: 8867MB/sec  Triad: 8941MB/sec
====================================================================================
2403 MHz, 0.4161 nanosec clock
STREAM copy latency: 5.01 nanoseconds
STREAM copy bandwidth: 3195.53 MB/sec
STREAM scale latency: 4.89 nanoseconds
STREAM scale bandwidth: 3273.32 MB/sec
STREAM add latency: 8.26 nanoseconds
STREAM add bandwidth: 2907.33 MB/sec
STREAM triad latency: 8.71 nanoseconds
STREAM triad bandwidth: 2754.50 MB/sec
STREAM copy latency: 13.39 nanoseconds
STREAM copy bandwidth: 11949.94 MB/sec
STREAM scale latency: 14.75 nanoseconds
STREAM scale bandwidth: 10844.01 MB/sec
STREAM add latency: 20.55 nanoseconds
STREAM add bandwidth: 11680.69 MB/sec
STREAM triad latency: 20.98 nanoseconds
STREAM triad bandwidth: 11442.06 MB/sec
"stride=128
0.00049 1.665
0.00098 1.665
0.00195 1.665
0.00293 1.665
0.00391 1.665
0.00586 1.665
0.00781 1.665
0.01172 1.665
0.01562 1.665
0.02344 1.665
0.03125 1.665
0.04688 1.665
0.06250 1.665
0.09375 14.523
0.12500 14.548
0.18750 14.493
0.25000 14.493
0.37500 14.559
0.50000 14.559
0.75000 14.556
1.00000 14.556
1.50000 14.556
2.00000 14.649
3.00000 14.556
4.00000 14.740
6.00000 189.885
8.00000 191.609
12.00000 189.892
16.00000 192.148
24.00000 190.483
32.00000 192.763
48.00000 190.542
64.00000 192.617
96.00000 192.669
128.00000 190.633
192.00000 192.765
256.00000 190.715 <- Expensive machine, slow memory!!!
384.00000 190.695
512.00000 190.719
====================================================================================
ackermann
real    0m0.19s
user    0m0.07s
sys     0m0.18s
fibo
real    0m0.19s
user    0m0.07s
sys     0m0.18s
hash2
real    0m0.19s
user    0m0.07s
sys     0m0.18s
hash
real    0m0.18s
user    0m0.07s
sys     0m0.15s
heapsort
real    0m0.18s
user    0m0.07s
sys     0m0.17s
matrix
real    0m0.21s
user    0m0.07s
sys     0m0.16s
methcall
real    0m0.19s
user    0m0.07s
sys     0m0.16s
nestedloop
real    0m0.19s
user    0m0.07s
sys     0m0.17s
objinst
real    0m0.20s
user    0m0.07s
sys     0m0.17s
random
real    0m0.20s
user    0m0.07s
sys     0m0.16s
sieve
real    0m0.20s
user    0m0.07s
sys     0m0.16s
strcat
real    0m0.19s
user    0m0.07s
sys     0m0.17s
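
In the lat_mem_rd sections of both runs (the lines after "stride=128), the first column is the array size in MB and the second is the load latency in nanoseconds. Sharp jumps in the second column show where the working set no longer fits in a cache level. A rough sketch for flagging those jumps; the function name find_latency_steps and the default factor of 2 are assumptions, not part of lmbench.

```shell
#!/bin/sh
# Hypothetical filter: print the sizes at which latency grows by at least
# FACTOR (default 2) compared with the previous lat_mem_rd line.
find_latency_steps() {
    awk -v factor="${1:-2}" '
        NF >= 2 && $1 + 0 > 0 {         # keep only "size latency" data lines
            if (prev > 0 && $2 / prev >= factor)
                printf "%s MB: %s ns -> %s ns\n", $1, prev, $2
            prev = $2
        }'
}

# Example with a few lines taken from the M5000 run above
find_latency_steps 2 <<'EOF'
"stride=128
0.06250 1.665
0.09375 14.523
4.00000 14.740
6.00000 189.885
8.00000 191.609
EOF
```

On this excerpt the filter reports two steps, at 0.09375 MB and at 6 MB.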


Update 16/4/2015 - Simple CPU benchmark
dd if=/dev/zero bs=1M count=1024 | md5sum
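
The pipeline above pushes 1024 MB of zeros through md5sum, so the time it takes gives a rough feel for single-thread hashing speed. A sketch that wraps it and prints an approximate rate; the function name cpu_hash_rate is hypothetical, and the timing assumes a date that supports +%s, so the resolution is whole seconds only.

```shell
#!/bin/sh
# Hypothetical wrapper: hash SIZE_MB megabytes of zeros with md5sum and
# print an approximate rate in MB/s (integer, rounded down).
cpu_hash_rate() {
    size_mb=$1
    start=$(date +%s)
    dd if=/dev/zero bs=1M count="$size_mb" 2>/dev/null | md5sum >/dev/null
    end=$(date +%s)
    elapsed=$((end - start))
    # Avoid dividing by zero when the run finishes within one second.
    if [ "$elapsed" -lt 1 ]; then
        elapsed=1
    fi
    echo "$((size_mb / elapsed))"
}

cpu_hash_rate 1024
```

Larger sizes reduce the effect of the one-second timer resolution.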


UnixBench is trivial to download and compile:
wget http://byte-unixbench.googlecode.com/files/UnixBench5.1.3.tgz
tar zxf ./UnixBench5.1.3.tgz
cd ./UnixBench
./Run
The tests take a while to finish. The output looks like this:
------------------------------------------------------------------------
Benchmark Run: Mon Oct 15 2012 23:55:22 - 00:23:16
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       12015218.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     2214.8 MWIPS (10.1 s, 7 samples)
Execl Throughput                                896.9 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         58968.3 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           14578.6 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        422068.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                               70993.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  16001.5 lps   (10.0 s, 7 samples)
Process Creation                               1861.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   2525.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    737.8 lpm   (60.1 s, 2 samples)
System Call Overhead                         432496.2 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   12015218.4   1029.6
Double-Precision Whetstone                       55.0       2214.8    402.7
Execl Throughput                                 43.0        896.9    208.6
File Copy 1024 bufsize 2000 maxblocks          3960.0      58968.3    148.9
File Copy 256 bufsize 500 maxblocks            1655.0      14578.6     88.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     422068.2    727.7
Pipe Throughput                               12440.0      70993.3     57.1
Pipe-based Context Switching                   4000.0      16001.5     40.0
Process Creation                                126.0       1861.8    147.8
Shell Scripts (1 concurrent)                     42.4       2525.5    595.6
Shell Scripts (8 concurrent)                      6.0        737.8   1229.7
System Call Overhead                          15000.0     432496.2    288.3
                                                                   ========
System Benchmarks Index Score                                         249.7

------------------------------------------------------------------------
Benchmark Run: Tue Oct 16 2012 00:23:16 - 00:51:20
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       42619039.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     8274.0 MWIPS (10.4 s, 7 samples)
Execl Throughput                               3398.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         68332.4 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           21462.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        718205.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                              149713.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  61968.3 lps   (10.0 s, 7 samples)
Process Creation                               5321.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   5957.1 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    812.6 lpm   (60.1 s, 2 samples)
System Call Overhead                        1557391.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   42619039.2   3652.0
Double-Precision Whetstone                       55.0       8274.0   1504.4
Execl Throughput                                 43.0       3398.5    790.4
File Copy 1024 bufsize 2000 maxblocks          3960.0      68332.4    172.6
File Copy 256 bufsize 500 maxblocks            1655.0      21462.9    129.7
File Copy 4096 bufsize 8000 maxblocks          5800.0     718205.6   1238.3
Pipe Throughput                               12440.0     149713.5    120.3
Pipe-based Context Switching                   4000.0      61968.3    154.9
Process Creation                                126.0       5321.7    422.4
Shell Scripts (1 concurrent)                     42.4       5957.1   1405.0
Shell Scripts (8 concurrent)                      6.0        812.6   1354.3
System Call Overhead                          15000.0    1557391.5   1038.3
                                                                   ========
System Benchmarks Index Score                                         592.5
This means the VM in question scores 249.7 when running a single copy of the tests and 592.5 when running four parallel copies.
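
The numbers in the tables above are consistent with each INDEX being RESULT / BASELINE * 10 and the final score being the geometric mean of the twelve indices. A small sketch to check that arithmetic with awk; the function names ub_index and ub_score are hypothetical and not part of UnixBench.

```shell
#!/bin/sh
# Hypothetical helpers for checking the UnixBench tables by hand.

# Index of one test: RESULT / BASELINE * 10
ub_index() {
    awk -v result="$1" -v baseline="$2" \
        'BEGIN { printf "%.1f\n", result / baseline * 10 }'
}

# Overall score: geometric mean of the indices, one per line on stdin
ub_score() {
    awk '{ sum += log($1); n++ }
         END { if (n) printf "%.1f\n", exp(sum / n) }'
}

# Dhrystone line of the single-copy run
ub_index 12015218.4 116700.0

# All twelve indices of the single-copy run
printf '%s\n' 1029.6 402.7 208.6 148.9 88.1 727.7 \
              57.1 40.0 147.8 595.6 1229.7 288.3 | ub_score
```

The second command should land very close to the 249.7 reported for the single-copy run; small differences can come from the indices being rounded to one decimal in the table.
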
Another important metric is network speed:
wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash
CPU model :  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Number of cores : 1
CPU frequency :  2600.000 MHz
Total amount of ram : 1869 MB
Total amount of swap : 1983 MB
System uptime :   34 min,
Download speed from CacheFly: 13.6MB/s
Download speed from Coloat, Atlanta GA: 8.09MB/s
Download speed from Softlayer, Dallas, TX: 7.38MB/s
Download speed from Linode, Tokyo, JP: 1.17MB/s
Download speed from i3d.net, Rotterdam, NL: 3.61MB/s
Download speed from Leaseweb, Haarlem, NL: 176KB/s
Download speed from Softlayer, Singapore: 18.0MB/s
Download speed from Softlayer, Seattle, WA: 9.12MB/s
Download speed from Softlayer, San Jose, CA: 9.77MB/s
Download speed from Softlayer, Washington, DC: 8.37MB/s
I/O speed :  439 MB/s

