Zoran's Blog: Simple Linux benchmark

A very quick and dirty benchmark for testing the performance of a new Linux server:

[root@centos run]# cat zbench.linux

#!/bin/ksh
# wget http://www.bitmover.com/lmbench/lmbench3.tar.gz
# ps -e -o "pid,ppid,user,group,cpu,time,etime,pcpu,vsz,thcount,args" | sort -nrk 8 | head

# Quick and dirty bc benchmark

(

time ksh loopbench 1000 ./command1

echo "===================================================================================="

for i in {1..20}; do cat lmbench-3.0-a9.tar >> big; done

time ksh loopbench 100 ./command2

rm big

echo "===================================================================================="

# Stream - Memory bandwidth benchmark

stream/stream

echo "===================================================================================="

# LMbench - low level benchmark

lmbench-3.0-a9/bin/i686-pc-linux-gnu/mhz

lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream

lmbench-3.0-a9/bin/i686-pc-linux-gnu/stream -P 10

lmbench-3.0-a9/bin/i686-pc-linux-gnu/lat_mem_rd 512

echo "===================================================================================="

cd javabench; ksh -x run.sh

) 2>&1 | tee zbench.`hostname`

[root@centos run]#

[root@centos run]# cat loopbench

#!/bin/ksh

date

x=1

while [ $x -le $1 ]

do

(

$2 $3 $4 $5 $6 $7 $8 $9 &

)

let x=$x+1
done

echo Waiting...

wait

date

[root@centos run]#
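
loopbench starts $1 copies of the command in the background and then waits for all of them, so the elapsed time between the two dates shows how the box copes with that many concurrent processes. A minimal sketch of the same idea as a reusable function follows. The name run_parallel and the rp_ variables are made up for illustration and are not part of the original scripts.

```shell
#!/bin/sh
# run_parallel COUNT CMD [ARGS...]
# Hypothetical helper: start COUNT background copies of CMD, then block
# until every copy has exited (the same pattern loopbench uses).
run_parallel() {
    rp_count=$1
    shift
    rp_i=0
    while [ "$rp_i" -lt "$rp_count" ]; do
        "$@" &                  # launch one copy in the background
        rp_i=$((rp_i + 1))
    done
    wait                        # returns once all background children finish
}

# Example (not run here): four concurrent one-second sleeps.
# time run_parallel 4 sleep 1
```

Passing the command as "$@" keeps any number of arguments intact, whereas the original is limited to $2 through $9.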

[root@centos run]# cat command1

bc > /dev/null <<!

sqrt(1234^1234)

!

[root@centos run]#

[root@centos run]# cat command2

gzip -c big > /dev/null

[root@centos run]#

[root@centos run]# cat javabench/run.sh

time java -server -Xmx512M ackermann 13

time java -server fibo 45

time java -server hash2 3000

time java -server -Xmx512M hash 3000000

time java -server -Xmx512m heapsort 10000000

time java -server matrix 100000

time java -server methcall 1000000000

time java -server nestedloop 45

time java -server objinst 100000000

time java -server random 300000000

time java -server sieve 100000;

time java -server -Xmx450m strcat 10000000

[root@centos run]#

Sample run from a VM on HP N40L with 2 CPU cores:

[root@centos run]# cat zbench.centos.vm_on_n40l-2c

Tue May 29 15:57:57 EST 2012

Waiting...

Tue May 29 16:05:40 EST 2012

real 7m43.09s

user 0m0.13s

sys 0m1.23s

====================================================================================

Tue May 29 16:05:53 EST 2012

Waiting...

Tue May 29 16:06:42 EST 2012

real 0m49.47s

user 0m0.02s

sys 0m0.15s

====================================================================================

-------------------------------------------------------------

STREAM version $Revision: 5.9 $

-------------------------------------------------------------

This system uses 8 bytes per DOUBLE PRECISION word.

-------------------------------------------------------------

Array size = 2000000, Offset = 0

Total memory required = 45.8 MB.

Each test is run 10 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Printing one line per active thread....

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 17870 microseconds.

(= 17870 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max time

Copy: 1678.2010 0.0195 0.0191 0.0199

Scale: 1643.5473 0.0231 0.0195 0.0482

Add: 1895.8864 2.2737 0.0253 5.2402

Triad: 1777.4551 2.2593 0.0270 5.0704

-------------------------------------------------------------

Solution Validates

-------------------------------------------------------------

====================================================================================

1482 MHz, 0.6748 nanosec clock

STREAM copy latency: 4.30 nanoseconds

STREAM copy bandwidth: 3718.93 MB/sec

STREAM scale latency: 4.27 nanoseconds

STREAM scale bandwidth: 3750.79 MB/sec

STREAM add latency: 9.55 nanoseconds

STREAM add bandwidth: 2513.00 MB/sec

STREAM triad latency: 8.86 nanoseconds

STREAM triad bandwidth: 2708.24 MB/sec

STREAM copy latency: 33.81 nanoseconds

STREAM copy bandwidth: 4733.02 MB/sec

STREAM scale latency: 33.75 nanoseconds

STREAM scale bandwidth: 4741.35 MB/sec

STREAM add latency: 55.61 nanoseconds

STREAM add bandwidth: 4316.11 MB/sec

STREAM triad latency: 51.38 nanoseconds

STREAM triad bandwidth: 4671.42 MB/sec

"stride=128

0.00049 2.053

0.00098 2.054

0.00195 2.054

0.00293 2.053

0.00391 2.052

0.00586 2.039

0.00781 2.038

0.01172 2.038

0.01562 2.039

0.02344 2.036

0.03125 2.038

0.04688 2.040

0.06250 2.042

0.09375 10.448

0.12500 10.452

0.18750 10.456

0.25000 10.470

0.37500 10.464

0.50000 10.476

0.75000 18.414

1.00000 31.156

1.50000 43.502

2.00000 49.361

3.00000 49.480

4.00000 49.587

6.00000 50.297

8.00000 50.249

12.00000 50.312

16.00000 51.070

24.00000 50.377

32.00000 50.364

48.00000 49.446

64.00000 49.434

96.00000 50.209

128.00000 50.186

192.00000 49.416

256.00000 49.358

384.00000 49.380

512.00000 49.316

====================================================================================

+ bin/java -server -Xmx512M ackermann 13

real 0m1.27s

user 0m0.75s

sys 0m0.08s

+ bin/java -server fibo 45

1836311903

real 0m16.26s

user 0m15.98s

sys 0m0.04s

+ bin/java -server hash2 3000

1 9999 3000 29997000

real 0m6.78s

user 0m6.67s

sys 0m0.05s

+ bin/java -server -Xmx512M hash 3000000

299999

real 0m18.71s

user 0m21.32s

sys 0m1.63s

+ bin/java -server -Xmx512m heapsort 10000000

0.9999928555

real 0m13.83s

user 0m13.14s

sys 0m0.23s

+ bin/java -server matrix 100000

270165 1061760 1453695 1856025

real 0m14.90s

user 0m14.59s

sys 0m0.03s

+ bin/java -server methcall 1000000000

true

false

real 0m2.80s

user 0m2.64s

sys 0m0.05s

+ bin/java -server nestedloop 45

-286168967

real 0m20.02s

user 0m19.45s

sys 0m0.04s

+ bin/java -server objinst 100000000

false

true

false

true

false

true

false

true

real 0m2.60s

user 0m2.32s

sys 0m0.21s

+ bin/java -server random 300000000

92.485425240

real 0m18.52s

user 0m22.21s

sys 0m8.99s

+ bin/java -server sieve 100000

Count: 1028

real 0m13.41s

user 0m13.05s

sys 0m0.03s

+ bin/java -server -Xmx450m strcat 10000000

60000000

real 0m1.95s

user 0m1.20s

sys 0m0.71s
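
In the lat_mem_rd part of the run above, the first column is the array size in MB and the second is the load latency in nanoseconds. Flat stretches usually correspond to one level of the memory hierarchy, and a jump marks the point where the array stops fitting in it. The sketch below picks out those jumps from a saved result file. The function name find_latency_steps and the 1.5x threshold are assumptions chosen for illustration, not something lmbench provides.

```shell
#!/bin/sh
# find_latency_steps: read "size_MB latency_ns" rows on stdin and report
# every size at which latency grows by more than `factor` times the
# previous row. Non-numeric lines such as "stride=128 are skipped.
find_latency_steps() {
    awk -v factor=1.5 '
        NF >= 2 && $1 + 0 > 0 {
            if (prev > 0 && $2 > prev * factor)
                printf "step at %s MB: %.3f -> %.3f ns\n", $1, prev, $2
            prev = $2
        }'
}

# Example (not run here): find_latency_steps < zbench.centos.vm_on_n40l-2c
```

The threshold is arbitrary. A gradual transition that spans several rows can be reported as more than one step.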

The script also runs on other UNIX platforms. Sample run from a Sun M5000 with 32 CPU cores:

[root@centos run]# cat bnech.vfecos031.m5000-32c

Wednesday, 3 March 2010 2:06:17 PM EST

Waiting...

Wednesday, 3 March 2010 2:06:31 PM EST

====================================================================================

-------------------------------------------------------------

STREAM version $Revision: 5.9 $

-------------------------------------------------------------

This system uses 8 bytes per DOUBLE PRECISION word.

-------------------------------------------------------------

Array size = 2000000, Offset = 0

Total memory required = 45.8 MB.

Each test is run 10 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Printing one line per active thread....

-------------------------------------------------------------

Your clock granularity/precision appears to be 1 microseconds.

Each test below will take on the order of 7607 microseconds.

(= 7607 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max time

Copy: 2851.0255 0.0113 0.0112 0.0114

Scale: 2812.7274 0.0114 0.0114 0.0114

Add: 2915.6217 0.0165 0.0165 0.0165

Triad: 2951.3104 0.0163 0.0163 0.0164

-------------------------------------------------------------

Solution Validates

-------------------------------------------------------------

=== Warmup 0

PID 0 begins to allocate (0 secs since last report)

PID 0 checks time quanta (0 secs since last report)

PID 0 begins to init array (0 secs since last report)

PID 0 waits on lock (0 secs since last report)

PID 0 starts main loop (0 secs since last report)

PID: 0 Copy: 6122.056921308876MB/sec Scale: 3389.4736478254817MB/sec Add: 6422.2609611035305MB/sec Triad: 5877.310779371098MB/sec

PID 0 exits (0 secs since last report)

=== Warmup 1

PID 1 begins to allocate (0 secs since last report)

PID 1 checks time quanta (0 secs since last report)

PID 1 begins to init array (0 secs since last report)

PID 1 waits on lock (0 secs since last report)

PID 1 starts main loop (0 secs since last report)

PID: 1 Copy: 8034.13325352139MB/sec Scale: 7171.66593641464MB/sec Add: 9314.964883724713MB/sec Triad: 8413.668242323325MB/sec

PID 1 exits (0 secs since last report)

=== Warmup 2

PID 2 begins to allocate (0 secs since last report)

PID 2 checks time quanta (0 secs since last report)

PID 2 begins to init array (0 secs since last report)

PID 2 waits on lock (0 secs since last report)

PID 2 starts main loop (0 secs since last report)

PID: 2 Copy: 7972.0970691415305MB/sec Scale: 7009.859652401344MB/sec Add: 8935.22132819245MB/sec Triad: 8572.960130940737MB/sec

PID 2 exits (0 secs since last report)

=== Warmup 3

PID 3 begins to allocate (0 secs since last report)

PID 3 checks time quanta (0 secs since last report)

PID 3 begins to init array (0 secs since last report)

PID 3 waits on lock (0 secs since last report)

PID 3 starts main loop (0 secs since last report)

PID: 3 Copy: 8117.707266064842MB/sec Scale: 6927.911341415957MB/sec Add: 8692.506164743978MB/sec Triad: 8738.40000813827MB/sec

PID 3 exits (0 secs since last report)

=== main waits 0 secs, seen 0/1 reach barrier

PID 0 begins to allocate (0 secs since last report)

PID 0 checks time quanta (0 secs since last report)

PID 0 begins to init array (0 secs since last report)

PID 0 waits on lock (0 secs since last report)

=== Go!

PID 0 starts main loop (0 secs since last report)

PID: 0 Copy: 8022.053429710773MB/sec Scale: 7797.282360443241MB/sec Add: 8867.532703276056MB/sec Triad: 8941.8872751499MB/sec

PID 0 exits (0 secs since last report)

=== Caught the last thread.

Average cpu bandwidth: Copy: 8022MB/sec/cpu Scale: 7797MB/sec/cpu Add: 8867MB/sec/cpu Triad: 8941MB/sec/cpu

Total system bandwidth: Copy: 8022MB/sec Scale: 7797MB/sec Add: 8867MB/sec Triad: 8941MB/sec

====================================================================================

2403 MHz, 0.4161 nanosec clock

STREAM copy latency: 5.01 nanoseconds

STREAM copy bandwidth: 3195.53 MB/sec

STREAM scale latency: 4.89 nanoseconds

STREAM scale bandwidth: 3273.32 MB/sec

STREAM add latency: 8.26 nanoseconds

STREAM add bandwidth: 2907.33 MB/sec

STREAM triad latency: 8.71 nanoseconds

STREAM triad bandwidth: 2754.50 MB/sec

STREAM copy latency: 13.39 nanoseconds

STREAM copy bandwidth: 11949.94 MB/sec

STREAM scale latency: 14.75 nanoseconds

STREAM scale bandwidth: 10844.01 MB/sec

STREAM add latency: 20.55 nanoseconds

STREAM add bandwidth: 11680.69 MB/sec

STREAM triad latency: 20.98 nanoseconds

STREAM triad bandwidth: 11442.06 MB/sec

"stride=128

0.00049 1.665

0.00098 1.665

0.00195 1.665

0.00293 1.665

0.00391 1.665

0.00586 1.665

0.00781 1.665

0.01172 1.665

0.01562 1.665

0.02344 1.665

0.03125 1.665

0.04688 1.665

0.06250 1.665

0.09375 14.523

0.12500 14.548

0.18750 14.493

0.25000 14.493

0.37500 14.559

0.50000 14.559

0.75000 14.556

1.00000 14.556

1.50000 14.556

2.00000 14.649

3.00000 14.556

4.00000 14.740

6.00000 189.885

8.00000 191.609

12.00000 189.892

16.00000 192.148

24.00000 190.483

32.00000 192.763

48.00000 190.542

64.00000 192.617

96.00000 192.669

128.00000 190.633

192.00000 192.765

256.00000 190.715 <- Expensive machine, slow memory!!!

384.00000 190.695

512.00000 190.719

====================================================================================

ackermann

real 0m0.19s

user 0m0.07s

sys 0m0.18s

fibo

real 0m0.19s

user 0m0.07s

sys 0m0.18s

hash2

real 0m0.19s

user 0m0.07s

sys 0m0.18s

hash

real 0m0.18s

user 0m0.07s

sys 0m0.15s

heapsort

real 0m0.18s

user 0m0.07s

sys 0m0.17s

matrix

real 0m0.21s

user 0m0.07s

sys 0m0.16s

methcall

real 0m0.19s

user 0m0.07s

sys 0m0.16s

nestedloop

real 0m0.19s

user 0m0.07s

sys 0m0.17s

objinst

real 0m0.20s

user 0m0.07s

sys 0m0.17s

random

real 0m0.20s

user 0m0.07s

sys 0m0.16s

sieve

real 0m0.20s

user 0m0.07s

sys 0m0.16s

strcat

real 0m0.19s

user 0m0.07s

sys 0m0.17s

Update 16/4/2015 - Simple CPU benchmark

dd if=/dev/zero bs=1M count=1024 | md5sum
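
The one-liner above pushes 1 GiB of zeros through md5sum. The hashing is the expensive part, so the throughput that dd reports gives a rough single-core figure. Here is a small sketch that turns it into one number for comparing machines. The function name cpu_hash_bench is hypothetical, and timing uses whole seconds from date +%s, so small sizes give very coarse results.

```shell
#!/bin/sh
# cpu_hash_bench [MIB]: hash MIB mebibytes of zeros (default 1024) and
# print an approximate throughput in MiB/s.
cpu_hash_bench() {
    chb_mib=${1:-1024}
    chb_start=$(date +%s)
    dd if=/dev/zero bs=1048576 count="$chb_mib" 2>/dev/null | md5sum >/dev/null
    chb_end=$(date +%s)
    chb_secs=$((chb_end - chb_start))
    [ "$chb_secs" -gt 0 ] || chb_secs=1   # avoid dividing by zero on fast runs
    echo "$((chb_mib / chb_secs)) MiB/s"
}

# Example (not run here): cpu_hash_bench 1024
```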

UnixBench is trivial to download and compile:

wget http://byte-unixbench.googlecode.com/files/UnixBench5.1.3.tgz
tar zxf ./UnixBench5.1.3.tgz
cd ./UnixBench
./Run

The tests take a while to finish. The output looks like this:

------------------------------------------------------------------------
Benchmark Run: Mon Oct 15 2012 23:55:22 - 00:23:16
4 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       12015218.4 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     2214.8 MWIPS (10.1 s, 7 samples)
Execl Throughput                                896.9 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         58968.3 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           14578.6 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        422068.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                               70993.3 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  16001.5 lps   (10.0 s, 7 samples)
Process Creation                               1861.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   2525.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    737.8 lpm   (60.1 s, 2 samples)
System Call Overhead                         432496.2 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   12015218.4   1029.6
Double-Precision Whetstone                       55.0       2214.8    402.7
Execl Throughput                                 43.0        896.9    208.6
File Copy 1024 bufsize 2000 maxblocks          3960.0      58968.3    148.9
File Copy 256 bufsize 500 maxblocks            1655.0      14578.6     88.1
File Copy 4096 bufsize 8000 maxblocks          5800.0     422068.2    727.7
Pipe Throughput                               12440.0      70993.3     57.1
Pipe-based Context Switching                   4000.0      16001.5     40.0
Process Creation                                126.0       1861.8    147.8
Shell Scripts (1 concurrent)                     42.4       2525.5    595.6
Shell Scripts (8 concurrent)                      6.0        737.8   1229.7
System Call Overhead                          15000.0     432496.2    288.3
                                                                   ========
System Benchmarks Index Score                                         249.7

------------------------------------------------------------------------
Benchmark Run: Tue Oct 16 2012 00:23:16 - 00:51:20
4 CPUs in system; running 4 parallel copies of tests

Dhrystone 2 using register variables       42619039.2 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     8274.0 MWIPS (10.4 s, 7 samples)
Execl Throughput                               3398.5 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks         68332.4 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks           21462.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks        718205.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                              149713.5 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                  61968.3 lps   (10.0 s, 7 samples)
Process Creation                               5321.7 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   5957.1 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                    812.6 lpm   (60.1 s, 2 samples)
System Call Overhead                        1557391.5 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   42619039.2   3652.0
Double-Precision Whetstone                       55.0       8274.0   1504.4
Execl Throughput                                 43.0       3398.5    790.4
File Copy 1024 bufsize 2000 maxblocks          3960.0      68332.4    172.6
File Copy 256 bufsize 500 maxblocks            1655.0      21462.9    129.7
File Copy 4096 bufsize 8000 maxblocks          5800.0     718205.6   1238.3
Pipe Throughput                               12440.0     149713.5    120.3
Pipe-based Context Switching                   4000.0      61968.3    154.9
Process Creation                                126.0       5321.7    422.4
Shell Scripts (1 concurrent)                     42.4       5957.1   1405.0
Shell Scripts (8 concurrent)                      6.0        812.6   1354.3
System Call Overhead                          15000.0    1557391.5   1038.3
                                                                   ========
System Benchmarks Index Score                                         592.5

This means that the VM in question scores 249.7 when running a single copy of the tests and 592.5 when running four parallel copies, roughly 2.4 times the single-copy score on 4 CPUs.
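
In the index tables above, each INDEX value is RESULT / BASELINE * 10, and the final score is the geometric mean of the per-test index values. A short sketch of that arithmetic follows. The function names ub_index and ub_score are made up here and are not part of UnixBench.

```shell
#!/bin/sh
# ub_index BASELINE RESULT: per-test index, i.e. RESULT / BASELINE * 10.
ub_index() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.1f\n", r / b * 10 }'
}

# ub_score: geometric mean of the index values read from stdin, one per line.
ub_score() {
    awk '$1 > 0 { sum += log($1); n++ }
         END { if (n) printf "%.1f\n", exp(sum / n) }'
}

# Example: the Pipe-based Context Switching row of the single-copy run.
ub_index 4000.0 16001.5    # prints 40.0
```

Feeding the twelve single-copy index values into ub_score should come out close to the reported 249.7. Because the score is a geometric mean, one very low index (here the pipe tests) pulls it down more than a single high index lifts it.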

Another important metric is network speed:

wget freevps.us/downloads/bench.sh -O - -o /dev/null|bash

CPU model :  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Number of cores : 1
CPU frequency :  2600.000 MHz
Total amount of ram : 1869 MB
Total amount of swap : 1983 MB
System uptime :   34 min,
Download speed from CacheFly: 13.6MB/s
Download speed from Coloat, Atlanta GA: 8.09MB/s
Download speed from Softlayer, Dallas, TX: 7.38MB/s
Download speed from Linode, Tokyo, JP: 1.17MB/s
Download speed from i3d.net, Rotterdam, NL: 3.61MB/s
Download speed from Leaseweb, Haarlem, NL: 176KB/s
Download speed from Softlayer, Singapore: 18.0MB/s
Download speed from Softlayer, Seattle, WA: 9.12MB/s
Download speed from Softlayer, San Jose, CA: 9.77MB/s
Download speed from Softlayer, Washington, DC: 8.37MB/s
I/O speed :  439 MB/s

Sunday, 10 November 2013
