Tuesday, 29 October 2013

Memory latency using lat_mem_rd from lmbench

Memory latency measured on my desktop CPU, an Intel i5-2500:


[zorang@centos6 x86_64-linux-gnu]$ ./mhz
mhz: should take approximately 297 seconds

3355 MHz, 0.2981 nanosec clock



[zorang@centos6 x86_64-linux-gnu]$  numactl --membind=0 --cpunodebind=0 ./lat_mem_rd 2000 128
"stride=128
0.00049 1.205
0.00098 1.198
0.00195 1.195
0.00293 1.209
0.00391 1.211
0.00586 1.201
0.00781 1.199
0.01172 1.201
0.01562 1.194
0.02344 1.200
0.03125 1.217
0.04688 3.523
0.06250 3.646
0.09375 3.616
0.12500 3.611
0.18750 3.658
0.25000 4.928
0.37500 5.837
0.50000 5.791
0.75000 5.843
1.00000 5.883
1.50000 5.959
2.00000 5.983
3.00000 6.174
4.00000 9.150
6.00000 15.852
8.00000 19.982
12.00000 21.567
16.00000 21.585
24.00000 21.735
32.00000 21.610
48.00000 22.535
64.00000 22.093
96.00000 22.033
128.00000 22.608
192.00000 21.498
256.00000 21.594
384.00000 21.492
512.00000 21.473
768.00000 22.752
1024.00000 22.462
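The nanosecond figures can be converted to CPU cycles using the 0.2981 ns clock period that mhz reported. The snippet below is only a rough sketch: the function name ns_to_cycles and the choice of one representative sample per plateau are my own, and the latencies are copied from the run above.

```python
# Clock period reported by ./mhz above (3355 MHz -> 0.2981 ns per cycle).
CLOCK_NS = 0.2981


def ns_to_cycles(latency_ns, clock_ns=CLOCK_NS):
    """Convert a load latency in nanoseconds to CPU clock cycles."""
    return latency_ns / clock_ns


# One sample per plateau, taken from the lat_mem_rd output above:
# (array size in MB, latency in ns). Which rows to pick is an assumption.
PLATEAU_SAMPLES = [
    (0.01562, 1.194),    # fits in L1
    (0.12500, 3.611),    # fits in L2
    (1.00000, 5.883),    # fits in L3
    (512.00000, 21.473), # main memory
]

if __name__ == "__main__":
    for size_mb, latency_ns in PLATEAU_SAMPLES:
        cycles = ns_to_cycles(latency_ns)
        print(f"{size_mb:10.5f} MB  {latency_ns:7.3f} ns  ~{cycles:5.1f} cycles")
```

For this run the four plateaus come out at roughly 4, 12, 20 and 72 cycles.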


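The first column is the array size in MB and the second is the load latency in ns. It is easy to see where each cache level stops being effective:


1. Once the array no longer fits in the 32 KB L1 cache, latency rises from about 1.2 ns to about 3.6 ns
2. Once the array no longer fits in the 256 KB L2 cache (the rise starts at 256 KB), latency rises to about 5.9 ns
3. As the array approaches and exceeds the 6 MB L3 cache (from 4 MB upward), latency climbs to the main memory latency of about 21.5 ns
* Please note that numactl --membind=0 --cpunodebind=0 binds both the CPU and the memory to node 0, so all memory accesses are local (important on NUMA-based servers)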
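The points where latency rises can also be located programmatically rather than by eye. The following is a minimal sketch, not part of lmbench: the helper names parse and find_jumps and the 1.3x threshold are assumptions of mine, and the data is the lat_mem_rd output pasted from the run above.

```python
# Output of lat_mem_rd copied from the run above (size in MB, latency in ns).
LAT_MEM_RD_OUTPUT = '''
"stride=128
0.00049 1.205
0.00098 1.198
0.00195 1.195
0.00293 1.209
0.00391 1.211
0.00586 1.201
0.00781 1.199
0.01172 1.201
0.01562 1.194
0.02344 1.200
0.03125 1.217
0.04688 3.523
0.06250 3.646
0.09375 3.616
0.12500 3.611
0.18750 3.658
0.25000 4.928
0.37500 5.837
0.50000 5.791
0.75000 5.843
1.00000 5.883
1.50000 5.959
2.00000 5.983
3.00000 6.174
4.00000 9.150
6.00000 15.852
8.00000 19.982
12.00000 21.567
16.00000 21.585
24.00000 21.735
32.00000 21.610
48.00000 22.535
64.00000 22.093
96.00000 22.033
128.00000 22.608
192.00000 21.498
256.00000 21.594
384.00000 21.492
512.00000 21.473
768.00000 22.752
1024.00000 22.462
'''


def parse(text):
    """Return (size_mb, latency_ns) pairs, skipping headers and blank lines."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # blank line or the '"stride=128' header
        try:
            samples.append((float(parts[0]), float(parts[1])))
        except ValueError:
            continue  # two fields, but not numeric
    return samples


def find_jumps(samples, ratio=1.3):
    """Return the last array size (MB) measured before each latency rise.

    A rise starts when latency grows by at least `ratio` between two
    consecutive samples; consecutive rising steps count as one rise.
    """
    jumps = []
    in_jump = False
    for (size0, lat0), (_size1, lat1) in zip(samples, samples[1:]):
        rising = lat1 / lat0 >= ratio
        if rising and not in_jump:
            jumps.append(size0)
        in_jump = rising
    return jumps


if __name__ == "__main__":
    for size_mb in find_jumps(parse(LAT_MEM_RD_OUTPUT)):
        print(f"latency starts rising after {size_mb * 1024:.0f} KB")
```

With the data above and the 1.3x threshold, this reports 32 KB, 192 KB and 3072 KB (3 MB) as the last sizes before each rise. Because lat_mem_rd only samples certain sizes (the next ones are 48 KB, 256 KB and 4 MB), the sketch brackets each cache boundary rather than pinpointing it.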






See also the Intel Memory Latency Checker (Intel MLC), available for Linux and Windows. It reports:
1. A matrix of idle memory latencies for requests originating from each of the sockets and addressed to each of the available sockets
2. Peak memory bandwidth (assuming all accesses are to local memory) for requests with varying mixes of reads and writes
3. A matrix of memory bandwidth values for requests originating from each of the sockets and addressed to each of the available sockets
4. Latencies at different bandwidth points
It also measures cache-to-cache data transfer latencies.
Intel® MLC also provides command-line arguments for fine-grained control over the latencies and bandwidths that are measured.
Here are some of the things that are possible with command-line arguments:
  • Measure latencies for requests addressed to a specific memory controller from a specific core
  • Measure cache latencies
  • Measure bandwidth from a subset of the cores/sockets
  • Measure bandwidth for different read/write ratios
  • Measure latencies for random address patterns instead of sequential
  • Change stride size for latency measurements
  • Measure cache-to-cache data transfer latencies
