Measuring load latency -- A DCPI "one pager" / PJD / 27 September 2001

DCPI provides a facility called "value profiling." For each instruction, value profiling builds a statistical, sampled profile of commonly occurring values. Value profiling must be initiated through the dcpid command, which starts the DCPI data collection daemon. Value profiling is implemented using dcpid "plug in" modules that allow extensibility through new and different modules.

One important and useful type of module measures and accumulates memory latency times for load instructions. This helps to pinpoint frequently executed load instructions with poor cache behavior.

You must create a latency bins file and provide the name of the file to the load latency module. The latency bins file lets you grade and report latency values by ranges that directly correspond to levels in the memory hierarchy. Here is a sample latency bins file, named "xp," for an XP-1000 workstation:

    # MIN  MAX   REP  NAME
    # L1 (D-cache)
        0   10     3  L1
    # L2 (B-cache)
       11   40    26  L2
    # Primary memory
       41  150   135  M1
      151  250   200  M2
      251 1500  1000  OVR

Measured latency values between 0 and 10 (inclusive) will be placed into the L1 bin, values between 11 and 40 will be placed into the L2 bin, and so on. L1 cache latency on the XP-1000 is 3 cycles, L2 latency is 26 cycles, etc. The M1, M2 and OVR categories correspond to primary memory. Feel free to change this file as needed to aid analysis or to adapt it to another platform type with different memory latencies.

Enter the following command to collect load latency information:

    dcpid -vtrace '/usr/lib/dcpi/vp-ldlatency.so xp' $DCPIDB

The environment variable DCPIDB contains the path to the DCPI database directory. The -vtrace option specifies:

    1. The path to the load latency module (/usr/lib/dcpi/vp-ldlatency.so), and
    2. The name of a latency bins file, in this case, "xp."

The option value must be quoted to keep the two parts together.
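To make the binning rule for the latency bins file concrete, here is a small Python sketch that reads a file in the format shown in the "xp" sample and assigns a measured latency to a bin. It is only an illustration: the parsing rules ("#" starts a comment, fields are separated by whitespace) are inferred from the sample, the names Bin, parse_bins and classify are hypothetical, and returning None for a value outside every range is an assumption. None of this is taken from the load latency module itself.

```python
from typing import List, NamedTuple, Optional


class Bin(NamedTuple):
    lo: int    # MIN: lowest latency (cycles) that falls in this bin
    hi: int    # MAX: highest latency (cycles) that falls in this bin, inclusive
    rep: int   # REP: typical (representative) latency reported for the bin
    name: str  # NAME: label shown in the report


# Contents of the sample "xp" bins file from the text.
XP_BINS = """\
# MIN MAX REP NAME
# L1 (D-cache)
0 10 3 L1
# L2 (B-cache)
11 40 26 L2
# Primary memory
41 150 135 M1
151 250 200 M2
251 1500 1000 OVR
"""


def parse_bins(text: str) -> List[Bin]:
    """Parse bins-file text; '#' comments and blank lines are skipped (assumed)."""
    bins = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        lo, hi, rep, name = line.split()
        bins.append(Bin(int(lo), int(hi), int(rep), name))
    return bins


def classify(latency: int, bins: List[Bin]) -> Optional[Bin]:
    """Return the bin whose inclusive [lo, hi] range contains latency, else None."""
    for b in bins:
        if b.lo <= latency <= b.hi:
            return b
    return None


if __name__ == "__main__":
    bins = parse_bins(XP_BINS)
    for cycles in (3, 10, 11, 135, 200, 1000):
        b = classify(cycles, bins)
        print(cycles, "->", b.name if b else "no bin")
```

Adapting the file to another platform only changes the rows. The lookup stays the same as long as the ranges do not overlap.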

After collecting data, the same -vtrace option must be given to dcpilist to display load latency information:

    dcpilist -vtrace '/usr/lib/dcpi/vp-ldlatency.so xp' -pm ret loop chase

This command will display data for the procedure "loop" in executable image "chase". Focus your attention on load instructions in frequently executed code. Here is an (edited) line from dcpilist with latency information:

    retired :count   freq                                vtot  thld  nv
            29005   29102  0x120001a30 : ldq a0, 0(a0)  62552   1.0   5
        latency 3.547
        (98.72% (61754/62552) 3 L1)
        (1.05% (655/62552) 25 L2)
        (0.23% (141/62552) 135 M1)
        (0.00% (1/62552) 1000 OVR)
        (0.00% (1/62552) 200 M2)

In the output above, "vtot" is the number of samples collected (62552). "nv" is the number of values (bins). Here is the first bin:

    (98.72% (61754/62552) 3 L1)

98.72% of the samples collected fell into this bin. The number of samples in the bin was 61754 and the total number collected was 62552. The typical bin value is "3" and the bin name is "L1". Both the typical bin value and name are specified by the latency bins file. So, 98.72% of the measured load operations hit in the L1 D-cache -- pretty good.

The table below shows the distribution of load latency samples for a simple program that chases pointers through arrays of different sizes. The distributions are not perfect, but are good enough to assess cache behavior.

    Size   %L1  %L2  %M1  %M2  %OVR
    -------------------------------
    32MB     8   <1   65   26     1
    1MB      8   88    4   <1    <1
    8KB     99    1   <1   <1    <1
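
For post-processing, the bin tuples in a dcpilist line are regular enough to pull apart with a short script. The Python sketch below parses the tuples from the sample line above and recomputes the percentages. It also computes the sample-weighted average of the typical bin values, which for this sample works out to the "latency 3.547" figure. That reading of the latency field is an inference from the arithmetic, not something stated by DCPI. The regular expression and the function names are hypothetical and assume the exact tuple layout shown in the sample.

```python
import re

# Bin tuples copied from the sample dcpilist line in the text.
SAMPLE = (
    "(98.72% (61754/62552) 3 L1) (1.05% (655/62552) 25 L2) "
    "(0.23% (141/62552) 135 M1) (0.00% (1/62552) 1000 OVR) "
    "(0.00% (1/62552) 200 M2)"
)

# One tuple looks like: (PCT% (COUNT/TOTAL) REP NAME)
BIN_RE = re.compile(r"\(([\d.]+)% \((\d+)/(\d+)\) (\d+) (\w+)\)")


def parse_dcpilist_bins(text):
    """Return a list of (name, count, total, rep) tuples found in text."""
    return [
        (name, int(count), int(total), int(rep))
        for _pct, count, total, rep, name in BIN_RE.findall(text)
    ]


def weighted_mean_latency(bins):
    """Average of the typical bin values, weighted by samples per bin."""
    samples = sum(count for _name, count, _total, _rep in bins)
    cycles = sum(count * rep for _name, count, _total, rep in bins)
    return cycles / samples


if __name__ == "__main__":
    bins = parse_dcpilist_bins(SAMPLE)
    for name, count, total, rep in bins:
        pct = 100.0 * count / total
        print(f"{name:>3}: {pct:6.2f}%  typical {rep} cycles")
    print(f"weighted mean latency: {weighted_mean_latency(bins):.3f}")
```

Sorting instructions by this weighted mean, or by the fraction of samples outside the L1 bin, is one way to rank candidate loads for closer inspection.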