How to measure cache misses -- A DCPI "one pager" / PJD / 27 September 2001

DTB misses

The Data Translation Buffer (DTB) retains the most recently used virtual to physical address translation information for data stream memory references. The DTB has 128 entries, where each entry typically maps a single 8,192 byte page (unless large pages are in use). This provides a 1Mbyte "reach."

The ratio of DTB misses to retired instructions describes the relative frequency of DTB miss events. Use these commands to measure this ratio:

    dcpiprof -i -pm dtbmiss::ret
    dcpiprof -pm dtbmiss::ret <>
    dcpilist -pm dtbmiss::ret <> <>

50 to 100 DTB misses per retire is unacceptably high.

ITB misses

The Instruction Translation Buffer (ITB) retains the most recently used virtual to physical address translation information for instruction stream references. The ITB also has 128 entries and a 1Mbyte reach (8Kbyte pages).

The ratio of ITB misses to retired instructions describes the relative frequency of ITB miss events. Use these commands to measure this ratio:

    dcpiprof -i -pm itbmiss::ret
    dcpiprof -pm itbmiss::ret <>
    dcpilist -pm itbmiss::ret <> <>

ITB misses are not usually a culprit. Use SPIKE to reduce ITB misses.

D-cache misses

The Alpha 21264A processor does not provide a ProfileMe event to directly measure D-cache misses. Instead, ProfileMe and DCPI provide a way to identify load instructions that take a long time to complete. Use the command:

    dcpitopcounts -pm valid:retdelay::valid <>

to find instructions with an average retire delay greater than 20. Display the code surrounding each candidate instruction:

    dcpilist -pm ret+valid:retdelay::valid <> <>

Work back from the candidate instruction to find the instruction(s) that produce the values that it consumes, i.e., the instruction(s) that write the candidate's register operands. If a producing instruction is a load, then it is probably missing in the L1 and/or L2 data caches.
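The "work back" step can be sketched as a backward scan over a straight-line listing. This is only an illustration: the Insn tuple, the find_producers and suspect_loads helpers, and the register names are hypothetical and are not part of DCPI or its output format. The sketch assumes a single basic block with no branches, and it treats lda/ldah as address arithmetic rather than memory loads.

```python
from typing import NamedTuple, Optional, Tuple

class Insn(NamedTuple):
    """Hypothetical, simplified view of one instruction in a listing."""
    opcode: str             # e.g. "ldq", "addq"
    dest: Optional[str]     # register written, or None (stores, branches)
    srcs: Tuple[str, ...]   # registers read

# Alpha memory-load opcodes; lda/ldah only compute addresses, so they are excluded.
LOAD_OPCODES = {"ldl", "ldq", "ldbu", "ldwu", "ldq_u", "lds", "ldt", "ldf", "ldg"}

def find_producers(block, candidate_index):
    """Map each source register of the candidate to the index of the
    nearest earlier instruction in the block that writes it."""
    wanted = set(block[candidate_index].srcs)
    producers = {}
    for i in range(candidate_index - 1, -1, -1):
        dest = block[i].dest
        if dest in wanted:
            producers[dest] = i
            wanted.discard(dest)
        if not wanted:
            break
    return producers

def suspect_loads(block, candidate_index):
    """Indices of producing instructions that are loads (likely cache misses)."""
    producers = find_producers(block, candidate_index)
    return sorted(i for i in producers.values()
                  if block[i].opcode in LOAD_OPCODES)

# Made-up example block: the addq at index 2 shows a long retire delay.
block = [
    Insn("ldq", "t0", ("a0",)),        # load feeding the candidate
    Insn("lda", "t1", ("sp",)),        # address arithmetic, not a load
    Insn("addq", "t2", ("t0", "t1")),  # candidate instruction
]
print(suspect_loads(block, 2))
```

In real code the producer may sit in a different basic block or be reached along several paths, so a scan like this is only a starting point for reading the dcpilist output by hand.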
See the DCPI one pager "Three common performance culprits" for an example.

I-cache misses

The Alpha 21264A processor does not provide a ProfileMe event to directly measure I-cache misses. Instead, it provides the "not yet prefetched" (nyp) event, which is a lower bound on I-cache misses. The nyp event is asserted when a profiled instruction is contained in an aligned 4-instruction I-cache fetch block that requested a new I-cache fill. The fill is in response to an I-cache miss. This does not count all I-cache misses, however.

The ratio of nyp events to retired instructions is an optimistic approximation to the I-cache miss rate. Use these commands:

    dcpiprof -i -pm nyp::ret
    dcpiprof -pm nyp::ret <>
    dcpilist -pm nyp::ret <> <>

to compute and display this ratio. Use SPIKE to decrease I-cache misses.
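The event-to-retire ratios used in this note (dtbmiss::ret, itbmiss::ret, nyp::ret) are quotients of sample counts. The sketch below shows that arithmetic and a simple ranking by ratio. The event_ratio helper, the procedure names, and the sample counts are all hypothetical; they are not taken from dcpiprof output, and the sketch makes no claim about how DCPI scales or formats the values it prints.

```python
def event_ratio(event_samples: int, retire_samples: int) -> float:
    """Ratio of event samples to retired-instruction samples.

    With no retired samples there is no meaningful ratio, so this
    sketch reports 0.0 rather than dividing by zero.
    """
    if retire_samples == 0:
        return 0.0
    return event_samples / retire_samples

# Made-up per-procedure counts: (nyp samples, retired samples).
samples = {
    "proc_a": (12, 4000),
    "proc_b": (90, 3000),
    "proc_c": (0, 0),
}

# Rank procedures from highest to lowest ratio.
ranked = sorted(samples, key=lambda name: event_ratio(*samples[name]),
                reverse=True)

for name in ranked:
    print(f"{name}: {event_ratio(*samples[name]):.4f}")
```

Because nyp is a lower bound on I-cache misses, a ranking like this is more useful for comparing procedures against each other than for reading any single value as an exact miss rate.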