Three common performance culprits -- A DCPI "one pager" / PJD / 28 Sept 2001

Ratios are a better means of evaluating and comparing results than raw event counts, especially when comparing results across runs. The relative frequency of an event is easier to interpret than a large raw count.

DCPI collects data by sampling. When assessing instruction-level performance, results for any instruction with fewer than 100 retire samples are not statistically significant. Increase the program runtime or measure across multiple runs to increase the number of samples collected. A retired instruction is one that has successfully completed execution.

Conditional branch mispredicts

The Alpha processor predicts the direction of conditional branches and fetches and executes new instructions based on the prediction. If predictions are often wrong, the CPU must recover frequently and much work is discarded. Conditional branches with a high ratio of mispredicts to retired executions (a misprediction rate greater than 10%) may be a problem. The command:

    dcpitopcounts -pm cbrmispredict::retired <>

displays the instructions with the highest misprediction rate. Use the command:

    dcpilist -pm retired+cbrmispredict::retired <> <>

to display code for the procedure containing the culprit instruction.

Memory system replay traps

The Alpha processor executes instructions out of order, but has ordering rules to preserve correctness. These rules are applied to load and store instructions relatively late in the pipeline. If a rule is violated, some results must be discarded and instructions must be replayed. Procedures and instructions with a low ratio of retired instructions to replays (below 100) may indicate a problem.
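
As a minimal sketch, the ratio rules of thumb given so far can be written as a few checks. The thresholds (100 retire samples, a 10% misprediction rate, 100 retired instructions per replay) come from the text. The Python function names, and the idea of passing sample counts in by hand, are assumptions made for illustration and are not part of DCPI. The sketch reads a::b as the ratio of a to b, which matches the way retired::replays is described above.

```python
# Thresholds are taken from the text; all names here are illustrative only.
MIN_RETIRE_SAMPLES = 100       # fewer retire samples: not statistically significant
MAX_MISPREDICT_RATE = 0.10     # a misprediction rate above 10% may be a problem
MIN_RETIRED_PER_REPLAY = 100   # retired/replays below 100 may indicate a problem


def is_significant(retired):
    """True when there are enough retire samples to trust a ratio."""
    return retired >= MIN_RETIRE_SAMPLES


def mispredict_rate(cbrmispredict, retired):
    """cbrmispredict::retired: mispredict samples per retire sample."""
    return cbrmispredict / retired


def retired_per_replay(retired, replays):
    """retired::replays; infinite when no replay samples were collected."""
    return float("inf") if replays == 0 else retired / replays


def flag_branch(cbrmispredict, retired):
    """Flag a conditional branch whose misprediction rate exceeds 10%."""
    return (is_significant(retired)
            and mispredict_rate(cbrmispredict, retired) > MAX_MISPREDICT_RATE)


def flag_replays(retired, replays):
    """Flag an instruction or procedure with fewer than 100 retires per replay."""
    return (is_significant(retired)
            and retired_per_replay(retired, replays) < MIN_RETIRED_PER_REPLAY)
```

For example, flag_branch(30, 200) flags a branch with a 15% misprediction rate, while flag_branch(30, 50) does not, because 50 retire samples are too few to trust.
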
Here are commands to display these ratios:

    dcpiprof -i -pm retired+retired::replays
    dcpiprof -pm retired+retired::replays <>

Display ratios and code for a procedure in an image with:

    dcpilist -pm retired+retired::replays <> <>

Load latency when reading from memory

Memory data in the L1 D-cache can be read quickly. The L2 cache is slower (20+ cycles) and primary memory is slower still (130+ cycles).

Find instructions with a high average retire delay (greater than 20), then determine whether those instructions consume a value produced by a nearby load. Enter:

    dcpitopcounts -pm valid:retdelay::valid <>

This command lists the instructions with the highest average retire delay in the image. An instruction with an average retire delay greater than 20 is a candidate for further investigation. For each candidate, use the command:

    dcpilist -pm ret+valid:retdelay::valid <> <>

to display the code for the procedure containing the candidate instruction. Look back through the code to find the instruction(s) that produce the values consumed by the candidate, for example:

    ldq   t9, 144(sp)    <- Load instruction (the producer)
    ...                  <- Zero or more instructions
    subq  t9, t8, t5     <- Candidate (the consumer)

If the producer is a load instruction, then the load is probably suffering L1 or L2 cache misses.

Retire delay is a lower bound on the number of processor cycles by which an instruction has delayed its own retirement and the retirement of the instructions that follow it in program order. Retire delay (retdelay) is defined only for valid instructions that retire without causing a trap. Consumers of multi-cycle instructions, such as integer or floating-point multiply, have a higher average retire delay and are sometimes "false positives."
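
The look-back step can be sketched roughly as follows. This is an illustration under several simplifying assumptions, not a DCPI feature. It assumes the instruction text has already been extracted by hand from the listing, one instruction per string, and that the code is straight-line with no branches. It assumes loads write their first operand, stores write no register, and every other instruction writes its last operand. The mnemonic sets are partial, and all function names are hypothetical.

```python
import re

# Partial mnemonic sets, for illustration only.
LOADS = {"ldl", "ldq", "lds", "ldt", "ldbu", "ldwu", "ldq_u"}
STORES = {"stl", "stq", "sts", "stt", "stb", "stw", "stq_u"}
MULTIPLIES = {"mull", "mulq", "umulh", "muls", "mult"}


def parse(line):
    """Split 'subq t9, t8, t5' into ('subq', ['t9', 't8', 't5'])."""
    parts = line.split(None, 1)
    rest = parts[1] if len(parts) > 1 else ""
    return parts[0], [op.strip() for op in rest.split(",") if op.strip()]


def dest_of(mnemonic, operands):
    """Register written by the instruction, or None (simplified rules)."""
    if not operands or mnemonic in STORES:
        return None
    return operands[0] if mnemonic in LOADS else operands[-1]


def base_reg(operand):
    """Return 'sp' for '144(sp)'; other operands are returned unchanged."""
    match = re.search(r"\((\w+)\)", operand)
    return match.group(1) if match else operand


def sources_of(mnemonic, operands):
    """Registers (and literals, which never match a writer) that are read."""
    if mnemonic in LOADS:
        srcs = operands[1:]
    elif mnemonic in STORES:
        srcs = operands
    else:
        srcs = operands[:-1]
    return {base_reg(op) for op in srcs}


def kind_of(mnemonic):
    if mnemonic in LOADS:
        return "load"        # producer may be missing in the L1 or L2 cache
    if mnemonic in MULTIPLIES:
        return "multiply"    # multi-cycle producer: possible false positive
    return "other"


def find_producers(listing, candidate_index):
    """Walk backward from the candidate to the nearest writer of each source."""
    insns = [parse(line) for line in listing]
    wanted = sources_of(*insns[candidate_index])
    producers = {}
    for i in range(candidate_index - 1, -1, -1):
        mnemonic, operands = insns[i]
        dest = dest_of(mnemonic, operands)
        if dest in wanted and dest not in producers:
            producers[dest] = (i, kind_of(mnemonic))
    return producers
```

Given the listing ["ldq t9, 144(sp)", "mulq t0, t1, t8", "subq t9, t8, t5"] and candidate index 2, the sketch reports t9 as produced by a load at index 0 and t8 as produced by a multiply at index 1. The first points at a possible cache miss and the second at a possible false positive, following the reasoning in the text.
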