Can the program run faster? -- A DCPI "one pager" / PJD / 27 September 2001

Developers and users alike want to know if a program is getting the most out of the machine and if the program can be made to execute faster. DCPI can help make this assessment for compute-intensive programs. DCPI is not the appropriate tool for I/O- or synchronization-related issues.

A retired instruction is an instruction that has successfully completed execution. It is a computational step that moves program execution ahead. A retired instruction is a very basic measure of "completed useful work." It makes sense, therefore, to compare the number of retired instructions against processor cycles, wasted (aborted) instructions, and events that may rob a program of performance.

Retired instructions per cycle and its reciprocal, cycles per retired instruction (also known as "CPI"), are rough measures of concurrency, i.e., how well the program is exploiting the machine resources. If the processor is retiring, on average, more than one instruction per cycle, then concurrency is being realized.

Architectural research shows that 3.3 retires per cycle can be achieved in dense integer code with loads that hit in the L1 D-cache. Practically, this rate is unlikely to be sustained. Therefore, a retire/cycle ratio that approaches 3 is very good for integer code. Similar analysis for floating point shows a maximum of 2.3 retires per cycle, again for only short periods. Thus, a retire/cycle ratio approaching 2 is very good for floating point code with load instructions hitting in the L1 D-cache.

The retire/cycle ratio can be applied to images, procedures and individual instructions. Use the command:

    dcpiprof -i -pm retired -event cycles

to identify the program images where most time is spent (largest number of cycles) or where the most work is done (largest number of retired instructions).

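
The retire-rate arithmetic described above can be sketched in a few lines of Python. This is only an illustration and not part of DCPI: the function names and the sample counts are made up for the example, and the targets of 3 and 2 retires per cycle are the rules of thumb quoted above for integer and floating point code.

```python
# Rule-of-thumb "very good" retire/cycle ratios quoted in the text.
TARGET_RETIRES_PER_CYCLE = {"integer": 3.0, "floating": 2.0}


def retire_rate(retired_samples, cycle_samples):
    """Return (retires per cycle, cycles per retire) from two sample counts."""
    if retired_samples <= 0 or cycle_samples <= 0:
        raise ValueError("both sample counts must be positive")
    retires_per_cycle = retired_samples / cycle_samples
    cycles_per_retire = cycle_samples / retired_samples  # "CPI"
    return retires_per_cycle, cycles_per_retire


def fraction_of_target(retires_per_cycle, kind):
    """How close a measured ratio comes to the rule-of-thumb target."""
    return retires_per_cycle / TARGET_RETIRES_PER_CYCLE[kind]


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the calculation.
    rate, cpi = retire_rate(retired_samples=30000, cycle_samples=12000)
    share = fraction_of_target(rate, "integer")
    print(f"retires/cycle = {rate:.2f}, CPI = {cpi:.2f}, "
          f"{share:.0%} of the integer target")
```

A ratio above 1.0 retire per cycle means some concurrency is being realized; the closer it comes to the target for the kind of code involved, the less room there is likely to be for improvement.
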
For a given candidate image, use the command:

    dcpiprof -pm retired -event cycles <>

to display the number of retired instruction samples and processor cycle samples for each procedure in the specified image. Next, compute the ratio of retires per cycle for each procedure. Here is an example:

                                retired
    cycles       %    cum%   :count       %  procedure  retired/cycles
     73297  37.80%  37.80%   200731  46.96%  addmul     2.74
     15166   7.82%  45.62%    27936   6.54%  mod        1.84
     11816   6.09%  51.71%    12357   2.89%  divrem     1.04  <--
      9365   4.83%  56.54%    23107   5.41%  sub        2.46

Take note of any hot procedures with a low retired/cycle ratio. The third procedure above (divrem) is a possible candidate for further investigation.

The amount of improvement that can be made partly depends upon the size of the contribution of the image or procedure to the overall computation. In the example above, the procedure addmul accounts for ~38% of the total number of processor cycles (time) in the application. Although procedure divrem has a very poor retire/cycle ratio, it accounts for only 6% of the processor cycles. Thus, divrem must be substantially improved in order to affect the bottom line. Unless runtime is extremely critical, improving divrem is not worth the time and effort needed to speed it up.

An aborted instruction is an instruction that did not retire successfully. Instructions are aborted when branch mispredictions, replay traps and other events disrupt the flow of the processor pipeline, forcing work to be discarded. Use the commands:

    dcpiprof -i -pm ret+ret::/ret
    dcpiprof -pm ret+ret::/ret <>

to display the ratio of retired to non-retired (aborted) instructions for images and procedures within a specific image, respectively. The ratio is a rough measure of useful work completed to non-productive work that was thrown away. Due to hardware limitations, it is not possible to measure the number of wasted processor cycles.
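
Returning to the per-procedure example table, the ratio calculation and the "is it worth it?" judgment can be sketched in Python. This is an illustration under stated assumptions, not DCPI functionality: the cut-off of 1.5 retires per cycle for flagging a procedure is hypothetical, and the payoff estimate uses Amdahl's law, which the text does not name but which expresses the same point about a procedure's share of total cycles. Ratios are computed directly from the sample counts, so they may differ from the printed column in the last digit.

```python
# (procedure, cycle samples, share of all cycles, retired samples),
# copied from the example table.
ROWS = [
    ("addmul", 73297, 0.3780, 200731),
    ("mod",    15166, 0.0782,  27936),
    ("divrem", 11816, 0.0609,  12357),
    ("sub",     9365, 0.0483,  23107),
]

LOW_RATIO = 1.5  # hypothetical cut-off for "low retires per cycle"


def retires_per_cycle(retired, cycles):
    """Ratio of retired-instruction samples to cycle samples."""
    return retired / cycles


def overall_speedup(cycle_share, local_speedup):
    """Amdahl's law: whole-program speedup when only one part gets faster."""
    return 1.0 / ((1.0 - cycle_share) + cycle_share / local_speedup)


def report(rows=ROWS, local_speedup=2.0):
    """Per procedure: (name, ratio, flagged as low?, whole-program speedup)."""
    result = []
    for name, cycles, share, retired in rows:
        ratio = retires_per_cycle(retired, cycles)
        gain = overall_speedup(share, local_speedup)
        result.append((name, ratio, ratio < LOW_RATIO, gain))
    return result


if __name__ == "__main__":
    for name, ratio, low, gain in report():
        mark = "<--" if low else ""
        print(f"{name:8s} ratio {ratio:4.2f}  "
              f"overall speedup if made 2x faster: {gain:5.3f}  {mark}")
```

Under these assumptions only divrem is flagged, yet doubling its speed would make the whole program only about 3% faster, whereas the same improvement to addmul would yield roughly 23%. That is the sense in which divrem "must be substantially improved in order to affect the bottom line."
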