Event Based Sampling
Event based sampling is probably the most interesting feature in VTune to a microarchitect. It uses special hardware counters inside Intel MPUs to monitor a number of ‘events’ during the execution of a workload. Some examples of events include branch mispredictions, trace cache flushes or loads retired. A comprehensive list for different Intel x86 cores can be found starting on page 291 in Intel’s Software Development Manual. The events vary for each architecture and core. For example, predication events do not apply to any x86 microprocessor, while trace cache events only occur for Pentium 4 based cores. The user selects a set of events from this list and an application to analyze. VTune then collects data during the execution of the application.
The actual measurement mechanism is relatively simple and has very low overhead (< 1%). Periodically, the VTune analyzer collects data from the processor via an interrupt. The frequency at which the interrupts are issued is based either on a certain number of events occurring (e.g. every 2000 instructions retired) or on an external reference clock (usually the OS timer). The former mode is referred to as event-based sampling (EBS), while the latter is known as time-based sampling (TBS). Even in EBS mode, the VTune analyzer normally calibrates itself to sample at roughly 1ms intervals. When the interrupt occurs, the VTune analyzer reads a set of registers that describe the execution context, including the dynamic execution address in memory and the associated process ID, thread ID and module. If the source code is available, then the collector can even identify the line of code that the execution address maps to.
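To make the counting scheme concrete, here is a minimal Python sketch of event-based sampling. It is a toy simulation rather than VTune's actual implementation: the function name, the trace format and the workload figures are all assumptions made for illustration. A counter is armed to overflow after a fixed number of events, and each overflow records the code location that was executing at the time.

```python
from collections import Counter

def event_based_samples(trace, sample_after):
    """Return the contexts captured each time `sample_after` events occur.

    `trace` is an iterable of (context, event_count) pairs: a code location
    and the number of events (e.g. instructions retired) it generated.
    """
    samples = []
    remaining = sample_after            # counter armed to overflow after N events
    for context, events in trace:
        remaining -= events
        while remaining <= 0:           # overflow: the sampling interrupt fires
            samples.append(context)     # record where execution was
            remaining += sample_after   # re-arm the counter
    return samples

# Hypothetical workload: the names and event counts are made up.
trace = [("render", 4000), ("physics", 2000), ("audio", 2000)] * 10
samples = event_based_samples(trace, sample_after=2000)
histogram = Counter(samples)
print(histogram.most_common())
```

Code that generates more events collects proportionally more samples, which is why the resulting histogram approximates where the events actually occur. In a real workload the attribution is statistical rather than exact.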
Each experiment can take several runs of an application to actually collect all the requested data, depending on how many events are to be measured. Unfortunately, each CPU has a fixed number of registers (4 for Woodcrest, but 12 for Montecito) used for event sampling, and these are subject to certain restrictions; some events simply cannot be measured simultaneously. VTune analyzer can gather quite a bit of data, but it does have its limits. At a 1ms (1kHz) target resolution, any workload running more than 25 minutes becomes somewhat problematic. That much data is too large to be easily manipulated and displayed; usually this is addressed by decreasing the target sampling frequency.
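The arithmetic behind these limits can be sketched in a few lines of Python. The helper names are hypothetical, and the run count is only a lower bound: it assumes any event can use any counter, which the restrictions mentioned above do not guarantee.

```python
import math

def min_runs_needed(num_events, num_counters):
    """Lower bound on application runs needed to measure every event."""
    return math.ceil(num_events / num_counters)

def total_samples(duration_s, samples_per_second=1000):
    """Samples collected at a fixed rate; 1000 per second is a 1ms interval."""
    return duration_s * samples_per_second

# Ten requested events on a 4-counter core versus a 12-counter core.
print(min_runs_needed(10, 4))    # 3
print(min_runs_needed(10, 12))   # 1

# A 25 minute workload sampled every 1ms.
print(total_samples(25 * 60))    # 1500000
```

On these assumptions a 25 minute run at a 1ms interval produces about 1.5 million samples, and halving the sampling frequency halves that figure.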
The results can be examined on a per process, per thread or per module basis to find existing bottlenecks. Figure 1 shows per process results from a sample run. EBS also has ‘hot spot’ analysis, which can find and identify bottlenecks at critical relative virtual addresses, functions, source lines or classes (although the latter three require debug information). For some bottlenecks, VTune also offers developers suggestions on improving performance. For example, trace cache flushes are very problematic for Pentium 4 derived CPUs, because the Pentium 4 needs the trace cache to issue more than one instruction per cycle. VTune analyzer includes guidelines so developers will know if they are experiencing too many trace cache flushes in a particular workload, and if so, which module is causing problems. Then the programmer can pinpoint which particular instructions are causing trace cache flushes and hopefully eliminate or reduce said instructions.
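The different views are essentially the same samples grouped by a different field. The following Python sketch illustrates the idea; the record layout, process names and addresses are invented for the example and do not reflect VTune's data format.

```python
from collections import Counter

# Hypothetical sample records, one per sampling interrupt.
samples = [
    {"process": "game.exe", "thread": 1, "module": "game.exe", "address": 0x401000},
    {"process": "game.exe", "thread": 1, "module": "d3d9.dll", "address": 0x7A2010},
    {"process": "game.exe", "thread": 2, "module": "game.exe", "address": 0x401000},
    {"process": "game.exe", "thread": 2, "module": "game.exe", "address": 0x402F40},
    {"process": "audiosrv.exe", "thread": 7, "module": "audio.dll", "address": 0x5B0000},
]

def hotspots(samples, key):
    """Count samples per value of `key`, with the hottest entries first."""
    return Counter(sample[key] for sample in samples).most_common()

print(hotspots(samples, "process"))   # per process view
print(hotspots(samples, "module"))    # per module view
print(hotspots(samples, "address"))   # hot spots by address
```

Mapping an address to a function or source line would additionally need the debug information mentioned above, which this sketch leaves out.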
Figure 1 – Output from Event Based Sampling
Figure 1 shows an example of EBS used in conjunction with a popular game, Bethesda’s Oblivion. With a single run, both instructions retired and clock ticks were recorded, which together yield cycles per instruction (and a surprisingly high CPI at that!). As an aside, advanced users can actually create their own customized events for sampling. This requires a fair knowledge of Intel’s microarchitecture and some assembly programming, but could be quite interesting.
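As a rough illustration of how CPI falls out of the two measurements, the Python sketch below scales each sample count by the number of events between samples and divides. The function names and all the numbers are assumptions for the example, not values taken from Figure 1.

```python
def estimated_events(sample_count, sample_after):
    """Estimate total events, assuming each sample stands for `sample_after` events."""
    return sample_count * sample_after

def cycles_per_instruction(clockticks, instructions_retired):
    """CPI is clock ticks divided by instructions retired."""
    if instructions_retired == 0:
        raise ValueError("no instructions retired; CPI is undefined")
    return clockticks / instructions_retired

# Hypothetical sample counts for one process.
clockticks = estimated_events(sample_count=3000, sample_after=2_000_000)
instructions = estimated_events(sample_count=1000, sample_after=2_000_000)
print(cycles_per_instruction(clockticks, instructions))   # 3.0
```

When both events use the same sample-after value, the ratio of the raw sample counts gives the same result.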