When Intel released the first Core 2 Duo, it was a quantum leap in performance and efficiency compared to the previous P4-based designs. After two years of lagging behind AMD's K8 in gaming performance, Intel took the lead in rather dramatic fashion, with a 2.4GHz Core 2 Duo often outperforming a 2.8GHz Athlon 64 FX-62. When July 16th, 2006 rolled around, there were probably dozens of reviews testifying to this fact with a broad spectrum of gaming benchmarks.
Testing performance is valuable, as it helps consumers decide which products to purchase. However, it often does little to explain the results. In some cases the explanation is clear – for instance, many people know that the SPECfp sub-tests are extremely bandwidth dependent and likely to show a performance increase for anything that improves memory bandwidth.
A while back, we wrote an overview of VTune, Intel's performance analysis tool. AMD offers a similar tool, CodeAnalyst. In this review we will use these tools to examine several gaming benchmarks at various settings and explore the reasons for the performance differences between the dual-core AMD K8 and Intel's Core 2 Duo. This first part focuses on comparable aspects of the two microarchitectures, such as overall IPC, branch prediction and the cache hierarchies. The second part will focus on microarchitecture-specific aspects of each CPU (such as macro-op fusion in the Core 2).
To our knowledge, this is the first online review to analyze benchmark performance using event-based sampling (via VTune and CodeAnalyst). Consequently, the methods used are somewhat unique and merit special attention.
VTune and CodeAnalyst can automate repeated runs of a single executable under Windows. To automate each benchmark run, we used Visual Basic .NET scripting to send keystrokes to the various applications; we could not readily automate mouse input, which removed many benchmarks from consideration.
First, the benchmarks were run manually (i.e. without scripting) to measure the overhead of event-based sampling, which proved to be relatively minor. Before collecting sampling data, each benchmark was run once without any sampling as a warm-up.
The next step was to collect the actual event-based sampling data at a 1KHz resolution (one sample every 1ms). At this point, the methods for VTune and CodeAnalyst diverged sharply due to feature differences between the two tools.
The K8 can measure any four events simultaneously via its performance counters, while the Core 2 can only measure certain event combinations (depending on which particular counters are needed). As a result, the number of test runs required differed between the two platforms.
Second, the rate at which each tool samples the event counters is set in a different way. CodeAnalyst is manually configured to sample the event counters at a rate specified by the user (i.e. every N events). In contrast, VTune is by default given a time-based rate to target (e.g. 1KHz sampling – every 1ms); VTune then conducts a dry run with event-based sampling and, based on those results, calculates how often it must sample in order to achieve the targeted frequency in the later full run where data is collected. The major difference is that for CodeAnalyst, the user must do these calculations themselves to find the event frequency and sampling interval that correspond to 1KHz (sampling much more often than 1KHz produces too much data).
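The calculation for CodeAnalyst is simple but easy to get wrong. A minimal sketch of the arithmetic (the function name and the event rate below are our own illustrative assumptions, not values from our test setup):

```python
# Convert an estimated event rate into the "sample every N events"
# interval that an event-based profiler like CodeAnalyst expects,
# targeting roughly 1 kHz (one sample per millisecond).
# Hypothetical helper for illustration only.

def events_per_sample(event_rate_hz: float, target_sample_hz: float = 1000.0) -> int:
    """Return N such that sampling every N events yields ~target_sample_hz."""
    if event_rate_hz <= 0:
        raise ValueError("event rate must be positive")
    return max(1, round(event_rate_hz / target_sample_hz))

# Example: a CPU retiring ~2.6 billion instructions per second fires the
# instructions-retired event ~2.6e9 times/s, so sample every ~2.6M events:
n = events_per_sample(2.6e9)  # → 2_600_000
```

The same estimate must be repeated for every counter of interest, since a rarely firing event (e.g. L2 misses) needs a much smaller interval than instructions retired to hit the same sampling frequency.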
Gathering the event-based sampling data with VTune proved extremely easy, since each run of a benchmark would produce almost exactly the same results for the event counters of interest. Alas, this was not the case for CodeAnalyst. The run-to-run variation was significant (e.g. Run 1 might show an IPC of 0.5 for a benchmark, while Run 2 showed 0.9 on the same benchmark), and some results seemed extremely unlikely. After extensive consultation with architects at Intel and AMD and with the developers of CodeAnalyst and VTune, we re-ran the tests with CodeAnalyst half a dozen times before getting results that seemed mostly correct, with a few outliers. Throughout this article, we will point out specific data that seems questionable in our analysis and commentary, for the benefit of the reader.
Once the data had been collected, the last step was to convert the raw data into meaningful ratios. For example, the raw number of L2 cache miss samples is not particularly informative by itself; the number of cache misses per instruction retired, however, is a very useful metric. In general, most of the metrics we use are normalized by the number of instructions retired.
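As a rough sketch of that normalization (the function names and all numbers below are hypothetical, chosen only to illustrate the arithmetic, not taken from our results):

```python
# Scale raw sample counts back to approximate event counts using each
# counter's sampling interval, then normalize by instructions retired.
# Illustrative sketch only; real tools report these values directly.

def event_count(samples: int, events_per_sample: int) -> int:
    """Approximate total events from a sample count and sampling interval."""
    return samples * events_per_sample

def per_instruction(event_samples: int, event_interval: int,
                    instr_samples: int, instr_interval: int) -> float:
    """Events per instruction retired, e.g. L2 misses per instruction."""
    instrs = event_count(instr_samples, instr_interval)
    return event_count(event_samples, event_interval) / instrs

# e.g. 1200 L2-miss samples taken every 10,000 misses, against 5000
# instruction samples taken every 2,000,000 instructions retired:
miss_rate = per_instruction(1200, 10_000, 5000, 2_000_000)
# → 0.0012 L2 misses per instruction
```

Normalizing per instruction retired makes the metrics comparable across runs of different lengths, and across the two CPUs despite their different clock speeds.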