Background and Platform Description
The IBM PC was designed from its inception to be inexpensive, and system memory has always been implemented using DRAM, which is both less expensive and slower than SRAM. When the PC was introduced, DRAM implementations of the time were able to handle the bus speeds of the 8086/8088, and even the faster 80286 processor (up to 12MHz, or 80ns). With the introduction of the 80386 processor, clock speeds of 20MHz, 25MHz and even 33MHz became possible, which was faster than the DRAM memories then available. At that time, the major bottleneck was latency, which is the time lag between the request for data and its arrival at the CPU.
To help reduce this latency, designers put a small amount of SRAM between the system memory and the processor to hold the most recently used data. SRAM has a much lower latency than DRAM, but is also much more expensive. By using only a very small amount of SRAM, running at the same speed as the DRAM, the cost could be contained.
Cache theory works on the supposition that recently used data is likely to be used again (temporal locality), and that data logically adjacent to it is more likely to be accessed than data farther away (spatial locality). For this reason, data in the cache is stored in blocks (called ‘lines’) of about 32 bytes, which takes four 64-bit memory transfers to fill. Thus, typical PC memory timings are given as four sets of memory bus cycles, such as 5-2-2-2 or 6-1-1-1, with the first number showing the initial latency. Note that these numbers are memory bus cycles rather than processor cycles, which will be much higher (i.e., multiply these numbers by the clock ratio to get an idea of what the processor sees in terms of latency).
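The conversion from memory bus cycles to processor cycles can be sketched as follows. The 5-2-2-2 timing pattern and the 7x clock multiplier below are illustrative values (e.g., a 700MHz CPU on a 100MHz memory bus), not measurements from the test systems in this article:

```python
# Sketch: scale each memory-bus cycle count by the CPU/bus clock ratio
# to see the latency as the processor experiences it.

def line_fill_in_cpu_cycles(bus_timings, clock_multiplier):
    """Convert per-transfer bus cycles into processor cycles."""
    return [t * clock_multiplier for t in bus_timings]

timings = [5, 2, 2, 2]  # initial latency, then three more burst transfers
cpu_cycles = line_fill_in_cpu_cycles(timings, 7)  # assumed 7x multiplier
print(cpu_cycles)        # [35, 14, 14, 14]
print(sum(cpu_cycles))   # 77 processor cycles to fill one cache line
```

The first transfer dominates, which is why the initial-latency figure matters more than the burst numbers.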
The first cache implementations were put on the motherboard running at the same bus speed as the system memory, because that was cheaper than trying to put the cache into the CPU package. When the 80486 was introduced, it included 8K of SRAM cache embedded in the chip, which ran at processor speed. To differentiate it from the cache that might be on the motherboard, it was called Level 1 (L1) cache, while the motherboard cache was called L2. With DRAM speeds limiting the speed of the memory bus, processors began to run at multiples of the memory bus speed (i.e., DX2, DX4, etc.), making the cache even more important with regard to performance.
This memory hierarchy continues to be a very cost-effective way to reduce latency; however, as processor speeds have ramped up extremely fast, cache implementations have had to evolve rapidly as well – first getting larger, then moving closer to the processor. Today, processors may include both L1 and L2 cache on the chip, with manufacturers implementing various cache organization methods in order to provide the best performance at the lowest cost.
With the introduction of the new Athlon processors (vs. the Classic Athlon), AMD has moved to a cache implementation called ‘exclusive’, or ‘victim’ cache. In the traditional cache hierarchy, the data residing at each higher level is mirrored at the lower level, and each lower level is increasingly larger in size to hold this mirrored data, plus additional data that might be referenced. In the exclusive cache, the data is not mirrored, so a lower level cache might actually be smaller than the higher level, yet still provide some performance advantage.
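The capacity difference between the two designs can be illustrated with a small sketch. The sizes below are example values, not tied to any specific processor:

```python
# Sketch of effective cache capacity under the two hierarchy designs.
# Inclusive: the L2 mirrors everything in L1, so only unique L2 lines
# add capacity. Exclusive (victim): L1 and L2 hold disjoint lines,
# so their sizes add.

def effective_capacity_kb(l1_kb, l2_kb, exclusive):
    return l1_kb + l2_kb if exclusive else l2_kb

# With a 128K L1 and a 64K L2 (example figures):
print(effective_capacity_kb(128, 64, exclusive=False))  # 64
print(effective_capacity_kb(128, 64, exclusive=True))   # 192
```

In the inclusive case an L2 smaller than the L1 would be pointless (it could hold nothing the L1 doesn't already have), while in the exclusive case it still adds real capacity.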
On all Athlon processors, AMD included a relatively large 128K L1 cache vs. a 32K L1 cache on the Intel Pentium II/III and Celeron processors. L1 cache provides the fastest access, not only because the latency is very small (perhaps 2 processor cycles), but also because the entire 256 bits are transferred at once (vs. 64 bits on the memory bus). In this case more is usually better – though there is a point where the latency increases because of the overhead of searching the cache. The L2 cache on the Athlon processors is 256K (for the Tbird) and 64K (on the Duron). This initially caused a bit of confusion for some, because they didn’t understand the exclusive cache concept, and wondered how the L2 cache could be smaller. Timings for L2 cache are slower than L1, because the bus width is 64 bits (requiring four data transfers), and the initial latency is a few cycles longer because it is searched after L1.
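Putting the figures above together – a 32-byte (256-bit) cache line, the bus widths, and the exclusive L1 + L2 totals:

```python
# Rough arithmetic from the figures quoted above.

LINE_BITS = 32 * 8  # a 32-byte cache line is 256 bits

# Transfers needed to move one line across each bus width:
print(LINE_BITS // 256)  # 1 transfer over the 256-bit L1 bus
print(LINE_BITS // 64)   # 4 transfers over the 64-bit L2/memory bus

# Effective on-chip cache with AMD's exclusive design (sizes in KB):
print(128 + 256)  # 384 for the Tbird  (128K L1 + 256K L2)
print(128 + 64)   # 192 for the Duron  (128K L1 + 64K L2)
```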
In order to determine what effect the smaller L2 cache of the Duron has on performance, I acquired a 700MHz version of both the Duron and Tbird processors. By doing this, I was able to test both processors on exactly the same hardware and software configuration by simply dropping in each processor, thus isolating any differences to only the L2 cache size.
All tests were run on Windows 98, using the ‘Setup Defaults’ in the BIOS. The following hardware and drivers were used:
Hardware:
- AMD Athlon (Tbird) 700MHz
- AMD Duron 700MHz
- Gigabyte GA-7ZM motherboard
- Nvidia Vanta video card w/8MB
- Seagate ST320423A 20.4GB HDD
- Crucial Technology PC133 SDRAM
Operating System and Drivers:
- Windows 98 SE
- VIA INF driver V1.02
- VIA 4x AGP/133 Driver V4.02
- VIA All-in-One Driver V1.01
- DirectX 7.0a