Memory Tests
In the memory tests, several operations are performed over a range of block sizes in order to measure the speed of the L1 and L2 caches, as well as system memory. The operations are read, write, read-modify-write, and random access. According to the help file, “These tests are implemented in the same manner as memory accesses in normal applications, and are not optimized to achieve maximum throughput. However, since no other tasks are run while performing the memory transfers, quite high throughput numbers can be expected.”
Several Video Memory tests are also included, but I did not record their results: after several runs on different platforms with the same video card, I found the numbers to be very consistent. The overall memory score does, however, include the video test results, as follows:
“{ Read(3072*32+1536*32+384*16+48*1+6*1) + Write(3072*32+1536*32+384*16+48*1+6*1) + Modify(3072*64+1536*64+384*32+48*2+6*2) + Container(1536*64+768*64+384*32+96*4+48*2+6*2) + VideoMem(1*4+4*8+16*16+32*32) } / 160”
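As a rough check of how this formula combines the per-block results, here is a minimal Python sketch. It rests on several assumptions: that each product in the formula means "result for that block size in KB, times a weight", that "Container" refers to the Random Access test (the block sizes match), and that the VideoMem term can be left out because those results were not recorded. The names `WEIGHTS`, `p4_1400` and `partial_memory_score` are illustrative and not part of the benchmark.

```python
# Weights per test and block size (KB), read off the help-file formula.
# Assumption: "Container" is the Random Access test; VideoMem is omitted.
WEIGHTS = {
    "read":   {3072: 32, 1536: 32, 384: 16, 48: 1, 6: 1},
    "write":  {3072: 32, 1536: 32, 384: 16, 48: 1, 6: 1},
    "modify": {3072: 64, 1536: 64, 384: 32, 48: 2, 6: 2},
    "random": {1536: 64, 768: 64, 384: 32, 96: 4, 48: 2, 6: 2},
}

# Results for the 1.4GHz Willamette, copied from the P4 Willamette table.
p4_1400 = {
    "read":   {3072: 964.5, 1536: 964.5, 384: 964.5, 48: 5652.2, 6: 10449.2},
    "write":  {3072: 416.6, 1536: 417.7, 384: 420.3, 48: 4712.6, 6: 4651.1},
    "modify": {3072: 417.6, 1536: 416.7, 384: 425.9, 48: 3391.6, 6: 4458.2},
    "random": {1536: 921.9, 768: 920.9, 384: 922.8,
               96: 3769.2, 48: 3759.3, 6: 5350.0},
}


def partial_memory_score(results):
    """Weighted sum of the system-memory tests divided by 160 (no video term)."""
    total = 0.0
    for test, weights in WEIGHTS.items():
        for block_kb, weight in weights.items():
            total += results[test][block_kb] * weight
    return total / 160


print(round(partial_memory_score(p4_1400), 1))
```

With these inputs the partial score comes out at roughly 2497, a little under the reported Memory Overall of 2523 for that processor, which is consistent with the omitted video term making up the difference.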
Since all platforms used the same video card, this should not be a problem when comparing the other components. According to the help file, a high-end PC should get around 5000 points as its total Memory score. Because these tests were all performed on SDRAM-based systems, we can’t expect to come near the scores of a ‘high end PC’, but in this particular scenario all we really care about is the relative scores between different components and setups. Let’s see how this looks:
| P4 Willamette | 1.4GHz | 1.6GHz | 1.8GHz | 2.0GHz |
| --- | --- | --- | --- | --- |
| Block Read – 3072KB | 964.5 | 966.6 | 968 | 969.7 |
| Block Read – 1536KB | 964.5 | 966.6 | 968.1 | 969.1 |
| Block Read – 384KB | 964.5 | 965.7 | 984.6 | 977.9 |
| Block Read – 48KB | 5652.2 | 6475.5 | 7287.8 | 8094.1 |
| Block Read – 6KB | 10449.2 | 11944.7 | 13200.1 | 14919.2 |
| Block Write – 3072KB | 416.6 | 402.2 | 401 | 399.2 |
| Block Write – 1536KB | 417.7 | 400.7 | 400.4 | 400.3 |
| Block Write – 384KB | 420.3 | 403.9 | 405.4 | 402.3 |
| Block Write – 48KB | 4712.6 | 5390.3 | 6059.7 | 6737.7 |
| Block Write – 6KB | 4651.1 | 5316.4 | 5973.4 | 6644.9 |
| Block Modify – 3072KB | 417.6 | 404.2 | 403.6 | 403.6 |
| Block Modify – 1536KB | 416.7 | 404.6 | 402.6 | 404.6 |
| Block Modify – 384KB | 425.9 | 403.6 | 402.3 | 403 |
| Block Modify – 48KB | 3391.6 | 3684.6 | 4653.1 | 4616.3 |
| Block Modify – 6KB | 4458.2 | 5097.5 | 5903.2 | 6373.3 |
| Random Access – 1536KB | 921.9 | 925.1 | 932.2 | 936 |
| Random Access – 768KB | 920.9 | 922.6 | 930.4 | 934 |
| Random Access – 384KB | 922.8 | 921.2 | 928 | 932.9 |
| Random Access – 96KB | 3769.2 | 4323.6 | 4849.2 | 5404.5 |
| Random Access – 48KB | 3759.3 | 4331.7 | 4842.2 | 5414.1 |
| Random Access – 6KB | 5350 | 6088.3 | 6769.2 | 7326.7 |
| Memory Overall | 2523 | 2553 | 2613 | 2660 |
| P4 Northwood | 2.0A GHz | 2.2 GHz | 2.4 GHz |
| --- | --- | --- | --- |
| Block Read – 3072KB | 970.1 | 971.5 | 973.4 |
| Block Read – 1536KB | 969.9 | 971.3 | 973.3 |
| Block Read – 384KB | 7833.8 | 8600.6 | 9390.4 |
| Block Read – 48KB | 8097 | 8877.5 | 9715.4 |
| Block Read – 6KB | 14911.3 | 16407.2 | 17904 |
| Block Write – 3072KB | 398.7 | 401.5 | 418.1 |
| Block Write – 1536KB | 399 | 402.6 | 419 |
| Block Write – 384KB | 6572.1 | 7223.7 | 7912.7 |
| Block Write – 48KB | 6738 | 7412.9 | 8085.8 |
| Block Write – 6KB | 6645.2 | 7309.2 | 7973.1 |
| Block Modify – 3072KB | 403.7 | 405.8 | 418.9 |
| Block Modify – 1536KB | 403.4 | 406.1 | 420.3 |
| Block Modify – 384KB | 4546.6 | 4952 | 5735.8 |
| Block Modify – 48KB | 4610.8 | 5056 | 5633.4 |
| Block Modify – 6KB | 6367.2 | 7002.9 | 7638.7 |
| Random Access – 1536KB | 935.7 | 940.1 | 942.9 |
| Random Access – 768KB | 933.7 | 938.2 | 941.7 |
| Random Access – 384KB | 5232.3 | 5776.7 | 6264 |
| Random Access – 96KB | 5420.1 | 5976.2 | 6471.2 |
| Random Access – 48KB | 5432.9 | 6012.9 | 6428.9 |
| Random Access – 6KB | 7437.1 | 8096.7 | 9100.9 |
| Memory Overall | 3410 | 3551 | 3724 |
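One way to read the two P4 tables above is to compare how much a result grows against how much the clock grows. The sketch below is only an illustration under stated assumptions: it uses the Block Read rows for 6KB and 3072KB, takes the Willamette value at 2.0GHz, and the helper names `per_ghz` and `spread` are made up for this example.

```python
# Clock speed (GHz) -> Block Read result, copied from the two P4 tables.
# 1.4-2.0 are Willamette values, 2.2 and 2.4 are Northwood values.
read_6kb = {1.4: 10449.2, 1.6: 11944.7, 1.8: 13200.1,
            2.0: 14919.2, 2.2: 16407.2, 2.4: 17904.0}
read_3072kb = {1.4: 964.5, 1.6: 966.6, 1.8: 968.0,
               2.0: 969.7, 2.2: 971.5, 2.4: 973.4}


def per_ghz(series):
    """Divide each result by the clock speed it was measured at."""
    return {ghz: value / ghz for ghz, value in series.items()}


def spread(values):
    """Ratio of the largest to the smallest value; 1.0 means perfectly flat."""
    values = list(values)
    return max(values) / min(values)


# Small block: the raw result grows about as much as the clock does,
# so the per-GHz figure stays nearly flat.
print(spread(read_6kb.values()), spread(per_ghz(read_6kb).values()))

# Large block: the raw result itself stays nearly flat.
print(spread(read_3072kb.values()))
```

A per-GHz spread close to 1.0 suggests the result is tied to the core clock, while a raw spread close to 1.0 suggests something other than the core clock is the limit.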
These results are very interesting, at least from my point of view. First, we can see that performance increases very smoothly from 1.4GHz all the way through 2.4GHz when the block size is 6KB, 48KB or 96KB (random access). This is because L1 and L2 cache speed increases with clock rate. Since the L1 data cache of the P4 is 8KB, you can see the read throughput roughly double between 48KB and 6KB. Second, note that the block sizes too large to fit in the L2 cache give essentially the same throughput regardless of processor speed, since we are now being limited by the FSB. The third thing to notice is better shown in the following table:
| Test | Willamette 2.0 GHz | Northwood 2.0A GHz |
| --- | --- | --- |
| Block Read – 3072KB | 969.7 | 970.1 |
| Block Read – 1536KB | 969.1 | 969.9 |
| Block Read – 384KB | 977.9 | 7833.8 |
| Block Read – 48KB | 8094.1 | 8097 |
| Block Read – 6KB | 14919.2 | 14911.3 |
| Block Write – 3072KB | 399.2 | 398.7 |
| Block Write – 1536KB | 400.3 | 399 |
| Block Write – 384KB | 402.3 | 6572.1 |
| Block Write – 48KB | 6737.7 | 6738 |
| Block Write – 6KB | 6644.9 | 6645.2 |
| Block Modify – 3072KB | 403.6 | 403.7 |
| Block Modify – 1536KB | 404.6 | 403.4 |
| Block Modify – 384KB | 403 | 4546.6 |
| Block Modify – 48KB | 4616.3 | 4610.8 |
| Block Modify – 6KB | 6373.3 | 6367.2 |
| Random Access – 1536KB | 936 | 935.7 |
| Random Access – 768KB | 934 | 933.7 |
| Random Access – 384KB | 932.9 | 5232.3 |
| Random Access – 96KB | 5404.5 | 5420.1 |
| Random Access – 48KB | 5414.1 | 5432.9 |
| Random Access – 6KB | 7326.7 | 7437.1 |
| Memory Overall | 2660 | 3410 |
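The rows where the two 2.0GHz processors differ can also be picked out mechanically from the table above. This is a minimal sketch rather than anything the benchmark provides: the function name `changed_rows` and the 5% tolerance are arbitrary choices made for illustration.

```python
# (test, block size in KB): (2.0 GHz Willamette, 2.0A GHz Northwood),
# copied from the comparison table above.
rows = {
    ("Block Read", 3072): (969.7, 970.1),
    ("Block Read", 1536): (969.1, 969.9),
    ("Block Read", 384): (977.9, 7833.8),
    ("Block Read", 48): (8094.1, 8097.0),
    ("Block Read", 6): (14919.2, 14911.3),
    ("Block Write", 3072): (399.2, 398.7),
    ("Block Write", 1536): (400.3, 399.0),
    ("Block Write", 384): (402.3, 6572.1),
    ("Block Write", 48): (6737.7, 6738.0),
    ("Block Write", 6): (6644.9, 6645.2),
    ("Block Modify", 3072): (403.6, 403.7),
    ("Block Modify", 1536): (404.6, 403.4),
    ("Block Modify", 384): (403.0, 4546.6),
    ("Block Modify", 48): (4616.3, 4610.8),
    ("Block Modify", 6): (6373.3, 6367.2),
    ("Random Access", 1536): (936.0, 935.7),
    ("Random Access", 768): (934.0, 933.7),
    ("Random Access", 384): (932.9, 5232.3),
    ("Random Access", 96): (5404.5, 5420.1),
    ("Random Access", 48): (5414.1, 5432.9),
    ("Random Access", 6): (7326.7, 7437.1),
}


def changed_rows(rows, tolerance=0.05):
    """Return the rows whose two values differ by more than `tolerance`."""
    return [key for key, (willamette, northwood) in rows.items()
            if abs(northwood / willamette - 1.0) > tolerance]


for test, block_kb in changed_rows(rows):
    print(f"{test} {block_kb}KB")
```

With these values, only the four 384KB rows are reported.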
The 384KB blocks fit in the Northwood’s 512KB L2 cache but not in the Willamette’s 256KB, and those are the rows where the two processors diverge. This, combined with the CPU scores shown earlier, shows that the larger L2 cache of the Northwood processor is responsible for the performance improvement over the Willamette, and little (if anything) else, at least if we can trust the CPU results as a good measure of all CPU features. OK, so let’s look at the 300MHz P6 processors now:
| Test | PIII 300 | Cel. 300A | Cel. 300 |
| --- | --- | --- | --- |
| Block Read – 3072KB | 337.2 | 420.1 | 449.1 |
| Block Read – 1536KB | 337.2 | 420.8 | 448.9 |
| Block Read – 384KB | 909.7 | 420.1 | 448.8 |
| Block Read – 48KB | 911.9 | 1038.5 | 450.6 |
| Block Read – 6KB | 2216.9 | 2216.6 | 2215.4 |
| Block Write – 3072KB | 130.6 | 124.8 | 159.9 |
| Block Write – 1536KB | 132.2 | 121 | 159.9 |
| Block Write – 384KB | 252.5 | 124.4 | 159.2 |
| Block Write – 48KB | 254.3 | 572 | 151.6 |
| Block Write – 6KB | 1977.4 | 1977.5 | 1975.7 |
| Block Modify – 3072KB | 125.9 | 119.6 | 154.2 |
| Block Modify – 1536KB | 127.2 | 121 | 154.2 |
| Block Modify – 384KB | 253.2 | 131.9 | 153.2 |
| Block Modify – 48KB | 254.2 | 504.9 | 146.9 |
| Block Modify – 6KB | 1289.2 | 1288.8 | 1288.2 |
| Random Access – 1536KB | 175.7 | 203.2 | 198.5 |
| Random Access – 768KB | 175.6 | 203.1 | 198.6 |
| Random Access – 384KB | 448.1 | 203.1 | 198.2 |
| Random Access – 96KB | 458.4 | 545.6 | 198.7 |
| Random Access – 48KB | 459.6 | 570.2 | 198.2 |
| Random Access – 6KB | 801.7 | 801.6 | 800.9 |
| Memory Overall | 833 | 863 | 916 |
Surprised? I was, until I thought about it a little bit. Since the Celeron 300 has no L2 cache to maintain, the L2 cache lookup overhead is eliminated, so accesses to system memory are actually faster. From this, we can also see why the streaming instructions in SSE, which can bypass the caches, can provide a performance boost when used under the right circumstances. We can also clearly see where the full-speed L2 cache provides a benefit, as well as where the larger half-speed cache is faster. This latter point is obviously nothing surprising, but it is nice to have a ‘visual’ of this effect.
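To make that ‘visual’ a little more concrete, the sketch below looks for the block sizes at which Block Read throughput falls off sharply for each of the three P6 processors. It is only an illustration: the function name `throughput_drops` and the 1.5x threshold are assumptions chosen for this example, and the tested block sizes are too coarse to pin down an exact cache size.

```python
# Block Read results by block size (KB), copied from the P6 table above.
block_read = {
    "PIII 300":  {6: 2216.9, 48: 911.9, 384: 909.7, 1536: 337.2, 3072: 337.2},
    "Cel. 300A": {6: 2216.6, 48: 1038.5, 384: 420.1, 1536: 420.8, 3072: 420.1},
    "Cel. 300":  {6: 2215.4, 48: 450.6, 384: 448.8, 1536: 448.9, 3072: 449.1},
}


def throughput_drops(by_size, factor=1.5):
    """Return (smaller_kb, larger_kb) pairs where the result falls by more
    than `factor` when moving to the next larger tested block size."""
    sizes = sorted(by_size)
    return [(small, large) for small, large in zip(sizes, sizes[1:])
            if by_size[small] / by_size[large] > factor]


for cpu, results in block_read.items():
    print(cpu, throughput_drops(results))
```

Each reported pair brackets a point where the block stops fitting in some level of cache. All three processors drop between 6KB and 48KB; the PIII 300 drops again between 384KB and 1536KB, the Celeron 300A between 48KB and 384KB, and the Celeron 300 shows no second drop at all.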