Article: Escape From the Planet of x86
By: David Wang (dwang.delete@this.realworldtech.com), June 19, 2003 5:25 pm
Room: Moderated Discussions
Bill Todd (billtodd@metrocast.net) on 6/19/03 wrote:
---------------------------
>mas (mas769@hotmail.com) on 6/19/03 wrote:
>---------------------------
>>Bill Todd (billtodd@metrocast.net) on 6/19/03 wrote:
>>---------------------------
>>>Alberto (albertobu@libero.it) on 6/18/03 wrote:
>>>
>>>...
>>>
>>>>If you read:
>>>>http://www.intel.com/design/itanium2/download/14_4_slides_r31_nsn.htm
>>>
>>>One more interesting tidbit that I noticed in that presentation was the L3 latency
>>>of 14 clock cycles. My recollection is that McKinley's L3 latency was 12 cycles
>>>(though it may have been more in some situations - ISTR the range 12 - 15 cycles
>>>being mentioned once): does this indicate a slightly sub-linear improvement in L3 performance for Madison?
>>>
>>
>>Well it all depends on how that offsets against the improvements which are a 50%
>>bandwidth improvement in all the caches and a doubling of the set associativity
>>of the L3 in particular (12->24). My wag is that overall the cache structure has been improved, clock for clock.
>
>A 50% improvement in bandwidth would seem to translate to a 0% improvement 'clock
>for clock'. And ISTR seeing some moderately authoritative source for a rule of
>thumb that 8-way associativity yielded performance close enough to full associativity
>that further effort was not justified (which would seem to make sense, given that
>for the colder data in the cache random replacement works about as well as LRU replacement)
>- so if going to 24-way cost any of that increased latency the trade-off would seem questionable.
>
>However, there's no getting around the fact that at least for the high-end product
>the L3 cache size doubled: that certainly helps on average (though one could argue
>that it would have needed to increase *some* in size just to compensate for the
>fact that memory latency remained the same, so whether it's enough to make overall
>performance increase linearly with the clock rate remains to be seen).
>
>My point was that if the size increase came at the expense of increased latency
>(in terms of clock cycles, not absolute) then it was more of a mixed blessing than would otherwise have been the case.
Size increases always come at the expense of increased latency. If you want to hang more bits on the same wordline or the same bitline, array access will be slower. If you have more banks/segments/arrays, then getting to any individual bank/segment/array takes longer.
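To put a rough number on that intuition, here is a toy back-of-the-envelope model (my own sketch, not anything from the McKinley/Madison design): in a roughly square array, wordline/bitline length grows with the array edge, i.e. with the square root of capacity, so raw array delay scales roughly the same way. The constant k is invented purely for illustration.
-----------------------------------------------------------
import math

# Toy model: in a roughly square SRAM array, wordline/bitline length scales
# with the array edge, i.e. with sqrt(capacity), so raw array delay grows
# roughly the same way. The constant k is invented for illustration only.
def array_access_ns(capacity_kb, k=0.145):
    return k * math.sqrt(capacity_kb)

for mb in (3, 6):
    print(f"{mb}MB array: ~{array_access_ns(mb * 1024):.1f} ns raw array delay")
-----------------------------------------------------------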
There's a monkey wrench in this comparison: a process change is involved, so more than just "more cache" affected the (cycle-count) latency of the L3. In wall-clock terms the L3 actually got faster (15 cycles at 1.5 GHz = 10 ns vs. 12 cycles at 1 GHz = 12 ns); it just didn't speed up as much as the rest of the chip.
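The wall-clock arithmetic is simply cycles divided by clock frequency; a trivial sketch in Python:
-----------------------------------------------------------
# Wall-clock latency = cycles / clock frequency (cycles over GHz gives ns).
def latency_ns(cycles, clock_ghz):
    return cycles / clock_ghz

print(latency_ns(12, 1.0))   # McKinley L3: 12 cycles at 1.0 GHz -> 12.0 ns
print(latency_ns(15, 1.5))   # Madison  L3: 15 cycles at 1.5 GHz -> 10.0 ns
-----------------------------------------------------------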
http://cpus.hp.com/technical_references/isscc_2002/isscc_2002_1.shtml
-----------------------------------------------------------
2) The memory system incorporates 3 levels of caching optimized for low latency, high bandwidth and high density respectively. The pre-validated, 4 port 16KB L1D cache [1] is tightly coupled to the integer units to achieve the half cycle load. As a result, the less latency sensitive FPU directly interfaces to the L2D cache [1] with 4 82b load ports (6 cycle latency) and 2 82b store ports. The 3MB, 12 cycle latency L3 cache [1] is implemented with 135 separate "subarrays" that enable high density and the ability to conform to the irregular shape of the processor core with flexible subarray placement. Each level of on-chip cache has matched bandwidths at 32GB/s across the hierarchy (figure 20.6.3).
----------------------------------------------------------
L1 is optimized for latency, L2 for bandwidth, and L3 for density. Slightly longer L3 latency should be a good tradeoff when it buys you an even larger cache.
There are ways to keep the latency of larger caches from increasing in cycle count, but they all involve trading off die area for larger cells/drivers/repeaters/sense amps. Since the L3 design calls for density optimization, it does not seem worth trading away area/transistor count/power just to hold the latency at the same 12 cycles.
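As a rough illustration of that area-for-latency knob, here is a generic repeater-insertion sketch (not how the actual L3 wiring is done, and all constants are invented): splitting a long wire into repeated segments turns delay that grows with the square of the length into delay that grows roughly linearly, at the cost of repeater area and power.
-----------------------------------------------------------
# Distributed RC delay of one wire segment is ~0.5 * rc * len^2, so splitting
# a wire into N repeated segments trades repeater area/power for lower total
# delay. All constants here are invented for illustration.
def wire_delay_ns(length_mm, segments, rc_ns_per_mm2=0.1, repeater_ns=0.05):
    seg_len = length_mm / segments
    return segments * (0.5 * rc_ns_per_mm2 * seg_len ** 2 + repeater_ns)

for n in (1, 2, 4, 8):
    print(f"{n} segment(s): {wire_delay_ns(10.0, n):.2f} ns")
-----------------------------------------------------------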