Article: Escape From the Planet of x86
By: Bill Todd (billtodd.delete@this.metrocast.net), June 19, 2003 6:00 pm
Room: Moderated Discussions
David Wang (dwang@realworldtech.com) on 6/19/03 wrote:
---------------------------
>Bill Todd (billtodd@metrocast.net) on 6/19/03 wrote:
>---------------------------
>>mas (mas769@hotmail.com) on 6/19/03 wrote:
>>---------------------------
>>>Bill Todd (billtodd@metrocast.net) on 6/19/03 wrote:
>>>---------------------------
>>>>Alberto (albertobu@libero.it) on 6/18/03 wrote:
>>>>
>>>>...
>>>>
>>>>>If you read:
>>>>>http://www.intel.com/design/itanium2/download/14_4_slides_r31_nsn.htm
>>>>
>>>>One more interesting tidbit that I noticed in that presentation was the L3 latency
>>>>of 14 clock cycles. My recollection is that McKinley's L3 latency was 12 cycles
>>>>(though it may have been more in some situations - ISTR the range 12 - 15 cycles
>>>>being mentioned once): does this indicate a slightly sub-linear improvement in L3 performance for Madison?
>>>>
>>>
>>>Well it all depends on how that offsets against the improvements which are a 50%
>>>bandwidth improvement in all the caches and a doubling of the set associativity
>>>of the L3 in particular (12->24). My wag is that overall the cache structure has been improved, clock for clock.
>>
>>A 50% improvement in bandwidth would seem to translate to a 0% improvement 'clock
>>for clock'. And ISTR seeing some moderately authoritative source for a rule of
>>thumb that 8-way associativity yielded performance close enough to full associativity
>>that further effort was not justified (which would seem to make sense, given that
>>for the colder data in the cache random replacement works about as well as LRU replacement)
>>- so if going to 24-way cost any of that increased latency the trade-off would seem questionable.
>>
>>However, there's no getting around the fact that at least for the high-end product
>>the L3 cache size doubled: that certainly helps on average (though one could argue
>>that it would have needed to increase *some* in size just to compensate for the
>>fact that memory latency remained the same, so whether it's enough to make overall
>>performance increase linearly with the clock rate remains to be seen).
>>
>>My point was that if the size increase came at the expense of increased latency
>>(in terms of clock cycles, not absolute) then it was more of a mixed blessing than would otherwise have been the case.
>
>Size increases always come at the expense of increased latency. If you want to
>hang more bits on the same wordline or the same bitline, array access will be slower.
>If you have more banks/segments/arrays, then getting access to any individual bank/segment/array will take longer.
>
>There's a monkey wrench in this comparison in that there's a process change involved,
>and there's more than just "more cache" that impacted the (cycle-count) latency
>of the L3 cache. The L3 cache is actually faster in wall-clock terms (15 cycles at
>1.5 GHz = 10ns, 12 cycles at 1 GHz = 12ns); it just didn't get sped up as much as the rest of the chip.
>
>http://cpus.hp.com/technical_references/isscc_2002/isscc_2002_1.shtml
>
>-----------------------------------------------------------
>
>2) The memory system incorporates 3 levels of caching optimized for low latency,
>high bandwidth and high density respectively. The pre-validated, 4 port 16KB L1D
>cache [1] is tightly coupled to the integer units to achieve the half cycle load.
>As a result, the less latency sensitive FPU directly interfaces to the L2D cache
>[1] with 4 82b load ports (6 cycle latency) and 2 82b store ports. The 3MB, 12 cycle
>latency L3 cache [1] is implemented with 135 separate "subarrays" that enable high
>density and the ability to conform to the irregular shape of the processor core
>with flexible subarray placement. Each level of on-chip cache has matched bandwidths
>at 32GB/s across the hierarchy (figure 20.6.3).
>----------------------------------------------------------
>
>L1 is optimized for latency, L2 is optimized for bandwidth, and L3 is optimized for
>density. Slightly longer latency for L3 should be a good tradeoff when it gets you the even larger cache.
>
>There are ways to keep the latency of larger caches from increasing in cycle count,
>but they all involve trading off die area for larger cells/drivers/repeaters/sense
>units. Since L3 design calls for density optimization, it does not seem to be worth
>it to trade off area/transistor-count/power to keep latency at the same 12 cycles.
All that is fine and dandy, but irrelevant to the question I asked, which was whether L3 latency had scaled rather less than linearly with clock rate. AFAICT the answer is a simple 'yes', though still qualified by my impression that the L3 latency specs for McKinley gave a 12 - 15 cycle range whereas those for Madison seem to give a flat 14 cycle figure.
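
For concreteness, here is the back-of-the-envelope arithmetic behind that 'yes', using the 12-cycle @ 1 GHz and 14-cycle @ 1.5 GHz figures quoted above (treat it as illustrative only, since the McKinley figure may sit anywhere in the 12 - 15 cycle range):

```python
# Quick check: did L3 latency scale linearly with clock rate?
# Figures taken from the discussion above; McKinley may be 12-15 cycles.

mckinley_cycles, mckinley_ghz = 12, 1.0   # McKinley L3 latency (cycles), clock (GHz)
madison_cycles,  madison_ghz  = 14, 1.5   # Madison L3 latency (cycles), clock (GHz)

mckinley_ns = mckinley_cycles / mckinley_ghz   # 12.0 ns
madison_ns  = madison_cycles / madison_ghz     # ~9.33 ns
linear_ns   = mckinley_cycles / madison_ghz    # 8.0 ns if cycle count had stayed at 12

print(f"McKinley L3: {mckinley_ns:.2f} ns")
print(f"Madison  L3: {madison_ns:.2f} ns (linear scaling would give {linear_ns:.2f} ns)")
print(f"Absolute speedup: {mckinley_ns / madison_ns:.2f}x vs. {madison_ghz / mckinley_ghz:.2f}x clock")
```

So the L3 did get faster in absolute terms (roughly 1.29x), but by less than the 1.5x clock increase, which is all 'sub-linear scaling' means here.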
But we'll see soon enough just how linearly Madison performance scales with clock rate on various benchmarks. It seemed to do pretty well on SPECweb99_SSL, though it was aided by the use of a newer version of Zeus (and my impression, from comments by someone who should know, is that that often makes a non-negligible difference). TPC-C scaling was less linear. Unless significant compiler advances have occurred since McKinley's SPECint scores were posted, I'm beginning to suspect that Madison at 1.5 GHz will have difficulty getting much above 1200, despite the doubling in L3 cache size.
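
To put that scaling question in concrete terms, here is a deliberately crude stall-on-miss sketch. Every number in it is a made-up round figure chosen for illustration rather than anything measured on Itanium 2; the point is only the shape of the effect: flat memory latency drags clock scaling below linear, and the doubled L3 claws some of that back.

```python
# Toy model (illustrative only; all parameters are made-up round figures,
# not measured Itanium 2 data): why 1.5x clock + 2x L3 needn't give 1.5x performance.

def runtime_per_inst(clock_ghz, cpi_core, l3_misses_per_inst, mem_latency_ns):
    """Seconds per instruction for a crude stall-on-miss model."""
    mem_latency_cycles = mem_latency_ns * clock_ghz   # same DRAM => more cycles at a higher clock
    cpi = cpi_core + l3_misses_per_inst * mem_latency_cycles
    return cpi / (clock_ghz * 1e9)

MEM_NS   = 150.0               # assumed flat memory latency (unchanged across the two parts)
CPI_CORE = 0.8                 # assumed core CPI with on-chip cache hits folded in
MPI_3MB  = 0.004               # assumed L3 misses per instruction with a 3MB cache
MPI_6MB  = MPI_3MB / 2 ** 0.5  # rule-of-thumb: miss rate ~ 1/sqrt(cache size)

base          = runtime_per_inst(1.0, CPI_CORE, MPI_3MB, MEM_NS)  # McKinley-like baseline
faster_only   = runtime_per_inst(1.5, CPI_CORE, MPI_3MB, MEM_NS)  # 1.5 GHz, same 3MB L3
faster_bigger = runtime_per_inst(1.5, CPI_CORE, MPI_6MB, MEM_NS)  # 1.5 GHz, 6MB L3

print(f"clock alone:   {base / faster_only:.2f}x")    # ~1.24x
print(f"clock + 2x L3: {base / faster_bigger:.2f}x")  # ~1.46x, still short of 1.50x
```

Under these made-up numbers the bigger L3 recovers most, but not all, of the gap to linear scaling; how close the real chip gets on SPECint is exactly what remains to be seen.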
- bill