Article: Parallelism at HotPar 2010
By: Kevin G (kevin.delete@this.cubitdesigns.com), August 7, 2010 10:14 am
Room: Moderated Discussions
Mark Roulo (nothanks@xxx.com) on 8/6/10 wrote:
---------------------------
>Richard Cownie (tich@pobox.com) on 8/6/10 wrote:
>---------------------------
>>Also it seems to me that the biggest slowdown of per-clock
>>single-thread performance in Nehalem is the increase
>>in L1 latency from 3 cycles to 4 cycles. I don't see an
>>obvious reason why that helps throughput; my suspicion
>>is that it came from detailed experiments on cache design
>>and latency and clock speeds for the target process
>>(i.e. 32nm for Nehalem), showing that an 3-cycle L1 cache
>>in 32nm would limit clockspeed. So they moved to 4-cycle
>>L1 to avoid that bottleneck.
>>
>
>Wouldn't one reason to increase the L1 cache latency be to enable much faster clock
>speeds if necessary? I've wondered if the 3->4 change was a hedge against needing
>to crank the clock for single threaded performance. If the cost for the current
>parts was low, this might make a lot of sense.
>
>But I don't know enough about how clock speed and cache latencies intersect.
>
>-Mark Roulo
The odd thing is that the Pentium 4 had a lower L1 cache latency even though it was designed for far higher clocks. In fairness, though, the L1 caches on the Pentium 4 were smaller.
My personal suspicion is that the increase in L1 latency on Nehalem is due to the addition of Hyperthreading and the extra logic needed to share the caches between two threads. Also, did Nehalem go from a 6-transistor SRAM to an 8-transistor SRAM design for the L1 cache?
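To put rough numbers on how clock speed and cache latency in cycles interact, here is a minimal Python sketch. It only converts between cycles and nanoseconds. The clock speeds and the 1.0 ns access time are made-up illustration values, not measurements of Nehalem or the Pentium 4, and the function names are my own.

```python
import math


def access_time_ns(cycles: int, clock_ghz: float) -> float:
    """Wall-clock time covered by `cycles` clock cycles at `clock_ghz`."""
    return cycles / clock_ghz


def cycles_needed(array_time_ns: float, clock_ghz: float) -> int:
    """Whole cycles needed to cover a fixed access time at a given clock.

    The small epsilon keeps floating-point noise from rounding an exact
    multiple up to the next cycle.
    """
    return math.ceil(array_time_ns * clock_ghz - 1e-9)


# Hypothetical numbers, chosen only to show the trade-off.
for ghz in (2.0, 3.0, 4.0):
    print(
        f"{ghz:.1f} GHz: 3 cycles = {access_time_ns(3, ghz):.2f} ns, "
        f"4 cycles = {access_time_ns(4, ghz):.2f} ns, "
        f"a 1.0 ns access needs {cycles_needed(1.0, ghz)} cycles"
    )
```

Under these assumptions, a cache that needs a fixed amount of real time per access fits in 3 cycles at 3 GHz but needs 4 cycles at 4 GHz, which is the hedge Mark describes. It says nothing about whether Hyperthreading or the SRAM cell design actually drove the change.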