Article: Parallelism at HotPar 2010
By: Mark Roulo (nothanks.delete@this.xxx.com), August 6, 2010 8:36 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/6/10 wrote:
---------------------------
>Also it seems to me that the biggest slowdown of per-clock
>single-thread performance in Nehalem is the increase
>in L1 latency from 3 cycles to 4 cycles. I don't see an
>obvious reason why that helps throughput; my suspicion
>is that it came from detailed experiments on cache design
>and latency and clock speeds for the target process
>(i.e. 32nm for Nehalem), showing that an 3-cycle L1 cache
>in 32nm would limit clockspeed. So they moved to 4-cycle
>L1 to avoid that bottleneck.
>
---------------------------
Wouldn't one reason to increase the L1 cache latency be to enable much higher clock speeds if necessary? I've wondered whether the 3->4 cycle change was a hedge against needing to crank the clock for single-threaded performance. If the cost to the current parts was low, that hedge might make a lot of sense.
But I don't know enough about how clock speed and cache latency interact.
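To put a rough number on that intuition, here is a toy Python sketch of the trade-off. The model and every number in it are my own illustrative assumptions, not anything published by Intel: if some fraction of loads can't have the extra L1 cycle hidden by the out-of-order machinery, CPI goes up by roughly that fraction, and the clock has to rise by the same ratio just to break even on single-threaded time.

    # Back-of-envelope sketch (toy model, illustrative numbers only).
    # Assumption: a fraction 'exposed_loads' of all instructions are loads
    # whose extra load-to-use cycle is NOT hidden by out-of-order execution,
    # so one added L1 cycle raises CPI by that fraction.

    def breakeven_clock_gain(base_cpi, exposed_loads, extra_cycles=1):
        """Clock-speed ratio needed to offset the added L1 latency."""
        new_cpi = base_cpi + exposed_loads * extra_cycles
        return new_cpi / base_cpi

    # Made-up inputs: ~1/3 of instructions are loads, and the OoO engine
    # hides the extra cycle for all but ~20% of them.
    base_cpi = 1.0
    exposed_loads = 0.33 * 0.20
    print(breakeven_clock_gain(base_cpi, exposed_loads))  # ~1.066, i.e. ~6-7% more clock

With those made-up numbers the break-even point is only a handful of percent of extra clock, which is why the "cheap hedge" reading seems plausible to me - but the real answer depends on load-to-use exposure that I can only guess at.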
-Mark Roulo