Article: Parallelism at HotPar 2010
By: Richard Cownie (tich.delete@this.pobox.com), August 6, 2010 8:12 am
Room: Moderated Discussions
Michael S (already5chosen@yahoo.com) on 8/5/10 wrote:
---------------------------
>I prefer the same theory as Gabriele - Nehalem is optimized first and foremost
>for throughput. Improvements in single-thread performance come either as by-products
>of enhancements in system architecture (IMC, smart power management with turbo-boost)
>or due to enhancements in micro-architecture that are orthogonal to latency-vs-throughput
>trade offs (fast rep movsd, fast unaligned SIMD loads/stores, better loop detector).
Ah, but *why* would you optimize for throughput? Because
you know you're targeting a process that will let you
put 4 or 6 cores on a die rather than just 2. Isn't
that what's driving the architectural decisions?
Also it seems to me that the biggest slowdown of per-clock
single-thread performance in Nehalem is the increase
in L1 latency from 3 cycles to 4 cycles. I don't see an
obvious reason why that helps throughput; my suspicion
is that it came from detailed experiments on cache design,
latency, and clock speeds for the target process
(i.e. 45nm for Nehalem), showing that a 3-cycle L1 cache
on that process would limit clock speed. So they moved to
a 4-cycle L1 to avoid that bottleneck.
---------------------------
>I prefer the same theory as Gabriele - Nehalem is optimized first and foremost
>for throughput. Improvements in single-thread performance come either as by-products
>of enhancements in system architecture (IMC, smart power management with turbo-boost)
>or due to enhancements in micro-architecture that are orthogonal to latency-vs-throughput
>trade offs (fast rep movsd, fast unaligned SIMD loads/stores, better loop detector).
Ah, but *why* would you optimize for throughput ? Because
you know you're targeting a process that will let you
put 4 or 6 cores on a die rather than just 2. Isn't
that what's driving the architectural decisions ?
Also it seems to me that the biggest slowdown of per-clock
single-thread performance in Nehalem is the increase
in L1 latency from 3 cycles to 4 cycles. I don't see an
obvious reason why that helps throughput; my suspicion
is that it came from detailed experiments on cache design
and latency and clock speeds for the target process
(i.e. 32nm for Nehalem), showing that an 3-cycle L1 cache
in 32nm would limit clockspeed. So they moved to 4-cycle
L1 to avoid that bottleneck.