Article: Parallelism at HotPar 2010
By: Michael S (already5chosen.delete@this.yahoo.com), August 6, 2010 8:33 am
Room: Moderated Discussions
Richard Cownie (tich@pobox.com) on 8/6/10 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 8/5/10 wrote:
>---------------------------
>
>>I prefer the same theory as Gabriele - Nehalem is optimized first and foremost
>>for throughput. Improvements in single-thread performance come either as by-products
>>of enhancements in system architecture (IMC, smart power management with turbo-boost)
>>or due to enhancements in micro-architecture that are orthogonal to latency-vs-throughput
>>trade offs (fast rep movsd, fast unaligned SIMD loads/stores, better loop detector).
>
>Ah, but *why* would you optimize for throughput ? Because
>you know you're targeting a process that will let you
>put 4 or 6 cores on a die rather than just 2. Isn't
>that what's driving the architectural decisions ?
>
>Also it seems to me that the biggest slowdown of per-clock
>single-thread performance in Nehalem is the increase
>in L1 latency from 3 cycles to 4 cycles.
Agree.
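For what it's worth, the 3-vs-4 cycle jump is easy to see with a pointer-chasing microbenchmark. A minimal sketch of the idea (my own, x86-only, timed with __rdtsc(); it assumes turbo and SpeedStep are off so TSC ticks roughly track core cycles, and it is not anybody's official methodology):

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define NODES 2048                     /* 2048 * 8 B = 16 KB, fits in a 32 KB L1D */
#define ITERS (100u * 1000u * 1000u)

int main(void)
{
    static void *chain[NODES];

    /* Build a circular chain of pointers; every load in the walk below
       depends on the result of the previous one, so the loop time is
       dominated by load-to-use latency, not bandwidth. */
    for (unsigned i = 0; i < NODES; i++)
        chain[i] = &chain[(i + 1) % NODES];

    void **p = (void **)&chain[0];

    unsigned long long start = __rdtsc();
    for (unsigned i = 0; i < ITERS; i++)
        p = (void **)*p;               /* dependent L1D load every iteration */
    unsigned long long stop = __rdtsc();

    /* Print p so the compiler cannot discard the chain walk. */
    printf("%p: ~%.2f cycles per L1D load\n",
           (void *)p, (double)(stop - start) / ITERS);
    return 0;
}

On a 3-cycle L1D you would expect the printout to hover near 3; on Nehalem, near 4 (plus a small constant for the loop overhead, which the dependent chain mostly hides).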
> I don't see an
>obvious reason why that helps throughput; my suspicion
>is that it came from detailed experiments on cache design
>and latency and clock speeds for the target process
>(i.e. 32nm for Nehalem), showing that a 3-cycle L1 cache
>in 32nm would limit clockspeed. So they moved to 4-cycle
>L1 to avoid that bottleneck.
>
Maybe a 4-cycle L1D cache works at a lower voltage?
45 nm desktop C2Ds have a minimum VID of 850 mV; 45 nm desktop Nehalems, 800 mV; 45 nm server Nehalems, 750 mV.
Lower voltage can be seen as an optimization for throughput, since it allows fitting more cores into the same TDP.
It could also have something to do with Nehalem's increased physical address space, which by itself is not related to throughput, but it is certainly an optimization for servers, and that is closely related to optimizing for throughput.
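To put rough numbers on the voltage point above (my arithmetic, dynamic power only, leakage and any clock differences ignored): dynamic power goes roughly as C*V^2*f, so at a fixed frequency the per-core power ratio is just the square of the voltage ratio.

#include <stdio.h>

int main(void)
{
    /* Minimum VID figures quoted above, treated as the operating point. */
    const double v_c2d = 0.850;   /* 45 nm desktop Core 2, volts */
    const double v_nhm = 0.750;   /* 45 nm server Nehalem, volts */

    /* Dynamic power ~ C * V^2 * f; equal frequency and capacitance assumed. */
    double power_ratio = (v_nhm * v_nhm) / (v_c2d * v_c2d);

    printf("relative dynamic power per core: %.2f\n", power_ratio);      /* ~0.78 */
    printf("cores per fixed TDP, relative:   %.2f\n", 1.0 / power_ratio); /* ~1.28 */
    return 0;
}

So the 850 mV -> 750 mV drop alone is worth on the order of a quarter more cores in the same TDP, and leakage, which this crude estimate ignores, also falls with lower voltage.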