Article: AMD's Mobile Strategy
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), December 16, 2011 10:45 am
Room: Moderated Discussions
Linus Torvalds (torvalds@linux-foundation.org) on 12/16/11 wrote:
---------------------------
[snip]
>Did anybody ever figure out the logic behind sharing a
>decoder for two cores? That's just crazy. It's SMT without
>most of the area advantages. I bet two simpler decoders
>would have been way more efficient. Backed up by the fact
>that IPC has actually gone down for AMD.
---------------------------
For a speed demon, one wants a small L1 Dcache to keep
access latency low at a high clock frequency. With SMT
sharing, two threads competing for that small cache would
likely increase the miss rate too much. Therefore, one
wants a separate L1 Dcache for each thread. However, AGUs
and even ALUs want to be tightly tied to the L1 Dcache, so
grouping them with it into separate per-thread clusters
makes sense.
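
(A minimal back-of-the-envelope sketch of that tradeoff in
Python, assuming the classic power-law miss-rate model in
which miss rate falls roughly as size^-0.5; the constant,
the exponent, and the 16KB L1 size are illustrative
assumptions, not measured data.)

# Power-law cache model: smaller cache -> higher miss rate.
# Constant and exponent are illustrative assumptions.
def miss_rate(size_kb, c=0.10, alpha=0.5):
    return c * size_kb ** -alpha

L1_KB = 16  # small, speed-demon-friendly L1 Dcache

# SMT: two threads compete for one L1 Dcache, so each
# thread effectively sees about half of the capacity.
smt_per_thread = miss_rate(L1_KB / 2)

# Module: a separate (small) L1 Dcache per thread.
module_per_thread = miss_rate(L1_KB)

print(f"Shared L1 (SMT): {smt_per_thread:.2%} per thread")
print(f"Separate L1s   : {module_per_thread:.2%} per thread")
print(f"Sharing inflates misses by "
      f"{smt_per_thread / module_per_thread - 1:.0%}")

With that model, halving each thread's effective capacity
inflates its miss rate by about 41%, which is the kind of
penalty separate per-thread Dcaches avoid.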
Sharing the decoders (as well as the SIMD units, which
would tend to be more tolerant of Dcache latency) under
module-based multithreading makes roughly as much sense as
doing so under SMT. The decoders and microcode store
probably account for a significant fraction of the core's
area and power. (Locally sharing special-case features
that are only moderately used or that are latency critical
could also make sense.)
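
(A trivial area-accounting sketch of the same point; the
15% decode/microcode fraction is an assumed, illustrative
number, not a die-area measurement.)

# Assumed fraction of one core's area spent on decode plus
# the microcode store (illustrative, not measured).
DECODE_FRACTION = 0.15

two_full_cores = 2.0            # everything duplicated
module = 2.0 - DECODE_FRACTION  # one decode/ucode block serves both cores

saving = 1 - module / two_full_cores
print(f"Module area: {module:.2f} cores ({saving:.1%} saved)")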
Choosing a speed demon microarchitecture seems to be the
main mistake, and it may have exacerbated the cache
latency issues.