Article: AMD's Mobile Strategy
By: David Kanter (dkanter.delete@this.realworldtech.com), January 4, 2012 3:59 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton@gmail.com) on 1/4/12 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 1/3/12 wrote:
>---------------------------
>[snip]
>>The economics don't make sense. As I mentioned in another
>>post, that means you need:
>>
>>1. An L4 cache controller
>
>This problem could be addressed by mapping the L4 cache as
>a region of memory and using the page table as tags. With
>nested page tables this could probably even be made
>transparent to the OS if desired. With a 256 MiB L4, 4 KiB
>blocks might not be horrible. The software managing the
>cache might run on one reserved thread of a modest-
>performance MT core.
What happens if you want to use virtualization then?
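Setting that aside, the bookkeeping for the page-granularity scheme is at least cheap. A rough sketch (the 8-bytes-per-block metadata figure is my assumption, not from your post):

```python
# Back-of-envelope for the page-table-as-tags idea.
L4_BYTES = 256 * 2**20      # 256 MiB L4
BLOCK = 4 * 2**10           # 4 KiB blocks (one page per block)

blocks = L4_BYTES // BLOCK  # number of blocks to manage in software
print(blocks)               # 65536

# Assuming ~8 bytes of bookkeeping per block (free list, LRU state, etc.),
# the management structures are tiny:
print(blocks * 8 / 2**10, "KiB of metadata")  # 512.0 KiB of metadata
```

So the metadata is trivial; the real costs are the 4 KiB transfer granularity and the software management latency.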
>A perhaps more attractive option could be to include the
>cache controller with the cache memory. To allow remote
>memory to be cached in L4, something like a snoop filter
>might be used to filter accesses to the L4.
That would require a ton of space; I think the coverage ratio is around 6X, so for a 256MB cache you'd need roughly a 42MB array for snoop filter entries.
That's already more cache than in the L3.
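For reference, that 42MB figure falls out of a quick calculation (the 64B line size and ~14-bit entry size are assumptions on my part):

```python
# Rough reproduction of the 42MB snoop-filter estimate.
CACHE_BYTES = 256 * 2**20   # 256 MB L4
LINE = 64                   # 64B cache lines (assumption)
COVERAGE = 6                # filter tracks ~6x the cached lines

lines = CACHE_BYTES // LINE          # 4M lines in the L4
entries = lines * COVERAGE           # 24M filter entries
ENTRY_BITS = 14                      # tag + state bits per entry (assumption)

filter_mb = entries * ENTRY_BITS / 8 / 2**20
print(round(filter_mb), "MB of snoop filter storage")  # 42 MB
```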
>It might also be possible to use a section of L3 cache for
>L4 tags rather than L3 data. If the overhead of snoop
>filter logic is acceptable, then the overhead of handling
>tag matching et al. is likely to be acceptable.
I don't think you can fit the tags and snoop filter in the L3, which is a real problem for latency and power.
>>2. Pins to connect to the L4 cache, with sufficient
>>bandwidth to handle snooping traffic in servers
>
>If all versions used an auxiliary chip in a MCM, the
>normal memory access path could pass through the
>auxiliary chip. This could reduce die area on the
>processor chip (less power to drive MCM-internal
>communication) and even increase available bandwidth.
IBM has moved away from that entirely on their big systems and moved the memory controllers on-die, while still having external caches. I suspect there's a reason for that.
>>The controller and pins use up substantial extra die area and power. So are you going to have a separate CPU die for
>>low-end desktops (no L4), high-end desktops (small L4) and
>>servers (big L4)?
>
>If low-end desktops use a different (Bobcat-like)
>processor, then having the middle-and-above processors use
>an interface to an auxiliary chip would not seem to be a
>major problem.
>
>>If so, now your validation is 3X worse because you have 3
>>models. And each separate die will need masks (probably
>>$1-2M/each).
>
>Even with three different mask sets, this would not
>increase validation costs three fold.
I was thinking about the different cache sizes (for server and desktop, and the no-L4 variant).
>>If you use the same die, then you are wasting significant
>>power and area on the high volume (low-end desktops), to
>>improve things for relatively low volume parts (high-end
>>desktop and server).
>
>As noted above, the die area costs might be reduced even if
>a single mask set was used for L4 and no-L4 versions. For
>single-socket systems, a local-only L4 might not be
>horrible.
>
>>How many of these do you think AMD can sell, and how much
>>do you think they can increase their prices by?
>
>Unfortunately, I do not think AMD is positioned to lead in
>the re-introduction of off-chip cache. The learning curve
>for MCM fabrication might excessively penalize their
>smaller production volume.
>
>>Let's just recap the costs:
>>
>>1. More die area for L4 controller
>
>Move it off-chip, reuse page tables, or use L3 data space.
Not enough L3 space for tags and snoop filtering.
>>2. More die area for pins
>
>Use L4 pins for accessing memory as well.
But then you need another set of pins to get to DRAM, which adds more power (and latency).
>>3. More validation for different models
>
>Minimize variation and limit risk by multipurposing (e.g.,
>using L3 data space for tags might allow a broken or
>unvalidated version to ship [like the Pentium4's
>hyperthreading?]).
>
>>4. External SRAM chips (how much would a 32MB SRAM cost?)
>
>It is not clear that pure SRAM would be better than DRAM or
>a DRAM/SRAM hybrid in terms of capacity and power
>efficiency.
Yeah, I think DRAM makes more sense too.
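A rough density comparison shows why (the cell areas in F^2 are ballpark assumptions, and vary by process):

```python
# Why DRAM looks better for a big L4: rough bit-density comparison.
# Cell areas in F^2 (feature-size-squared) are ballpark assumptions.
SRAM_CELL_F2 = 120   # ~6T SRAM cell
DRAM_CELL_F2 = 6     # ~1T1C DRAM cell

density_advantage = SRAM_CELL_F2 / DRAM_CELL_F2
print(density_advantage, "x more bits per unit area for DRAM")  # 20.0
```

At an order-of-magnitude density advantage, a DRAM L4 gets you far more capacity per die, at the cost of refresh and higher access latency.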
>>5. More complex packaging, lower total yields
>
>I think this factor kills such for AMD. I think the
>latency and bandwidth walls are going to require a tightly
>integrated off-chip cache with memory pass-through (for
>better bandwidth and perhaps better power distribution--
>the higher power interfaces to DIMMs could be farther
>separated from the power-intensive processing elements).
There's also the question of design resources. Why would you work on such a niche product when you could improve future Bulldozer or Bobcat cores and help all of their products?
>>6. Need to design a new snoop filter to deal with larger
>>cache sizes in servers
>
>A new snoop filter design will be needed anyway, right?
>
>If the L4 was purely local (which might not be bad for
>some workloads on 4+ socket systems [e.g., virtualization
>workloads?] and might be quite decent for most 2 socket
>system workloads), no snoop filter extension would be
>necessary.
How do you manage coherency?
>It might also be practical to include additional coherence
>and I/O interfaces on the L4 chips, facilitating a larger
>socket count and/or more aggressive I/O.
>
>>Those costs are pretty significant.
>>
>>I'm also very skeptical that the additional performance
>>will be enough to raise ASPs higher than the extra costs.
>In other words, I suspect it would make AMD less profitable.
>
>While I agree with the idea that AMD would benefit from a
>differentiating factor (Bobcat--a better low-end processor
>than Atom--and Fusion--exploiting GPU expertise--seem to be
>good steps toward such a strategy), I doubt AMD could
>overcome the costs to develop cost-effective MCM
>technology.
Yes, I agree with you there.
>A more attractive alternative (which Intel seems to be
>ignoring) would seem to be heterogeneous multicore. AMD
>might also be able to exploit the fact that it now uses
>generally available fabrication to sell hard core designs
>or possibly other smaller design elements, though I tend to
>doubt that much market exists for most of AMD's designs.
What do you mean by heterogeneous multi-core? Like ARM's big.little? Or special purpose accelerators, or what?
David