Article: AMD's Mobile Strategy
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), January 4, 2012 7:01 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 1/4/12 wrote:
---------------------------
>Paul A. Clayton (paaronclayton@gmail.com) on 1/4/12 wrote:
>---------------------------
[snip using nested page table for tagging]
>What happens if you want to use virtualization then?
Several options would be available. One option would be
changing the VMM to handle replacement itself or to use
paravirtualization over the firmware. Another option
would be to treat the VMM as a mere guest.
AMD probably could not afford either option: it lacks the
market power to push VMM providers to make changes, and it
probably could not accept the loss in virtualization
performance (even though page table changes are not that
common).
[snip on-processor chip filter]
>That would require a ton of space, I think the coverage
>ratio is around 6X. So for a 256MB cache, you'd need a
>42MB array for snoop filter entries.
Full tagging for a 256 MiB, 16-way, 64 B-block cache of a
64 GiB memory (one socket, possibly one memory channel)
would only require 6 MiB (4 Mi entries * 12 bits per
entry), so a filter should have much smaller requirements.
[snip]
>I don't think you can fit the tags and snoop filter in the
>L3, which is a real problem for latency and power.
One might move some of the snoop filter onto the L4 chip.
This might be especially appropriate if that chip also
provides coherence links to remote sockets.
[snip]
>IBM has moved away from that entirely on their big systems
>and moved the memory controllers on-die, while still
>having external caches. I suspect there's a reason for
>that.
I thought IBM used on-board buffer chips. This would be
roughly comparable to integrating the buffer chips into
the MCM.
My guess would be that the latency advantage of more direct
access to memory trumped the bandwidth advantage. POWER5
also used on-chip tags, IIRC (which was practical given
IBM's more finely targeted market), so having separate
interfaces might not be so horrible. I think POWER chips
also have higher pin counts than x86 server chips, so the
balance point of adding more pins versus sacrificing some
latency might be very different.
[snip]
>I was thinking about the different cache sizes (for
>server and desktop, and the no-L4 variant).
A trivial method of supporting two cache size variants
would be to provide different block sizes for the desktop
and server variants--assuming tags were on the processor
chip, which may not be desirable anyway.
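As a minimal sketch of that arithmetic (the 128 B server
block size is an assumed example, not a stated design
point), one fixed on-processor tag array covers twice the
data with twice the block size:

    # One fixed tag array (4 Mi entries) covering two L4
    # sizes purely by varying the block size (hypothetical).
    entries = 4 * 2**20
    for variant, block_bytes in [("desktop", 64), ("server", 128)]:
        mib = entries * block_bytes // 2**20
        print(f"{variant}: {mib} MiB L4 with {block_bytes} B blocks")
    # desktop: 256 MiB L4 with 64 B blocks
    # server: 512 MiB L4 with 128 B blocks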
>>Move it off-chip, reuse page tables, or use L3 data space.
>
>Not enough L3 space for tags and snoop filtering.
It might be practical and even desirable to move some of
the snoop filter into the L4 chip. E.g., a coarse-grained
filter (which might be much smaller and yet filter out many
requests) might fit on the processor chip, with the L4 chip
providing more precise filtering.
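A minimal sketch of that two-level idea (the structures and
the 4 KiB region size are assumptions for illustration, not
any shipping design):

    # Coarse region filter on the processor chip in front of
    # a precise per-block filter kept with the L4 chip.
    REGION_BYTES = 4 * 2**10   # assumed 4 KiB coarse-filter regions
    BLOCK_BYTES  = 64

    coarse_regions = set()     # regions that *may* hold cached blocks
    precise_blocks = set()     # exact cached block addresses (on L4 chip)

    def track_fill(addr):
        coarse_regions.add(addr // REGION_BYTES)
        precise_blocks.add(addr // BLOCK_BYTES)

    def snoop_hits(addr):
        # Most remote requests are rejected on-chip, never
        # paying the latency/power of the off-chip check.
        if addr // REGION_BYTES not in coarse_regions:
            return False       # filtered on the processor chip
        return addr // BLOCK_BYTES in precise_blocks  # off-chip check

    track_fill(0x1234_5678)
    print(snoop_hits(0x1234_5678), snoop_hits(0x9000_0000))  # True False

(A real coarse filter would also need a way to clear stale
regions, e.g., per-region counters; sets keep the sketch
simple.)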
>>Use L4 pins for accessing memory as well.
>
>But then you need another set of pins to get to DRAM,
>which adds more power (and latency).
This should not add more power than an on-board buffer
chip and might actually reduce typical power use even
relative to a similarly integrated buffer chip by avoiding
some memory accesses (and some precharge/activate events in
DRAMs). This power use would also be more broadly
distributed spatially.
[snip]
>Yeah, I think DRAM makes more sense too.
Some degree of hybridization might be sensible (e.g.,
partial or full tags in SRAM), but that is just a guess.
[snip]
>There's also the design resources. Why would you work on
>such a niche product when you could improve future
>Bulldozer or Bobcat cores and help all their products?
If a tightly integrated memory is inevitable, it might be
worthwhile to sacrifice some early benefit for a more
timely introduction of the next step in integration.
Of course, this also adds more risk--which is bad for AMD
since it does not have the capital reserves to get through
more lean years.
On the slightly positive side, the design effort might be
mostly independent; i.e., little complexity would be added
to other parts of the design.
>>If the L4 was purely local (which might not be bad for
>>some workloads on 4+ socket systems [e.g., virtualization
>>workloads?] and might be quite decent for most 2 socket
>>system workloads), no snoop filter extension would be
>>necessary.
>
>How do you manage coherency?
I may need to think about this more deeply, but I was
assuming that a local-only L4 would act as a memory-side
cache: logically part of the local memory it fronts, it
would simply look like statistically faster memory rather
than an additional coherence point.
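A sketch of that assumed behavior (names are illustrative):
a memory-side L4 is probed only on the path to local DRAM,
so the coherence protocol sees exactly the requests it
would see without it:

    # Memory-side, local-only L4: fronts local DRAM only and
    # is invisible to the coherence protocol (hypothetical).
    l4 = {}   # block address -> data

    def read_local_memory(addr, dram):
        if addr in l4:
            return l4[addr]    # hit: statistically faster "memory"
        data = dram[addr]      # miss: ordinary DRAM access
        l4[addr] = data        # fill on the memory side
        return data
    # Writes reaching local memory would update or invalidate
    # the L4 entry on the same path.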
[snip]
>Yes, I agree with you there.
Agreeing that AMD cannot afford the high fixed cost of
developing MCM technology? Or also that AMD would benefit
from a differentiating factor?
[snip]
>What do you mean by heterogeneous multi-core? Like ARM's
>big.LITTLE? Or special purpose accelerators, or what?
I was thinking more like big.LITTLE, but not restricting
the use of the weak cores to low-power modes (ARM has
indicated that simultaneous use will be supported, but the
emphasis has been on alternating use). Some degree of
variety in microarchitecture beyond power efficiency might
be useful, but even limited variation around two basic
microarchitectures might be too expensive in design
resources for AMD. However, AMD probably could afford to
maintain two core designs, perhaps with a 3:2 ratio of
small to large cores. (It might even be practical to
reserve one small core as a service processor in some
server systems.)
While special purpose accelerators have some efficiency
attraction, they are more difficult to justify for general
purpose systems. Some cryptographic and compression/
decompression acceleration might be sufficiently universal
to justify for all systems above the low end. I suspect
that integrating some chipset features may have a
justifiably higher priority. Anyway, I was only thinking
about same-ISA cores.