A few thoughts on Ampere's Altra Max

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 27, 2021 12:00 pm
Room: Moderated Discussions

Since there are four times as many cores as snoop filter home nodes, I would have hoped for the possibility of sharing L2 capacity within a cluster of four cores. Such would have required substantial design effort (placement and victim selection are challenging; migration and replication might be even more difficult to manage well), and the benefit would come only in two cases: when cores are underutilized (a condition cloud vendors wish to avoid) and some threads benefit from a larger L2, or when a workload has multiple active threads that share memory content or have imbalanced memory use. Even though nothing-shared is the cloud ideal, I suspect a workload with two such active threads is not rare.

It may be possible for locally shared memory to be assigned a common home node so that the snoop filter could quickly identify a hit in an L2 within the 4-core cluster. However, the straightforward, relatively transparent solution pads physical addresses by six bits (five if one node ID is reserved as only local [all accesses from the four cores in that node have a local home node] or as never locality-optimized [always use the non-extended address to find the home node]). Such seems expensive for likely little benefit, and home node assignments would still have to be managed in "hardware" (possibly an invisible hypervisor).
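
To make the bit counts concrete, here is a rough sketch of the two encodings, assuming 128 cores in four-core clusters (so 32 home nodes, hence a 5-bit node ID). The 48-bit physical address width, the modulo hash used as the default home-node function, and all function names are assumptions made up for illustration; nothing here reflects how Ampere actually maps addresses to home nodes.

```python
CORES = 128
CORES_PER_HOME_NODE = 4                        # four cores per snoop filter home node
HOME_NODES = CORES // CORES_PER_HOME_NODE      # 32
NODE_ID_BITS = (HOME_NODES - 1).bit_length()   # 5 bits to name one of 32 nodes

PA_BITS = 48      # assumed physical address width (hypothetical)
LINE_BITS = 6     # assumed 64-byte cache lines
PA_MASK = (1 << PA_BITS) - 1


def default_home(pa):
    """Stand-in for the normal address hash (hypothetical): line address mod node count."""
    return (pa >> LINE_BITS) % HOME_NODES


def home_six_bit(ext_pa):
    """Six extra bits: one 'override' flag plus a 5-bit home node ID."""
    ext = ext_pa >> PA_BITS
    override = (ext >> NODE_ID_BITS) & 1
    node = ext & (HOME_NODES - 1)
    return node if override else default_home(ext_pa & PA_MASK)


RESERVED_ID = 0   # assumed choice of the reserved node ID


def home_five_bit(ext_pa):
    """Five extra bits: the reserved ID means 'never locality-optimized'."""
    node = (ext_pa >> PA_BITS) & (HOME_NODES - 1)
    if node == RESERVED_ID:
        return default_home(ext_pa & PA_MASK)
    return node


if __name__ == "__main__":
    pa = 0x1040
    pinned6 = (((1 << NODE_ID_BITS) | 7) << PA_BITS) | pa
    pinned5 = (7 << PA_BITS) | pa
    print(NODE_ID_BITS + 1, home_six_bit(pinned6), home_six_bit(pa))
    print(NODE_ID_BITS, home_five_bit(pinned5), home_five_bit(pa))
```

In the five-bit variant the node whose ID is reserved can never be named as an explicit home, which is the trade-off described in the bracketed alternatives above.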

The modest System Level Cache could be useful for prefetching. With only eight memory channels and moderate per-core cache capacity, opportunistic prefetching might better utilize bandwidth and reduce latency (not only exploiting imbalanced utilization but possibly taking advantage of DRAM row-hits). Such could also be useful for writeback scheduling. (Both of these, especially writeback scheduling, imply "memory-side" caching.) As already mentioned, I/O caching (storage, network, and accelerator) could be very useful; in addition to reducing latency and memory bandwidth, such might modestly reduce coherence traffic by avoiding probes of the snoop filters when data is in the System Level Cache. I suspect Ampere is not doing anything extraordinarily clever in terms of bypassing (incoming and outgoing) and replacement policies, but the silence in the product brief on such (and on associativity) does not mean much as product briefs do not typically target those seeking such details.
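
A back-of-the-envelope calculation shows why the System Level Cache counts as modest and why memory bandwidth is worth scheduling carefully. The 16 MiB capacity and 128 cores are the figures under discussion in this thread, and eight channels is stated above. DDR4-3200 with 64-bit channels (peak rates, not sustained) is an assumption about the memory configuration, and the function names are made up for this sketch.

```python
CORES = 128
SLC_BYTES = 16 * 2**20       # 16 MiB System Level Cache
CHANNELS = 8                 # memory channels

# Assumptions (not from the post): DDR4-3200, 64-bit (8-byte) data bus per channel.
MEGATRANSFERS_PER_S = 3200
BYTES_PER_TRANSFER = 8


def slc_kib_per_core():
    """System Level Cache capacity per core, in KiB."""
    return SLC_BYTES / CORES / 1024


def peak_bandwidth_gb_s():
    """Theoretical peak DRAM bandwidth across all channels, in GB/s."""
    return CHANNELS * MEGATRANSFERS_PER_S * BYTES_PER_TRANSFER / 1000


def peak_bandwidth_per_core_gb_s():
    """Peak DRAM bandwidth divided evenly among all cores, in GB/s."""
    return peak_bandwidth_gb_s() / CORES


if __name__ == "__main__":
    print(f"SLC per core: {slc_kib_per_core():.0f} KiB")
    print(f"Peak DRAM bandwidth: {peak_bandwidth_gb_s():.1f} GB/s")
    print(f"Per core: {peak_bandwidth_per_core_gb_s():.2f} GB/s")
```

Under those assumptions each core gets roughly 128 KiB of System Level Cache and well under 2 GB/s of peak DRAM bandwidth when all cores are active, which is the sense in which both resources are tight.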

The use of the same connections for PCIe or the optional connection to another socket seems nice. Some might prefer more expensive two-socket-capable chips with equal per-socket I/O bandwidth, but for cloud operations I suspect sharing memory capacity is usually more important than I/O bandwidth. If HBM interfaces are sufficiently similar to PCIe, perhaps a future version could provide optional conversion of PCIe lanes to HBM channels for extra memory bandwidth (and capacity — in theory a buffer chip could pass commands and data out to other memory components, probably to a similarly narrow channel buffer chip to avoid expensive high-pad-count buffer chips). Being able to trade I/O bandwidth for shared memory capacity (two sockets) or additional memory capacity and (especially) bandwidth when choosing a motherboard seems potentially useful.

As a multithreading bigot, I am disappointed that no multithreading is supported. With relatively lower performance (smaller) cores, more cores may be a better choice, especially with a more limited design budget. Such may also be more attractive to smaller customers as it avoids the pressure to tune yet one more knob (and restrict sharing with respect to side channels as well as communicate such complexities to customers). I think well-designed multithreading support provides useful configurability, but I also have affection for heterogeneous multiprocessors, non-uniform cache access, and other quirky features.