A few thoughts on Ampere's Altra Max

By: Björn Ragnar Björnsson (bjorn.ragnar.delete@this.gmail.com), May 27, 2021 3:47 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on May 27, 2021 12:00 pm wrote:
> Since there are four times as many cores as snoop filter home nodes, I would have hoped for the possibility
> of sharing L2 capacity within a cluster of four cores. Such would have required substantial design
> effort (placement and victim selection is challenging, migration and replication might be even more
> difficult to manage well) and the benefit would only come when either cores are underutilized —
> a condition cloud vendors wish to avoid — and some threads benefit from larger L2 or when a workload
> has multiple active threads that share memory content or have imbalanced memory use. Even though
> nothing-shared is the cloud ideal, I suspect two active threads is not rare.
> It may be possible for locally shared memory to be assigned a common home node so that the snoop filter
> could quickly identify a hit in an L2 within the 4-core cluster. However, the straightforward, relatively
> transparent solution pads physical addresses by six bits (five if one node ID is reserved as only local
> [all accesses from the four cores in that node have a local home node] or as never locality-optimized
> [always use the non-extended address to find the home node]). Such seems expensive for likely little benefit,
> and managing home node assignments in "hardware" (possibly an invisible hypervisor).
> The modest System Level Cache could be useful for prefetching. With only eight memory channels and
> moderate per-core cache capacity, opportunistic prefetching might better utilize bandwidth and reduce
> latency (not only exploiting imbalanced utilization but possibly taking advantage of DRAM row-hits).
> Such could also be useful for writeback scheduling. (Both of these, especially writeback scheduling,
> imply "memory-side" caching.) As already mentioned, I/O caching (storage, network, and accelerator)
> could be very useful; in addition to reducing latency and memory bandwidth, such might modestly reduce
> coherence traffic by avoiding probes of the snoop filters when data is in the System Level Cache. I
> suspect Ampere is not doing anything extraordinarily clever in terms of bypassing (incoming and outgoing)
> and replacement policies, but the silence in the product brief on such (and on associativity) does
> not mean much as product briefs do not typically target those seeking such details.
> The use of the same connections for PCIe or the optional connection to another socket seems nice. Some might
> prefer more expensive two-socket-capable chips with equal per-socket I/O bandwidth, but for cloud operations
> I suspect sharing memory capacity is usually more important than I/O bandwidth. If HBM interfaces are sufficiently
> similar to PCIe, perhaps a future version could provide optional conversion of PCIe lanes to HBM channels for
> extra memory bandwidth (and capacity — in theory a buffer chip could pass commands and data out to other
> memory components, probably to a similarly narrow channel buffer chip to avoid expensive high-pad-count buffer
> chips). Being able to trade I/O bandwidth for shared memory capacity (two sockets) or additional memory capacity
> and (especially) bandwidth when choosing a motherboard seems potentially useful.
> As a multithreading bigot, I am disappointed that no multithreading is supported. With relatively lower
> performance (smaller) cores, more cores may be a better choice, especially with a more limited design budget.
> Such may also be more attractive to smaller customers as it avoids the pressure to tune yet one more knob
> (and restrict sharing with respect to side channels as well as communicate such complexities to customers).
> I think well-designed multithreading support provides useful configurability, but I also have affection
> for heterogeneous multiprocessors, non-uniform cache access, and other quirky features.

Looking at the data sheets for the Q80-33 and the Altra Max, it seems clear that the Max is basically the same animal as the Q80-33. The tradeoffs are more cores in exchange for lower clock speed and less cache, and conceivably a slightly larger die for the Max, although I've seen nothing to indicate that. The Q80-33 is certainly impressive, but its sweet spot is slightly different from the Max's.
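
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. The figures are assumptions on my part, not quotes from the data sheets: 80 cores at 3.3 GHz with 32 MB of System Level Cache for the Q80-33, and 128 cores at 3.0 GHz with 16 MB for the Max (the part name M128-30 is also assumed). The clocks are taken as rated maximums, so the aggregate "core-GHz" figure is only a crude upper bound on throughput, not a benchmark.

```python
# Assumed figures for illustration only; check them against the data sheets.
PARTS = {
    "Q80-33": {"cores": 80, "ghz": 3.3, "slc_mb": 32},
    "M128-30": {"cores": 128, "ghz": 3.0, "slc_mb": 16},
}


def summarize(part):
    """Return (SLC KiB per core, aggregate core-GHz) for one part."""
    slc_kib_per_core = part["slc_mb"] * 1024 / part["cores"]
    core_ghz = part["cores"] * part["ghz"]
    return slc_kib_per_core, core_ghz


for name, part in PARTS.items():
    kib, total = summarize(part)
    print(f"{name}: {kib:.0f} KiB SLC per core, {total:.0f} core-GHz aggregate")
```

If those assumptions hold, the Max gives up well over half of the shared cache per core in return for more aggregate clock cycles, which is why the two parts should suit somewhat different workloads.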

As such, the Altra Max is a blip compared to the impressive show the Q80-33 has put on. The Max's forte is, at a guess, providing additional flexibility for cloud services, provided their administration procedures and software are up to fine-tuning such things.