Boring can be profitable

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 29, 2021 11:18 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on May 27, 2021 11:13 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on May 27, 2021 12:00 pm wrote:
>>
>> As a multithreading bigot, I am disappointed that no multithreading is supported. With relatively lower
>> performance (smaller) cores, more cores may be a better
>> choice, especially with a more limited design budget.
>> Such may also be more attractive to smaller customers as it avoids the pressure to tune yet one more knob
>> (and restrict sharing with respect to side channels as well as communicate such complexities to customers).
>> I think well-designed multithreading support provides useful configurability, but I also have affection
>> for heterogeneous multiprocessors, non-uniform cache access, and other quirky features.
>
> In an interview,
>
> https://www.nextplatform.com/2021/05/24/the-ampere-arm-server-chip-roadmap-may-lead-beyond-hyperscalers/
>
> they have said that a differentiating point against competition in their
> future custom cores will be better isolation between threads.

The product brief spoke of predictable performance, which I took to mean simplifying performance prediction (and reducing the extent and frequency of fast and slow tails). If sharing is limited to within a single protection domain (address space), the side-channel concerns with sharing do not seem relevant.

For performance predictability, sharing would be problematic. Even with relatively fine-grained allocations (like cache capacity), ordinary allocation policies can easily be sub-optimal with respect to performance. Associativity is more likely to be problematic, as conserving power might urge keeping total associativity constant (or nearly so), increasing conflict misses. (Better mapping methods, especially skewed associativity with cuckoo replacement, might make such conflicts insignificant.)
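To illustrate the mapping idea: in a skewed-associative cache each way indexes with a different hash, so addresses that conflict in one way usually land in different sets in another, and a cuckoo-style insert can relocate a displaced line to its alternate way. The sketch below is a toy model, not Ampere's design; the sizes and hash functions are invented for illustration.

```python
# Toy skewed-associative cache with cuckoo-style relocation.
# Parameters and hash functions are illustrative, not from any real design.

SETS = 16        # sets per way
WAYS = 2         # two skewed ways
MAX_KICKS = 4    # bound on the cuckoo relocation chain

def skew_hash(way, tag):
    # Each way uses a different index hash, so two tags that collide
    # in one way usually map to different sets in the other.
    if way == 0:
        return tag % SETS
    return ((tag >> 4) ^ tag) % SETS

class SkewedCache:
    def __init__(self):
        # cache[way][set] holds a tag or None
        self.cache = [[None] * SETS for _ in range(WAYS)]

    def lookup(self, tag):
        return any(self.cache[w][skew_hash(w, tag)] == tag
                   for w in range(WAYS))

    def insert(self, tag):
        cur = tag
        for kick in range(MAX_KICKS):
            # Prefer any empty candidate slot.
            for w in range(WAYS):
                s = skew_hash(w, cur)
                if self.cache[w][s] is None:
                    self.cache[w][s] = cur
                    return True
            # All candidates full: displace a victim (alternating ways)
            # and retry inserting the victim into its other location.
            w = kick % WAYS
            s = skew_hash(w, cur)
            self.cache[w][s], cur = cur, self.cache[w][s]
        return False  # relocation chain exhausted; victim is evicted
```

In a conventional 2-way cache with 16 sets, tags 0, 16, and 32 would all contend for set 0 and one would be evicted; here the second hash spreads them out, which is the conflict-miss reduction alluded to above.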

With eight memory channels shared among 128 cores, contention seems likely (an upgrade to DDR5 would help by providing two narrower channels per module channel). Bandwidth contention might matter more for reducing performance, but channel contention will also increase memory latency.
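The back-of-envelope ratios, using the nominal 128-core/8-channel configuration from the discussion and the general DDR5 property that each 64-bit channel is split into two independent 32-bit sub-channels:

```python
# Cores per independent memory channel (nominal figures from the
# discussion; not vendor-published contention data).
cores = 128
ddr4_channels = 8                 # eight DDR4 channels
ddr5_subchannels = 8 * 2          # DDR5: two 32-bit sub-channels per module channel

print(cores // ddr4_channels)     # cores contending per DDR4 channel
print(cores // ddr5_subchannels)  # cores contending per DDR5 sub-channel
```

Halving the cores per independent channel (16 down to 8) reduces the chance that any given access queues behind an unrelated workload's traffic, which is the latency point above.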

The snoop filters also seem to be shared across all cores (home nodes are selected based on address, I think). This could disrupt performance (two independent workloads, each with high sharing across its cores, would tend to degrade filter effectiveness compared to a workload with exclusive use of a snoop filter) and provide a side channel.
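Address-interleaved home-node selection would look something like the sketch below; this is my assumption about the mechanism, not a documented Ampere scheme, and the node count and hash are made up. The point is that every core's traffic for a given line goes to the same home node, so independent workloads end up sharing each filter's capacity.

```python
# Sketch of address-interleaved home-node selection (an assumed
# mechanism; NODES and the hash are hypothetical).

NODES = 32       # hypothetical number of snoop-filter home nodes

def home_node(phys_addr, line_bytes=64):
    # Hash the cache-line address so consecutive lines spread across
    # home nodes; all cores agree on the home for any given line.
    line = phys_addr // line_bytes
    return (line ^ (line >> 5)) % NODES
```

Because the mapping ignores which core (or tenant) issued the access, one workload's filter occupancy can displace another's entries, which is both the performance interference and the potential side channel noted above.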

This reminds me of Azul Systems' design for uniform memory access latency (across multiple sockets, if I recall correctly). While such can reduce worst-case latency, for uniformly distributed accesses average latency would probably be worse.

> So given this design target, it is clear that even in their future cores they will not use any kind
> of SMT and they will also not increase cache sharing in any way, but they might decrease it.

Or fine-grained MT or SoEMT.☺

> Sharing resources is good for improving the average throughput, but when security
> is valued above efficiency, then sharing has to be avoided, and this appears
> to be Ampere's choice, in order to not confront directly AMD.

I am not certain how much demand there is for predictable performance or for high security within a protection domain (I proposed MT and L2 sharing within a protection domain). Such might be marketable, but I am skeptical.

(I was under the vague impression that much of cloud service use was for flexible customer demand for online sales and for raw compute for moderate-scale problems where local compute would provide inadequate time-to-solution. Do either of those care about side channels? Online retail might care about predictable performance, but "spin up more instances" seems the typical solution to load handling. I would happily have my ignorance corrected.)

On the other hand, when all one's options are bad (which seems to be the case for independent ARM server vendors right now), doing something different may be less likely to fail. Just blame for failure would also be easier to avoid, since unusual choices are more difficult to analyze; unjust blame for failure is very hard to avoid.

As an outsider, I enjoy cleverness over marketability and profitability (even when I recognize that actual engineering includes such factors). A mesh of clusters, each with four independent, middling-performance cores, is not extraordinarily interesting; such is likely much less expensive to design than a more flexible system. (I suspect design effort is quite important, not only due to lower volume but also engineer availability (and even general difficulties of growing a design team quickly) and responsiveness to target changes.)

(Some optimization seems possible if one knows that there is substantial independence across groups of cores. If communication is almost always limited to a small number of cores, a better network topology and coherence system than mesh with distributed snoop filters seems likely. On the other hand, the uniformity of a mesh simplifies core allocation under variable sharing. Even so, it might be reasonable to divide each chip into four "nodes" each with two memory channels with slower access to 'remote' memory and possibly no coherence between nodes. Such might provide greater performance predictability than a global coherent mesh without excessively sacrificing flexibility of core allocation if most allocations are for a small number of cores. Being able to charge a few percent more for an all-in-one-cluster allocation — with capacity sharing and/or faster coherence — might compensate for infrequent underutilization — really, selling to spot users at a lower cost? — from bin-packing issues. However, such also complicates sales and resource management, which might be especially unattractive for smaller cloud service providers.)