Modern cores

By: Maynard Handley, July 22, 2020 1:19 pm
Room: Moderated Discussions
Etienne on July 22, 2020 11:15 am wrote:
> Maynard Handley on July 22, 2020 10:03 am wrote:
> > I cannot recommend enough the paper,
> >
> > (Evolution of the Samsung Exynos CPU Microarchitecture)
> >
> > It's like those classic papers written at the end of the RISC era, detailing the state of OoO
> > just before the relevant companies like MIPS/SGI and Alpha/DEC became defunct. The value is in
> > clarifying various aspects of how current (acceptably, but not best of class) CPUs perform things
> > like indirect branch prediction, cache management, or prefetching; showing in particular the degree
> > of complexity present beyond simple statements like "performs strided prefetch".
> >
> Extract: For those reasons, the standalone prefetcher uses an algorithm
> to handle long, complex streams with larger training structures,
> and techniques to reuse learnings across 4KB physical page
> crossings.
> So all those measures on real code, and core improvements following these measures,
> would they be impacted by the fact Apple has chosen 16 Kbytes/pages at the OS level?

There's so much unknown that it's not clear that's a useful question.

Presumably pretty much everything Samsung describes, Apple does, and in a more effective (and more transistor-intensive...) fashion.
So consider the L1 prefetchers. Samsung says that they operate in virtual space and traverse across pages. If most of your prefetch win is from those L1 prefetchers, then 4kB vs 16kB pages may not matter much -- a minor effect on a minor effect.
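
To put a rough number on the page-size side of this, here is a minimal sketch that counts how often a simple sequential stream crosses a page boundary with 4 KiB and 16 KiB pages. It is only an illustration of the arithmetic, not a model of any Samsung or Apple prefetcher: the 64-byte line size, the 1 MiB stream length, and the function name `page_crossings` are all assumptions made for the example.

```python
def page_crossings(start, stride, count, page_size):
    """Count accesses in a strided stream that land in a different
    page than the previous access (i.e. page-boundary crossings)."""
    crossings = 0
    prev_page = start // page_size
    for i in range(1, count):
        page = (start + i * stride) // page_size
        if page != prev_page:
            crossings += 1
            prev_page = page
    return crossings


LINE_BYTES = 64            # assumed cache line size, not from the post
STREAM_BYTES = 1 << 20     # assumed 1 MiB sequential stream
ACCESSES = STREAM_BYTES // LINE_BYTES

for page_size in (4096, 16384):
    n = page_crossings(0, LINE_BYTES, ACCESSES, page_size)
    print(f"{page_size // 1024:>2} KiB pages: {n} page crossings")
```

Under these assumptions, a prefetcher that has to stop or re-establish its stream at every physical page boundary would hit that boundary about four times less often with 16 KiB pages. That only matters for prefetchers working on physical addresses; one that trains and issues in virtual space, as Samsung describes for L1, is not affected by the boundary in the same way.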

Samsung also says that they transfer a lot of metadata, related to cache utilization and prefetching, up and down the cache hierarchy along with the line data. It would not be a huge stretch to augment such a scheme with even more metadata giving an "outer level" prefetcher information about the physical page of the next virtual page of a stream.

The big question is: what's the problem that the Apple 2 "outer level" prefetchers are trying to solve? In a perfect world, one can imagine that the optimal way to run prefetching is 100% controlled and co-ordinated at the core. That's the level that knows the stream of references best (and I was thrilled to see that Samsung are doing smart things like restoring load/store reference order before training their prefetchers). IF you can control everything at the L1 level, then do you even need an "outer" prefetcher? (Note that control at L1 doesn't mean everything is necessarily fetched INTO L1; again in a perfect world, low-confidence fetches, or far-future fetches, could be fetched into system cache or L2, even though co-ordinated at L1.)
So does Apple operate like that?

An alternative viewpoint is that there are at least some workloads that involve data flow between multiple cores, e.g. a producer/consumer setup, or CPU/GPU/ISP/NPU co-ordination. Is there enough structure in those, enough patterning visible at the L2 or system cache level, to train a prefetcher? And for those cases, is there enough flow across page boundaries that anyone cares?
(An even more obvious question is the extent to which Apple exploits multiple page sizes. Obviously much ISP and GPU data is fairly large, of order MB. But when I looked (admittedly not really knowing what I was doing) in Darwin source I could find no reference anywhere to anything but a 16kiB page size. Note that I'm not asking the OS to handle something more complicated like transparent large pages, all I'm talking about is the use of static larger pages, for temporary short term purposes.
If Apple doesn't see much value in that, do we take that to mean that it's not worth worrying about for Apple's current task profile? Or that it's a hard problem that's on their agenda but they haven't got to it yet? Or that they're being dumb and leaving this particular performance win on the table because there's no particular champion within Apple willing to push it?)