By: --- (---.delete@this.redheron.com), August 30, 2022 11:29 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 29, 2022 9:54 pm wrote:
> > > You clearly have no clue the lengths Apple go to to save energy in Fetch.
> >
> > No it's more that you don't understand physical design. Building wider machinery so you
> > can do more work in one cycle so you can "sleep" for a few cycles is not a good thing. The
> > "race to idle" idea you might be basing it on operates on utterly different scales.
> >
>
> Or that reading patents is not the same as showing something is in use in a design. Companies patent
> things all the time just in case they might employ the strategy. That doesn't mean they're using it
> or will use it any time in the future, and doesn't mean anyone else will use it either. Also, patents
> are often extremely vague. That's partially so lawyers have a ton of room to claim patent infringement.
> It also makes them worthless for claiming a certain microarchitecture detail exists.
>
> If you want to claim that "Apple Fetch prediction predicts the trace width, not just the trace address", your
> post needs to include more detail than just patents. Ditto for whether a loop buffer or L0 icache exists.
>
> > BTW In spite of Apple Fetch being so advanced, there actually remains a lot they can still do!
> > While they were early adopters of Decoupled Fetch (as has, I think, *everyone* nowadays, eventually),
> > they only have "first stage" decoupling, with the pipeline looking like
> > [Fetch Address Predict] -> [Fetch cache Access] -> Queue of Instructions -> [Decode].
> > They have not adopted the next step (neither has anyone
> > else yet?), as suggested in Glenn Reinman's thesis, of
> > [Fetch Address Predict] -> Queue of predicted addresses ->
> > [Fetch cache Access] -> Queue of Instructions -> [Decode].
>
> Pretty sure the second is what everyone does in their high performance designs, except for
> Apple. For the past decade, and more. Apple actually seems to be unique in doing [fetch
> address predict] -> [fetch cache access] -> [fetch result drives next prediction] for their
> main BTB level, based on how branch latency jumps as the loop exceeds L1i size.
What Apple is doing (in that respect) is described in
(2016) https://patents.google.com/patent/US20140075156A1 Scan-on-fill next fetch target prediction (yeah, yeah, you don't believe patents, whatever).
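To make concrete what I read that patent as describing, here's a toy sketch. This is my interpretation of the poster's "scan-on-fill" summary, not a confirmed implementation; every name, type, and number below is mine, not Apple's:

# Toy model of scan-on-fill next-fetch-target prediction, as I read
# the idea behind US20140075156A1. Structure and names are invented.
from dataclasses import dataclass
from typing import Callable, Optional

LINE_BYTES = 64

@dataclass
class Insn:
    is_branch: bool = False
    target: Optional[int] = None   # static branch target, if any

def scan_on_fill(line_addr: int, line: list[Insn],
                 predict_taken: Callable[[Insn], bool]) -> int:
    # On an L1i fill, scan the incoming line for the first
    # predicted-taken branch; its target becomes the line's stored
    # next-fetch pointer. No taken branch -> fall through sequentially.
    for insn in line:
        if insn.is_branch and predict_taken(insn):
            return insn.target
    return line_addr + LINE_BYTES

# Fetch then chains line -> stored pointer, with no separate BTB lookup
# in the common case. But since the pointer is computed at fill time,
# once a loop no longer fits in L1i the fill-and-scan path is paid again
# each trip round -- consistent with Chester's observation that branch
# latency jumps when the footprint exceeds L1i.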
> Even Sandy Bridge does [Fetch Address Predict] -> Queue of predicted addresses -> [Fetch cache Access]
> -> Queue of Instructions -> [Decode], as branch latency doesn't substantially increase as the loop exceeds
> L1i capacity. It so happens that Apple using the former method doesn't matter much - they have such a
> honking huge L1i that the miss case is likely substantially rarer than on other architectures.
What you say about Sandy Bridge is interesting, but with deep enough queues in various places (especially the obvious queue of *basic* Decoupled Fetch, the one between cache access and Decode), L2 access latency can be hidden.
Not to mention that basic sequential or similar I-prefetching might be doing the job...
It is very difficult to be sure which buffering queue is hiding the latency. At the very least I'd want to be sure that the more obvious candidates suggested above were not in play.
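As a sanity check on what "deep enough" means, here's the back-of-envelope arithmetic. All numbers are illustrative placeholders, not measurements from any real core:

# How deep must the fetch->decode queue be to hide an L1i miss that
# hits in L2? Every number here is an invented example value.
DECODE_WIDTH   = 8    # instructions consumed per cycle
L2_HIT_LATENCY = 15   # cycles to fill a line from L2

# While one isolated miss is outstanding, Decode drains the queue at
# full rate, so the queue must hold at least this many instructions:
depth = DECODE_WIDTH * L2_HIT_LATENCY
print(f"queue depth to hide one isolated L2 hit: {depth}")   # -> 120

A queue of a hundred-plus instructions is large but not implausible, which is exactly why latency curves alone can't tell you which mechanism is doing the hiding.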
My best-guess analysis (based on some Apple knowledge, but zero Intel knowledge) is that Apple have split I-prefetch into two distinct parts:
- one is "long range" prefetch, triggered and governed essentially by calls and returns. Call history is used to drive it, but it will not kick in simply for very long single functions. In the academic literature this is similar to something called RDIP (a toy sketch follows this list).
- the second looks something like basic FDIP, driven by Fetch being decoupled from Decode, so that Fetch runs a long way ahead of Decode and the queue between Fetch and Decode can cover a miss to L2. There is probably some slight augmentation here (e.g. right after an RDIP jump, not just the target line but a few successor lines are also pulled in, though I know nothing of this).
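For the first part, a minimal sketch of the RDIP idea (after Kolli et al.'s return-address-stack directed prefetching; whether Apple's mechanism actually looks like this is pure conjecture on my part, and the table sizes and hash depth are arbitrary):

# Minimal RDIP-style long-range I-prefetch sketch. A signature of
# recent call/return history indexes a table of cache lines that
# missed last time execution was "here"; those are prefetched on the
# next visit. Note a long straight-line function never changes the
# signature, so this scheme does not kick in for it.
call_history = []      # behaves like a return-address stack
prefetch_table = {}    # signature -> set of I-cache line addresses

def signature() -> int:
    # Hash the top few return addresses; depth 4 is an arbitrary pick.
    return hash(tuple(call_history[-4:]))

def prefetch(line_addr: int):
    pass  # stand-in for issuing a real prefetch request

def on_call(return_addr: int):
    call_history.append(return_addr)
    for line in prefetch_table.get(signature(), ()):
        prefetch(line)            # long-range prefetch on revisit

def on_return():
    if call_history:
        call_history.pop()

def on_i_miss(line_addr: int):
    # Train: remember which lines missed under the current signature.
    prefetch_table.setdefault(signature(), set()).add(line_addr)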
What this ultimately means is that, if the long branch loops are basically sequential I-cache accesses, an old-school sequential I-cache prefetcher will behave very differently from Apple's I-prefetcher, because Apple does not seem to have a generic sequential I-prefetcher (though probably does have one immediately after a long-distance jump).
The nice thing about the second decoupling queue is that it allows more substantial hiding of more "random" bouncing around inside a footprint too large for L1i but fitting in L2, which is why I suspect Apple will gravitate towards it in time, but have not yet prioritized it. Of course such a scheme would benefit smaller I-caches even more, yet from the literature I don't get the feeling that any designs have implemented it yet.
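To pin down why that second queue helps even "random" footprints: in Reinman-style decoupling the predictor pushes addresses into a fetch-target queue (FTQ) in front of the cache access, so a miss can start its L2 fill when the address is predicted, not when the cache is finally probed. The predictor needs no sequential pattern, only to run ahead. A schematic model (parameters invented for illustration):

# Toy model of the benefit of an FTQ in front of the cache access
# stage. All parameters are invented for illustration.
L2_LATENCY    = 15   # cycles for an L2 hit
PREDICT_AHEAD = 24   # cycles prediction runs in front of cache access

def fetch_stall(predict_cycle: int, access_cycle: int) -> int:
    # With the FTQ, the L2 fill starts when the address is predicted;
    # fetch only stalls for whatever latency is not yet covered.
    fill_done = predict_cycle + L2_LATENCY
    return max(0, fill_done - access_cycle)

# Prediction running PREDICT_AHEAD cycles ahead hides the whole miss:
print(fetch_stall(0, PREDICT_AHEAD))                 # -> 0
# Without the FTQ the fill only starts at the access itself:
print(fetch_stall(PREDICT_AHEAD, PREDICT_AHEAD))     # -> 15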