By: anon2 (anon.delete@this.anon.com), August 30, 2022 1:35 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on August 29, 2022 11:37 pm wrote:
> > POWER9 seems to do similar. I'm sure the PA6T PowerPC heritage
> > link is purely coincidental. From OpenPOWER user
> > manual: "As instructions are fetched, they are scanned
> > for branches. Up to eight branches are simultaneously
> > processed by the branch prediction logic that predicts both
> > the direction and/or target of the branches, depending
> > on the branch type." It does have some kind of L0 predictor
> > (BTAC) which does predict addresses as well though,
> > unclear how it works, but it is very fast, so not like the huge multi-level target buffers of x86 CPUs.
>
> Seems common to have both. For example, Cortex A72 seems to have a decoupled 64-entry L1 BTB.
> After that, it seems to have a second-level BTB coupled to the L1i that can track up to 4096
> branch targets, as long as the branches don't spill out of the 48 KB L1i. So if there's a branch
> every 16 bytes, it can track 3072 targets. Or 768 if there's a branch every 64 bytes, and so on.
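To spell out that capacity arithmetic in one place - the 48 KB and 4096-entry figures are from the paragraph above, the branch spacings are the same illustrative ones, nothing here is measured:

#include <stdio.h>

/* Rough capacity arithmetic for a BTB coupled to the L1i, using the
   figures quoted above (48 KB L1i, up to 4096 target entries). */
int main(void) {
    const int l1i_bytes   = 48 * 1024;   /* 48 KB L1i */
    const int max_entries = 4096;        /* cap on tracked targets */

    for (int spacing = 16; spacing <= 64; spacing *= 2) {
        int tracked = l1i_bytes / spacing;
        if (tracked > max_entries)
            tracked = max_entries;
        printf("one branch per %2d bytes -> ~%d targets tracked\n",
               spacing, tracked);
    }
    return 0;
}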
>
> If a branch target comes out of the L1 BTB, there's a one cycle penalty.
> If it comes out of the second level BTB/L1i, there's a 2 cycle penalty.
>
> > It makes sense that Apple really likes a very large I$ if they are doing coupled fetch. And coupled
> > fetch has real benefits: you don't have to predict the presence of a branch, and you don't have
> > to predict the target for direct branches (which should be the large majority).
>
> You still have to predict the presence of a branch and predict that it's going to be taken.
If you have some L0 structure to minimize the pipeline bubble on taken branches, yes (which presumably everyone has). Such a thing has to predict the address as well; in terms of what it predicts, it's not really different from the BTB in a decoupled fetch pipeline.
> The advantage
> is you don't have to index into a separate BTB structure to fetch the branch target if you predict it's
> taken. You get the predicted branch target along with the L1i fetch, no other lookup needed.
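Just to picture the difference, something like this (hypothetical layouts, none of the field names or sizes are from a real core):

#include <stdint.h>
#include <stdbool.h>

/* Coupled: branch info rides along with the I-cache line, so one L1i
   access yields both the instruction bytes and the predicted target. */
struct l1i_line {
    uint8_t  bytes[64];          /* instruction bytes */
    bool     has_branch;         /* predecoded: line contains a branch */
    uint8_t  branch_offset;      /* where in the line it sits */
    uint64_t predicted_target;   /* next fetch address if predicted taken */
};

/* Decoupled: a separate BTB indexed by fetch address, independent of
   whether the corresponding line is even present in the L1i. */
struct btb_entry {
    bool     valid;
    uint64_t tag;                /* which fetch block this entry covers */
    uint64_t predicted_target;
};

With the first layout the target arrives with the line itself, no second lookup; with the second, the table can still be consulted when the line isn't in the L1i, which is the prefetch-ahead point below.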
>
> The downside is you have no clue where to go next once you miss L1i. Branches are pretty
> common, so if there is a taken branch coming up, you won't be able to follow it. With a
> decoupled BTB, you can still index into that even if your L1i fetch missed, and prefetch
> far enough to cover L2 latency - assuming your branch predictor is reasonably accurate.
Right. You could also accommodate that with an I$ prefetch prediction structure that could be far cheaper than a branch predictor: you don't have to predict every single branch, only I$ misses, which might be fewer by an order of magnitude. And with an L2 latency about on par with the mispredict penalty, doing nothing at all might already cost roughly as much as what a 90%-accurate branch predictor's mispredicts cost. It might not take much to push such a structure close to what a state-of-the-art branch predictor would buy you, especially with the very big I$, big fast L2, and close memory Apple has.
IOW, decoupled fetch with big BTBs may not *be* the next step for Apple.
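To put rough numbers on that - all of these rates are made up, the only inputs from above are "misses ~10x rarer than branches" and "L2 latency ~ mispredict penalty":

#include <stdio.h>

/* Back-of-envelope: frontend stall from eating the full L2 latency on
   every I$ miss (no prefetch prediction at all) vs. the stall a machine
   with a 90%-accurate predictor already pays on mispredicts. */
int main(void) {
    const double branches_per_1k = 200.0;                 /* assume 1 branch per 5 insns */
    const double imisses_per_1k  = branches_per_1k / 10;  /* ~10x rarer than branches */
    const double l2_latency      = 14.0;                  /* cycles, assumed */
    const double mispredict_cost = 14.0;                  /* cycles, ~par with L2 */
    const double accuracy        = 0.90;

    double do_nothing       = imisses_per_1k * l2_latency;
    double mispredict_stall = branches_per_1k * (1.0 - accuracy) * mispredict_cost;

    printf("uncovered I$ miss stall: %.0f cycles per 1k insns\n", do_nothing);
    printf("existing mispredict stall at 90%% accuracy: %.0f cycles per 1k insns\n",
           mispredict_stall);
    return 0;
}

Both come out the same under those assumptions, which is the point: the uncovered-miss penalty is on the same order as what mispredicts already cost, so a fairly modest I$ prefetch predictor closes most of the remaining gap.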
>
> > This allows equivalent
> > accuracy with smaller structures. Downside being you don't get I$ prefetch prediction from the same
> > structure, and you couldn't avoid taken branch bubbles at high clock frequency. But that doesn't mean
> > you can't have a prefetch prediction from other structures, as perhaps POWER9 does.
>
> Yeah, branch targets stored alongside L1i (coupled BTB) tend to take more than 1 cycle latency to access.