RKL taken branch throughput

By: Travis Downs (travis.downs.delete@this.gmail.com), May 10, 2021 9:00 pm
Chester (lamchester.delete@this.gmail.com) on May 10, 2021 5:25 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 2:57 pm wrote:
> > A worthwhile read on probing BTB behavior and size, including Intel, AMD and M1 chips:
> >
> > How many ifs are too many?
> >
> > One thing that caught my eye is that Marek measures better than one taken branch per
> > cycle on Zen 3 (EPYC 7713), at least for code that fits in the L1 icache. That surprises
> > me since I'm not aware of any mainstream uarch that can execute more than 1 taken branch
> > per cycle (plenty can execute more than 1 untaken branch per cycle).
> >
> > Maybe it's just measurement error (e.g., due to turbo above
> > the expected frequency), or can Zen 3 really do this?
> >
> Rocket Lake/11900K actually goes beyond Zen 3 here. Up to 8 branches, it can do two taken
> jumps per cycle (around 0.1 ns per jump). If there's only one branch (just the backward
> taken branch at the end of the loop), or more than 8 branches, it's 1 taken jump per cycle. Once there
> are more than 256 branches in the loop, it climbs to 2 cycles per branch.

Interesting, I guess it's the same on ICL too, then. I did some similar tests on Skylake, but haven't tested ICL yet.
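
For anyone who wants to try this kind of test, here is a minimal sketch of a generator for the loop body: a chain of forward unconditional jumps, each landing on the next 16-byte boundary, closed by a single backward taken branch. It is not the code anyone in this thread actually ran. The NASM syntax, the function name make_jump_chain, the label names and the use of rdi as the iteration counter are all assumptions made for illustration.

```python
def make_jump_chain(n_jumps: int, spacing: int = 16) -> str:
    """Return NASM source for a loop with n_jumps forward jmps
    plus one backward taken loop branch (hypothetical test layout)."""
    lines = [
        "bits 64",
        "global jump_chain",
        "section .text",
        "jump_chain:                ; rdi = iteration count, must be >= 1",
        f"align {spacing}",
        ".top:",
    ]
    for i in range(n_jumps):
        # Each jmp targets a label placed at the next aligned boundary,
        # so consecutive taken branches are `spacing` bytes apart.
        lines.append(f"    jmp .t{i}")
        lines.append(f"align {spacing}")
        lines.append(f".t{i}:")
    # The only backward branch: taken on every iteration except the last.
    lines += ["    dec rdi", "    jnz .top", "    ret"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 8 forward jumps + 1 loop branch = 9 taken branches per iteration.
    print(make_jump_chain(8))
```

Timing a large iteration count and dividing by iterations * (n_jumps + 1) gives time per taken branch. Sweeping n_jumps is what would expose the steps in the curve (8, 256, and so on).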

> I wasn't able to replicate their Zen 3 result when I got someone to run it on a 5950X.
> I got one branch per cycle up to ~1024 branches (0.2 ns per branch, lining up with
> 5.05 GHz max boost), after which it increases to ~3 cycles per branch at 2048 branches.
> I didn't see any other CPU get more than 1 taken branch per cycle either.
> I wrote my test well before this article came out (simple test, only forward unconditional jumps
> spaced out by 16 bytes except the loop branch which is taken backward), so it's not directly comparable.
> But from the graph it doesn't seem far below 1 so I suspect they didn't account for boost.

That's my guess too.
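
The arithmetic behind that guess, as a rough sketch: cycles per branch is just measured time per branch multiplied by the clock the core was actually running at. The 0.2 ns and 5.05 GHz figures are the 5950X numbers quoted above. The 3.4 GHz value is only an illustrative non-boost clock picked for this example, not something taken from the article.

```python
def cycles_per_branch(ns_per_branch: float, freq_ghz: float) -> float:
    """Convert time per branch to cycles per branch.
    1 GHz is one cycle per nanosecond, so the product is in cycles."""
    return ns_per_branch * freq_ghz


measured_ns = 0.2  # quoted above for the 5950X

# Using the clock the core actually boosted to (quoted above).
at_boost = cycles_per_branch(measured_ns, 5.05)

# Same measurement, but wrongly assuming a lower, non-boost clock
# (3.4 GHz is an illustrative assumption).
at_assumed = cycles_per_branch(measured_ns, 3.4)

if __name__ == "__main__":
    print(f"at boost clock:   {at_boost:.2f} cycles/branch")
    print(f"at assumed clock: {at_assumed:.2f} cycles/branch")
```

With the real clock this comes out at about 1 cycle per branch. With the too-low assumed clock the same measurement looks like well under 1 cycle per branch, i.e. an apparent result of more than one taken branch per cycle.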

