RKL taken branch throughput

By: Travis Downs (travis.downs.delete@this.gmail.com), May 14, 2021 10:34 pm
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on May 11, 2021 10:04 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 9:00 pm wrote:
> > Chester (lamchester.delete@this.gmail.com) on May 10, 2021 5:25 pm wrote:
> > > Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 2:57 pm wrote:
> > > > A worthwhile read on probing BTB behavior and size, including Intel, AMD and M1 chips:
> > > >
> > > > How many ifs are too many?
> > > >
> > > > One thing that caught my eye is that Marek measures better than one taken branch per
> > > > cycle on Zen 3 (EPYC 7713), at least for code that fits in the L1 icache. That surprises
> > > > me since I'm not aware of any mainstream uarch that can execute more than 1 taken branch
> > > > per cycle (plenty can execute more than 1 untaken branches per cycle).
> > > >
> > > > Maybe it's just measurement error (e.g., due to turbo above
> > > > the expected frequency), or can Zen 3 really do this?
> > > >
> > >
> > > Rocket Lake/11900K actually goes beyond Zen 3 here. Up to 8 branches, it can do two taken
> > > jumps per cycle (around 0.1 ns per jump). If there's only one branch (taken backward one
> > > at the end of the loop), or more than 8 branches, it's 1 taken jump per cycle. Once there
> > > are more than 256 branches in the loop, it climbs to 2 cycles per branch.
> >
> > Interesting, I guess it is the same on ICL then too. I had done
> > some similar tests on Skylake, but never tested it on ICL yet.
> >
>
> Likely the same yeah, though I never got anyone to test on ICL or TGL.
>
> What did you get for Skylake? I tested on my i5-6600K and got some strange results. It couldn't
> do 1 branch per cycle for >4 branches, but there's no clean jump. At 16 branches, it was 1.57
> cycles per branch. And it slowly climbed to 2 cycles per branch at 512 branches.
>

Well, I never ran the Cloudflare code; my numbers are from a different set of tests in uarch-bench. There I do get 1 cycle/branch for more than 4 branches (20 in this case, but that number was picked arbitrarily and is not the limit beyond which 1 cycle is no longer possible). You can run them with --test-name=branch/*. I also found some odd/even effect whose details I have forgotten and apparently didn't write down.

> I wonder if the disabled lsd/loop buffer had anything to do with it, because Sandy Bridge was able to
> do up to 8 branches at 1 cycle per branch (and Haswell up to 128, matching Agner's observations).

Yeah, the LSD seems promising as a way to get 2 jumps/cycle. We already know that the LSD essentially "unrolls" the loop in the LSD buffer (actually the IDQ), so it is in a sense a mini trace cache. Perhaps the jmps are removed and the uops on both sides of each jump simply appear back to back in the LSD. This would be easy to test.

It does seem like jmp handling has changed in ICL: a jmp no longer results in an executed uop. The tests involved are much too large to fit into the LSD: most uops come from MITE (i.e., the legacy decoders).

>
> > > I wasn't able to replicate their Zen 3 result when I got someone to run it on a 5950X.
> > > I got one branch per cycle up to ~1024 branches (0.2 ns per branch, lining up with
> > > 5.05 GHz max boost), after which it increases to ~3 cycles per branch at 2048 branches.
> > > I didn't see any other CPU get more than 1 taken branch per cycle either.
> > >
> > > I wrote my test well before this article came out (simple test, only forward unconditional jumps
> > > spaced out by 16 bytes except the loop branch which is taken backward), so it's not directly comparable.

This is similar to my test. Space them out by 32 bytes and you should see 1 per cycle throughput on Intel: only the "first" branch in every 32-byte chunk seems to get the fast treatment.
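
For anyone who wants to try the spacing experiment, here is a rough sketch of a generator for that kind of loop. It emits NASM-style text with one forward jmp per aligned chunk plus the backward loop branch. The function name, the label names and the use of rdi as the iteration counter are all made up for illustration; this is not the uarch-bench or Cloudflare code, and the timing harness around it is left out.

```python
def jump_chain_asm(n_branches: int, spacing: int = 32) -> str:
    """Return NASM-style text for a loop with n_branches taken branches.

    Each branch starts at the beginning of its own `spacing`-byte chunk.
    Hypothetical helper: names and register choice are illustrative only.
    """
    if n_branches < 1:
        raise ValueError("need at least the backward loop branch")
    if spacing < 16 or spacing & (spacing - 1):
        raise ValueError("spacing must be a power of two, at least 16")

    lines = [
        "; assumed convention: rdi holds the iteration count on entry",
        f"align {spacing}",
        "top:",
    ]
    # n_branches - 1 forward unconditional jumps, each jumping to the
    # start of the next aligned chunk (the padding is skipped, not executed).
    for i in range(n_branches - 1):
        lines.append(f"    jmp .t{i}")
        lines.append(f"align {spacing}")
        lines.append(f".t{i}:")
    # The last taken branch is the backward loop branch.
    lines.append("    dec rdi")
    lines.append("    jnz top")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(jump_chain_asm(4, spacing=32))
```

Calling it with spacing=16 and then spacing=32 gives the two layouts being compared here; the branch count is the other knob to sweep.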

> > > But from the graph it doesn't seem far below 1 so I suspect they didn't account for boost.
> >
> > That's my guess too.

See also here.
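
To put numbers on the boost explanation, here is a small sketch of the conversion. The 0.2 ns and 5.05 GHz figures are the 5950X numbers quoted above; the 3.4 GHz value is purely a hypothetical "wrong" clock for illustration, and I am not claiming that is what the article used.

```python
def cycles_per_branch(ns_per_branch: float, ghz: float) -> float:
    """Convert time per taken branch (ns) to cycles at a given clock (GHz)."""
    return ns_per_branch * ghz  # ns * (cycles per ns)


measured_ns = 0.2   # per-branch time quoted above for the 5950X
actual_ghz = 5.05   # boost clock quoted above
assumed_ghz = 3.4   # hypothetical lower clock assumed by mistake

real = cycles_per_branch(measured_ns, actual_ghz)
apparent = cycles_per_branch(measured_ns, assumed_ghz)

print(f"at {actual_ghz} GHz: {real:.2f} cycles/branch")
print(f"at {assumed_ghz} GHz: {apparent:.2f} cycles/branch "
      f"({1 / apparent:.2f} branches/cycle)")
```

The same measured time works out to about 1 cycle per branch at the real clock, but looks like well over 1 taken branch per cycle if the clock is underestimated.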
