RKL taken branch throughput

By: --- (---.delete@this.redheron.com), May 15, 2021 10:07 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on May 14, 2021 10:34 pm wrote:
> Chester (lamchester.delete@this.gmail.com) on May 11, 2021 10:04 pm wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 9:00 pm wrote:
> > > Chester (lamchester.delete@this.gmail.com) on May 10, 2021 5:25 pm wrote:
> > > > Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 2:57 pm wrote:
> > > > > A worthwhile read on probing BTB behavior and size, including Intel, AMD and M1 chips:
> > > > >
> > > > > How many ifs are too many?
> > > > >
> > > > > One thing that caught my eye is that Marek measures better than one taken branch per
> > > > > cycle on Zen 3 (EPYC 7713), at least for code that fits in the L1 icache. That surprises
> > > > > me since I'm not aware of any mainstream uarch that can execute more than 1 taken branch
> > > > > per cycle (plenty can execute more than 1 untaken branch per cycle).
> > > > >
> > > > > Maybe it's just measurement error (e.g., due to turbo above
> > > > > the expected frequency), or can Zen 3 really do this?
> > > > >
> > > >
> > > > Rocket Lake/11900K actually goes beyond Zen 3 here. Up to 8 branches, it can do two taken
> > > > jumps per cycle (around 0.1 ns per jump). If there's only one branch (the taken backward one
> > > > at the end of the loop), or more than 8 branches, it's 1 taken jump per cycle. Once there
> > > > are more than 256 branches in the loop, it climbs to 2 cycles per branch.
> > >
> > > Interesting, I guess it is the same on ICL then too. I had done
> > > some similar tests on Skylake, but never tested it on ICL yet.
> > >
> >
> > Likely the same yeah, though I never got anyone to test on ICL or TGL.
> >
> > What did you get for Skylake? I tested on my i5-6600K and got some strange results. It couldn't
> > do 1 branch per cycle for >4 branches, but there's no clean jump. At 16 branches, it was 1.57
> > cycles per branch. And it slowly climbed to 2 cycles per branch at 512 branches.
> >
>
> Well I never ran the Cloudflare code, this was from a different set of tests in uarch-bench. I do
> get 1 cycle/branch for more than 4 branches (20 in this case, but that number is picked arbitrarily,
> not the limit beyond which 1 cycle is no longer possible). You can run them with --test-name=branch/*.
> I also found some odd/even effect that I forget now and apparently didn't write down.
>
> > I wonder if the disabled LSD/loop buffer had anything to do with it, because Sandy Bridge was able to
> > do up to 8 branches at 1 cycle per branch (and Haswell up to 128, matching Agner's observations).
>
> Yeah, the LSD seems promising as a way to get 2 jumps/cycle. We already know that the LSD essentially "unrolls"
> the loop in the LSD buffer (actually the IDQ), and so is in a sense a mini trace-cache. Perhaps jmps are removed
> and the executed uops on both sides just appear linearly in the LSD. This would be easy to test.
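
For anyone who wants to try that, here is a rough sketch of a generator for such a test loop. It is not the Cloudflare or uarch-bench code; the function name, label names and NASM-style syntax are my own assumptions. It emits a loop body with n taken branches per iteration: n - 1 unconditional forward jmps, each skipping a nop, plus the backward jnz that closes the loop.

```python
def taken_jump_loop(n_jumps: int, iterations: int = 1_000_000) -> str:
    """Return NASM-style text for a loop with n_jumps taken branches per iteration.

    Hypothetical helper for illustration only: label names and layout are
    assumptions, not taken from any existing benchmark.
    """
    if n_jumps < 1:
        raise ValueError("need at least the loop-closing branch")
    lines = [f"    mov ecx, {iterations}", "top:"]
    for i in range(n_jumps - 1):
        lines.append(f"    jmp t{i}")   # unconditional, always taken
        lines.append("    nop")         # filler that is jumped over, never executed
        lines.append(f"t{i}:")
    lines.append("    dec ecx")
    lines.append("    jnz top")         # backward branch, taken until the final iteration
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 8 taken branches per iteration: 7 forward jmps plus the loop-closing jnz.
    print(taken_jump_loop(8, iterations=1000))
```

Timing the assembled loop for a range of n, once on a part with the LSD enabled and once on one with it disabled, should show whether the 2 jumps/cycle region tracks the LSD.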

https://patents.google.com/patent/US9753733B2
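
As a sanity check on the figures quoted above, a time per jump can be converted into jumps per cycle once a clock speed is picked. The thread does not say what frequency the 11900K was running at during the measurement, so the 5.0 GHz below is purely an assumed value for illustration, and the function name is mine.

```python
def jumps_per_cycle(ns_per_jump: float, clock_ghz: float) -> float:
    """Convert a measured time per taken jump into taken jumps per core cycle."""
    # GHz is cycles per nanosecond, so ns * GHz gives cycles per jump.
    cycles_per_jump = ns_per_jump * clock_ghz
    return 1.0 / cycles_per_jump


# ASSUMPTION: 5.0 GHz is an illustrative clock, not a value stated in the thread.
ASSUMED_CLOCK_GHZ = 5.0

if __name__ == "__main__":
    # "around 0.1 ns per jump" from the quoted Rocket Lake result
    print(round(jumps_per_cycle(0.1, ASSUMED_CLOCK_GHZ), 2))
```

At that assumed clock, 0.1 ns per jump works out to about two taken jumps per cycle, which matches the quoted claim. A different actual frequency would shift the ratio proportionally.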
