By: Chester (lamchester.delete@this.gmail.com), May 11, 2021 10:04 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 9:00 pm wrote:
> Chester (lamchester.delete@this.gmail.com) on May 10, 2021 5:25 pm wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on May 10, 2021 2:57 pm wrote:
> > > A worthwhile read on probing BTB behavior and size, including Intel, AMD and M1 chips:
> > >
> > > How many ifs are too many?
> > >
> > > One thing that caught my eye is that Marek measures better than one taken branch per
> > > cycle on Zen 3 (EPYC 7713), at least for code that fits in the L1 icache. That surprises
> > > me since I'm not aware of any mainstream uarch that can execute more than 1 taken branch
> > > per cycle (plenty can execute more than 1 untaken branches per cycle).
> > >
> > > Maybe it's just measurement error (e.g., due to turbo above
> > > the expected frequency), or can Zen 3 really do this?
> > >
> >
> > Rocket Lake/11900K actually goes beyond Zen 3 here. Up to 8 branches, it can do two taken
> > jumps per cycle (around 0.1 ns per jump). If there's only one branch (taken backward one
> > at the end of the loop), or more than 8 branches, it's 1 taken jump per cycle. Once there
> > are more than 256 branches in the loop, it climbs to 2 cycles per branch.
>
> Interesting, I guess it is the same on ICL then too. I had done
> some similar tests on Skylake, but never tested it on ICL yet.
>
Likely the same, yeah, though I never got anyone to test on ICL or TGL.
What did you get for Skylake? I tested on my i5-6600K and got some strange results. It couldn't sustain 1 taken branch per cycle with more than 4 branches, but there's no clean jump either. At 16 branches it was 1.57 cycles per branch, and it slowly climbed to 2 cycles per branch at 512 branches.
I wonder if the disabled LSD (loop buffer) had anything to do with it, because Sandy Bridge was able to do up to 8 branches at 1 cycle per branch (and Haswell up to 128, matching Agner's observations).
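For anyone comparing numbers across posts: the ns-per-branch and cycles-per-branch figures in this thread are related by the core clock, since 1 GHz is one cycle per nanosecond. A minimal sketch of that conversion follows, using the 5950X figure from this thread (0.2 ns per branch at 5.05 GHz). The function name and the 4.0 GHz "assumed" clock are illustrative only, not measured or published values; the second call just shows how dividing by the wrong clock can make a result look like more than one taken branch per cycle.

```python
def cycles_per_branch(ns_per_branch: float, clock_ghz: float) -> float:
    """Convert time per taken branch into core cycles per taken branch.

    1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return ns_per_branch * clock_ghz


# Figure from the thread: 0.2 ns per branch on a 5950X at 5.05 GHz max boost.
actual = cycles_per_branch(0.2, 5.05)

# Hypothetical: the same timing interpreted with a lower assumed clock.
# 4.0 GHz is an illustrative value, not a measured or published one.
apparent = cycles_per_branch(0.2, 4.0)

print(f"at the real clock:    {actual:.2f} cycles per branch")
print(f"at the assumed clock: {apparent:.2f} cycles per branch")
```

With the real clock this comes out at about 1 cycle per branch; with the lower assumed clock the same timing looks like less than 1 cycle per branch, which is the kind of error unaccounted boost would produce.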
> > I wasn't able to replicate their Zen 3 result when I got someone to run it on a 5950X.
> > I got one branch per cycle up to ~1024 branches (0.2 ns per branch, lining up with
> > 5.05 GHz max boost), after which it increases to ~3 cycles per branch at 2048 branches.
> > I didn't see any other CPU get more than 1 taken branch per cycle either.
> >
> > I wrote my test well before this article came out (simple test, only forward unconditional jumps
> > spaced out by 16 bytes except the loop branch which is taken backward), so it's not directly comparable.
> > But from the graph it doesn't seem far below 1 so I suspect they didn't account for boost.
>
> That's my guess too.
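The test described in the quote (forward unconditional jumps spaced 16 bytes apart, plus one backward loop branch) can be sketched as a small generator that emits the assembly for a given branch count. This is not the actual test code from the thread, which was not posted; the NASM-style syntax, the `dec`/`jnz` loop counter, the register choice and all label names are assumptions made for illustration, and the timing harness is left out.

```python
def make_jump_chain(n_branches: int, spacing: int = 16) -> str:
    """Emit NASM-style assembly for a loop with n_branches taken branches.

    The first n_branches - 1 are forward unconditional jumps, each placed on
    a `spacing`-byte boundary and targeting the next boundary. The last one
    is the backward loop branch. Names and registers are hypothetical.
    """
    if n_branches < 1:
        raise ValueError("need at least the loop branch")
    lines = [
        "; rdi = iteration count (assumed calling convention)",
        "global jump_chain",
        "section .text",
        f"align {spacing}",
        "jump_chain:",
        "loop_top:",
    ]
    for i in range(1, n_branches):
        lines.append(f"    jmp target_{i}")  # forward taken jump
        lines.append(f"align {spacing}")     # padding is skipped, never executed
        lines.append(f"target_{i}:")
    lines.append("    dec rdi")
    lines.append("    jnz loop_top")         # backward taken loop branch
    lines.append("    ret")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_jump_chain(4))
```

Sweeping `n_branches` over powers of two and timing each generated loop would give a curve of time per branch against branch count; converting that to cycles still needs the clock the core actually ran at.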
Topic | Posted By | Date |
---|---|---|
Post looking at BTB behavior and size | Travis Downs | 2021/05/10 02:57 PM |
Post looking at BTB behavior and size | Anon | 2021/05/10 04:43 PM |
Post looking at BTB behavior and size | Travis Downs | 2021/05/10 08:59 PM |
Post looking at BTB behavior and size | Linus Torvalds | 2021/05/11 10:13 AM |
RKL taken branch throughput | Chester | 2021/05/10 05:25 PM |
RKL taken branch throughput | Travis Downs | 2021/05/10 09:00 PM |
RKL taken branch throughput | Chester | 2021/05/11 10:04 PM |
RKL taken branch throughput | Travis Downs | 2021/05/14 10:34 PM |
RKL taken branch throughput | --- | 2021/05/15 10:07 AM |