By: Adrian (a.delete@this.acm.org), September 22, 2021 5:05 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 22, 2021 5:48 am wrote:
> Adrian (a.delete@this.acm.org) on September 22, 2021 2:08 am wrote:
> >
> > For some reason, the first time I ran tst_16B I obtained worse results.
> >
> > When repeating the test, I now reproducibly obtain the following better values:
> >
> > (but the unaligned penalty remains under 10%)
> >
> >
> > tst_16B
> > 0 609
> > 1 668
> > 2 668
> > 3 668
> > 4 666
> > 5 667
> > 6 668
> > 7 667
> > 8 667
> > 9 667
> > 10 667
> > 11 668
> > 12 667
> > 13 668
> > 14 668
> > 15 668
> > 16 606
> > 17 670
> > 18 670
> > 19 671
> > 20 668
> > 21 670
> > 22 671
> > 23 670
> > 24 667
> > 25 670
> > 26 670
> > 27 670
> > 28 668
> > 29 670
> > 30 669
> > 31 669
> >
>
>
> Note that my tst_16b is not really suitable for exploring the
> sequential load bandwidth of the Zen3 L1D cache.
> According to my understanding of Zen3 execution resources, this test is bottlenecked
> by SIMD integer ALU throughput (2 ops/clock) rather than by L1D load throughput
> (3 loads/clock for 8/16/32/64/128-bit data, 2 loads/clock for 256-bit data).
>
>
I thought this might be the case, but unfortunately I am busy with work at the moment and do not have time to modify the benchmark to see what happens.
So I have just reported the results from your benchmark as-is, since they seem worth knowing.
It is indeed possible that the unaligned speed is limited by the cache and the aligned speed by the ALUs, and that because of this the penalty for unaligned accesses turns out to be negligible.
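For illustration, here is a minimal sketch (my own hypothetical code, not Michael's actual tst_16B) of the kind of inner loop under discussion: each 16-byte unaligned load feeds one SIMD-integer XOR, and with two independent accumulator chains the loop should be capped by Zen3's 2-per-clock SIMD integer ALU throughput rather than by its 3-per-clock 128-bit load throughput.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the actual tst_16B: XOR together 16-byte
   vectors starting at byte offset `off` into buffer `buf`.  Each
   unaligned load feeds one PXOR; the two independent accumulator
   chains let the SIMD integer ALUs sustain 2 ops/clock, which on
   Zen3 should be the bottleneck rather than the 3 loads/clock the
   L1D can sustain for 128-bit accesses. */
static __m128i scan_16B(const uint8_t *buf, size_t len, size_t off)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();
    for (size_t i = off; i + 32 <= len; i += 32) {
        acc0 = _mm_xor_si128(acc0,
                   _mm_loadu_si128((const __m128i *)(buf + i)));
        acc1 = _mm_xor_si128(acc1,
                   _mm_loadu_si128((const __m128i *)(buf + i + 16)));
    }
    return _mm_xor_si128(acc0, acc1);
}

Timing a loop like this over a buffer that fits in L1D, for off = 0..31, would produce the kind of table above; doing more independent ALU work per load, or consuming the loads in a way that needs no SIMD ALU op, would be one way to expose the raw load throughput instead.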