By: Michael S (already5chosen.delete@this.yahoo.com), September 22, 2021 4:48 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on September 22, 2021 2:08 am wrote:
>
> For some reason, the first time when I have run tst_16B I have obtained worse results.
>
> When repeating the test, now I obtain reproducibly the following better values:
>
> (but the unaligned penalty remains under 10%)
>
>
> tst_16B
> 0 609
> 1 668
> 2 668
> 3 668
> 4 666
> 5 667
> 6 668
> 7 667
> 8 667
> 9 667
> 10 667
> 11 668
> 12 667
> 13 668
> 14 668
> 15 668
> 16 606
> 17 670
> 18 670
> 19 671
> 20 668
> 21 670
> 22 671
> 23 670
> 24 667
> 25 670
> 26 670
> 27 670
> 28 668
> 29 670
> 30 669
> 31 669
>
Note that my tst_16b is not really suitable for exploring the sequential load bandwidth of the Zen3 L1D cache.
According to my understanding of Zen3 execution resources, this test is bottlenecked by SIMD integer ALU throughput (2 ops/clock) rather than by L1D load throughput (3 ops/clock for 8/16/32/64/128-bit data, 2 ops/clock for 256-bit data).
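
To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch. It uses only the per-clock figures quoted above. The loop shape (one 16-byte load feeding one SIMD integer op per iteration) is an assumption for illustration, since the tst_16b source is not shown here, and the function name is hypothetical.

```python
from fractions import Fraction

# Per-clock throughput figures as stated in the post
# (the author's understanding of Zen3, not measured here).
SIMD_INT_ALU_PER_CLOCK = 2
L1D_LOADS_PER_CLOCK = {128: 3, 256: 2}  # keyed by load width in bits


def min_clocks_per_iteration(loads, alu_ops, load_bits=128):
    """Return (lower bound on clocks per iteration, limiting resource)."""
    load_bound = Fraction(loads, L1D_LOADS_PER_CLOCK[load_bits])
    alu_bound = Fraction(alu_ops, SIMD_INT_ALU_PER_CLOCK)
    # The slower resource sets the bound; ties are attributed to the ALU.
    if alu_bound >= load_bound:
        return alu_bound, "SIMD integer ALU"
    return load_bound, "L1D load"


# ASSUMPTION: one 128-bit load and one SIMD integer op per iteration.
clocks, limiter = min_clocks_per_iteration(loads=1, alu_ops=1)
print(clocks, limiter)  # prints "1/2 SIMD integer ALU"
```

Under that assumed loop shape the load ports could sustain an iteration every 1/3 clock, but the ALUs only every 1/2 clock, so the test would measure ALU throughput and leave a third of the load bandwidth idle.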