By: Adrian (a.delete@this.acm.org), September 22, 2021 1:57 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:
> Here are results:
> 8-byte (64b) accesses:
> 0 1064
> 1 1170
> 2 1170
> 3 1170
> 4 1170
> 5 1170
> 6 1170
> 7 1170
> 8 1071
> 9 1171
> 10 1171
> 11 1171
> 12 1171
> 13 1171
> 14 1171
> 15 1171
> 16 1067
> 17 1170
> 18 1170
> 19 1170
> 20 1170
> 21 1170
> 22 1170
> 23 1170
> 24 1065
> 25 1170
> 26 1170
> 27 1170
> 28 1170
> 29 1170
> 30 1170
> 31 1170
>
> 16-byte (128b) accesses:
> 0 483
> 1 701
> 2 701
> 3 701
> 4 701
> 5 701
> 6 701
> 7 701
> 8 701
> 9 701
> 10 701
> 11 702
> 12 701
> 13 701
> 14 701
> 15 701
> 16 483
> 17 702
> 18 702
> 19 702
> 20 702
> 21 702
> 22 701
> 23 702
> 24 701
> 25 702
> 26 702
> 27 702
> 28 702
> 29 701
> 30 702
> 31 701
>
>
> 32-byte (256b) accesses:
> 0 256
> 1 468
> 2 468
> 3 468
> 4 468
> 5 468
> 6 468
> 7 468
> 8 468
> 9 468
> 10 468
> 11 468
> 12 468
> 13 468
> 14 468
> 15 468
> 16 468
> 17 468
> 18 468
> 19 468
> 20 468
> 21 468
> 22 468
> 23 468
> 24 468
> 25 468
> 26 468
> 27 468
> 28 468
> 29 468
> 30 468
> 31 468
>
> Misalignment penalty (of streaming add):
> 8-byte - 1.10x
> 16-byte - 1.45x
> 32-byte - 1.83x
>
Here are the Zen 3 results, obtained with your benchmark code without any modifications.
What is interesting is that Zen 3 appears to be worse than your results on 16B- or 32B-aligned data but better on unaligned data, so the misalignment penalty is much smaller in all cases.
Misalignment penalty (of streaming add):
8-byte - 1.08x
16-byte - 1.10x
32-byte - 1.09x
tst_8b
0 988
1 1042
2 1044
3 1048
4 1042
5 1041
6 1037
7 1038
8 985
9 1066
10 1063
11 1063
12 1062
13 1058
14 1065
15 1059
16 971
17 1004
18 1003
19 1007
20 1006
21 1012
22 1013
23 1014
24 993
25 1044
26 1039
27 1037
28 1037
29 1035
30 1033
31 1032
tst_16b
0 624
1 685
2 684
3 684
4 684
5 684
6 684
7 683
8 682
9 683
10 684
11 684
12 682
13 683
14 683
15 683
16 622
17 687
18 687
19 687
20 685
21 686
22 686
23 686
24 685
25 686
26 686
27 686
28 684
29 686
30 686
31 686
tst_32b
0 366
1 400
2 400
3 400
4 400
5 400
6 400
7 400
8 400
9 400
10 400
11 400
12 400
13 400
14 400
15 400
16 400
17 400
18 400
19 400
20 400
21 400
22 400
23 400
24 400
25 400
26 400
27 400
28 400
29 400
30 400
31 400
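
For anyone who wants to recompute the penalty figures from the tables, here is a minimal Python sketch. It assumes the penalty is the mean timing over misaligned offsets divided by the mean timing over aligned offsets; neither post states exactly how the ratios were derived, so this definition and the helper name misalignment_penalty are my own assumptions. For the noisier 8-byte tables another choice (for example worst case against best case) may give a somewhat different figure.

```python
from statistics import mean

def misalignment_penalty(timings, access_size):
    """Ratio of mean misaligned timing to mean aligned timing.

    timings: dict mapping byte offset -> measured value from the tables.
    access_size: access width in bytes (8, 16 or 32).
    Assumption: an offset counts as aligned when it is a multiple
    of the access size.
    """
    aligned = [t for off, t in timings.items() if off % access_size == 0]
    misaligned = [t for off, t in timings.items() if off % access_size != 0]
    return mean(misaligned) / mean(aligned)

# 32-byte tables copied from the posts: offset 0 is the only aligned case.
quoted_32b = {off: (256 if off == 0 else 468) for off in range(32)}
zen3_32b = {off: (366 if off == 0 else 400) for off in range(32)}

print(f"quoted 32-byte: {misalignment_penalty(quoted_32b, 32):.2f}x")
print(f"Zen 3  32-byte: {misalignment_penalty(zen3_32b, 32):.2f}x")
```

With the 32-byte tables this reproduces the 1.83x and 1.09x figures given above.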