By: Michael S (already5chosen.delete@this.yahoo.com), September 19, 2021 4:46 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on September 18, 2021 3:58 pm wrote:
> Jörn Engel (joern.delete@this.purestorage.com) on September 18, 2021 2:01 pm wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on September 17, 2021 7:48 am wrote:
> > > Jörn Engel (joern.delete@this.purestorage.com) on September 17, 2021 5:42 am wrote:
> > > >
> > > > It won't. Unaligned access is a solved problem on any CPU
> > > > that cares about performance. On Intel the difference
> > > > between vmovdqu and vmovdqa on aligned data is zero - both
> > > > instructions are equally fast. vmovdqu on unaligned
> > > > data is maybe 10% slower than on aligned data, not a big deal either.
> > >
> > > 256-bit form is only 10% slower for L1D hit? May be, taken individually. But in the tight
> > > loop, like in memcpy or integer variant of Stream Add, I'd expect it to be ~1.5x slower.
> >
> > Care to test your expectation? I tend to trust empirical results more than human expectations.
> >
> > Independent reproduction of my results:
> > https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/
>
> For 32-bit and 64-bit data elements I'd also expect small penalty. On today's CPU
> for 32-bit elements I'd expect *less* than 10%. Not so for 256-bit elements.
>
> As to doing my own microbenchmark, may be, tomorrow.
> It's a night here now.
>
So, I measured the time, in microseconds, to sum 8,000,000 L1D-resident 64-bit numbers (a 16,000-byte buffer, i.e. 2,000 numbers, summed 4,000 times) at different alignments and with different access/arithmetic widths. CPU: Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.
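For reference, the kind of inner loop I'm timing looks roughly like the sketch below. This is illustrative, not the exact code I ran; it shows only the 256-bit (vmovdqu) variant, with the timing and offset handling simplified. Compile with something like gcc -O2 -mavx2.

/* Illustrative sketch of the benchmark described above (not the exact code).
   Sums a 16,000-byte buffer 4,000 times with 256-bit unaligned loads,
   starting the loads at byte offsets 0..31 from a 64-byte-aligned base. */
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define BUF_BYTES 16000
#define REPS      4000

static uint64_t sum256(const char *p, size_t bytes)
{
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i + 32 <= bytes; i += 32) {
        /* unaligned 256-bit load (vmovdqu) */
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        acc = _mm256_add_epi64(acc, v);
    }
    uint64_t tmp[4];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

int main(void)
{
    /* 64-byte-aligned buffer with slack for up to 31 bytes of offset (GCC syntax) */
    static char buf[BUF_BYTES + 64] __attribute__((aligned(64)));

    for (int offs = 0; offs < 32; offs++) {
        uint64_t sum = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
            sum += sum256(buf + offs, BUF_BYTES);   /* keep the work observable */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("%2d %ld us (sum=%llu)\n", offs, us, (unsigned long long)sum);
    }
    return 0;
}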
Here are the results (first column: alignment offset in bytes, second column: time in microseconds):
8-byte (64b) accesses:
0 1064
1 1170
2 1170
3 1170
4 1170
5 1170
6 1170
7 1170
8 1071
9 1171
10 1171
11 1171
12 1171
13 1171
14 1171
15 1171
16 1067
17 1170
18 1170
19 1170
20 1170
21 1170
22 1170
23 1170
24 1065
25 1170
26 1170
27 1170
28 1170
29 1170
30 1170
31 1170
16-byte (128b) accesses:
0 483
1 701
2 701
3 701
4 701
5 701
6 701
7 701
8 701
9 701
10 701
11 702
12 701
13 701
14 701
15 701
16 483
17 702
18 702
19 702
20 702
21 702
22 701
23 702
24 701
25 702
26 702
27 702
28 702
29 701
30 702
31 701
32-byte (256b) accesses:
0 256
1 468
2 468
3 468
4 468
5 468
6 468
7 468
8 468
9 468
10 468
11 468
12 468
13 468
14 468
15 468
16 468
17 468
18 468
19 468
20 468
21 468
22 468
23 468
24 468
25 468
26 468
27 468
28 468
29 468
30 468
31 468
Misalignment penalty of the streaming add (worst misaligned time divided by best aligned time):
8-byte - 1.10x (1170/1064)
16-byte - 1.45x (701/483)
32-byte - 1.83x (468/256)
So, on this particular CPU the penalty is even bigger than I expected.
Quite possibly, on SKX with 512-bit accesses the penalty would be over 2x. Unfortunately, right now I have no access to an SKX machine.
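If someone with SKX hardware wants to check it, the 512-bit variant of the same summation loop would look roughly like the untested sketch below (requires -mavx512f; lane count and reduction are the only real differences from the 256-bit version):

/* Untested sketch of the 512-bit (zmm) version of the summation loop,
   analogous to the 256-bit one above; I have no SKX to run it on. */
#include <immintrin.h>
#include <stdint.h>

static uint64_t sum512(const char *p, size_t bytes)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i + 64 <= bytes; i += 64) {
        __m512i v = _mm512_loadu_si512((const void *)(p + i)); /* vmovdqu64 */
        acc = _mm512_add_epi64(acc, v);
    }
    return _mm512_reduce_add_epi64(acc); /* horizontal sum of the 8 lanes */
}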