By: -.- (blarg.delete@this.mailinator.com), September 24, 2021 6:10 pm
Room: Moderated Discussions
Andrey (andrey.semashev.delete@this.gmail.com) on September 24, 2021 8:29 am wrote:
> Should SVE prohibit 136-bit
> vectors? No, because your cache line and page sizes may be a multiple of 17. In such an implementation,
> 136-bit vectors would be a reasonable choice. But in the power-of-2 world we live in, 136-bit vectors
> are nonsensical (possibly, aside from some very special purpose hardware).
Well, from what I can gather, it sounds like you're suggesting that non-power-of-2 hardware may, in fact, appear at some point. Which is how I interpreted the SVE spec - it considers such hardware sensible enough to allow for it.
> By default - nothing. I mean, so long as you're targeting an unknown abstract implementation,
> you may as well forget about alignment, instruction choices and scheduling and other micro-optimizations
> and just write the unaligned loop that maintains correctness.
Which is what I've been trying to convey the whole time, really. The way SVE is designed encourages developers not to bother with alignment at all.
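To illustrate what that generic, alignment-agnostic pattern looks like, here's a minimal sketch using the ACLE SVE intrinsics (the function and array names are just placeholders, not anything from the spec):

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Generic element-wise add: no alignment handling, no separate tail loop.
   The whilelt predicate covers the final partial vector, and svcntw()
   adapts to whatever vector length the hardware happens to have. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_m(pg, va, vb));
    }
}

Note the loop never asks how wide the vectors are beyond svcntw(), and never cares how the pointers are aligned - which is exactly the behaviour the ISA design nudges you towards.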
If, on the other hand, SVE only allowed power-of-2 widths, a developer could just always align to the vector width in a generic implementation. Perhaps there could be some unusual hardware configuration where that isn't optimal, but I think it'd at least be very close to optimal on the vast majority of hardware.
Of course, this doesn't prevent you from special-casing scenarios - for example, deciding to align only if the vector width is a power of 2. But that's about as far as you can go until we actually get to see eccentric hardware and learn how to deal with it.
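As a rough sketch of that kind of special case (purely illustrative - the helpers are mine, not anything from the SVE spec):

#include <arm_sve.h>
#include <stdint.h>
#include <stddef.h>

/* Align only when the runtime vector length is a power of two;
   otherwise fall back to the plain unaligned loop. */
static int vl_is_pow2(void)
{
    uint64_t vl = svcntb();               /* vector length in bytes */
    return (vl & (vl - 1)) == 0;
}

/* Bytes of scalar/predicated prologue needed to reach vector alignment.
   Only meaningful when vl_is_pow2() holds. */
static size_t bytes_to_alignment(const void *p)
{
    uint64_t vl = svcntb();
    return (size_t)(-(uintptr_t)p & (vl - 1));
}

A generic kernel could run a short prologue of bytes_to_alignment(dst) bytes when vl_is_pow2() is true and skip the whole exercise otherwise - but, as said, anything smarter than that is guesswork until the eccentric hardware actually exists.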
(side note: SVE does allow software to adjust the vector length downward, so 512-bit hardware can be configured to run in 384-bit mode)
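On Linux that knob is exposed per thread through prctl(PR_SVE_SET_VL). The kernel rounds the request down to a length the implementation actually supports, so a 384-bit mode only materialises if the hardware offers that length. A minimal sketch, assuming the Linux SVE prctl interface:

#include <sys/prctl.h>
#include <stdio.h>

int main(void)
{
    /* Request a 48-byte (384-bit) SVE vector length for this thread.
       The kernel clamps the request to the nearest supported length
       at or below 48 bytes. */
    int ret = prctl(PR_SVE_SET_VL, 48);
    if (ret < 0)
        perror("PR_SVE_SET_VL");
    else
        printf("effective VL config: %d bytes\n", ret & PR_SVE_VL_LEN_MASK);
    return 0;
}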
> Yes, heterogeneous cores are a pain, and it looks like x86 will follow suit in the near future.
So far it's Intel only, and I feel the core types on Alder Lake aren't different enough for it to matter much (to use your example of MOVDQU/PSHUFB). Of course, this could change in the future.
On the other hand, ARM designs typically mix in-order and OoO cores together. Whilst I've never really found a need to target specific cores yet, my point was that it's generally more difficult to do in the ARM ecosystem.
Considering the added difficulty of targeting specific processors on ARM, I get the feeling that supporting it isn't a priority. I don't really think that's a bad idea - the vast majority of programs aren't going to optimise to that extent - but it does feed into the idea of encouraging generic implementations.
> So, having a CPU with half of
> the cores having 128-bit vectors and the other half - 136-bit ones would be a terrible idea because
> it would perform worse on the existing code (more precisely, half of the cores would perform worse).
Well, neither x86 nor ARM/SVE allows configurations where cores have differing vector widths, so that really isn't a concern.
> Side note about hybrid designs where different cores are radically different, like x86+ARM that AMD did.
> That sort of combination is a somewhat different story. The cores in such a hybrid are inherently incompatible
I've never heard of such a configuration. The closest thing I've heard of is AMD's K12 core being designed alongside Zen, which was planned to be socket-compatible, but that doesn't mean you can run the two together.
Am I missing something?