By: Doug S (foo.delete@this.bar.bar), September 19, 2021 6:07 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on September 19, 2021 1:06 pm wrote:
> The Exclamation means the register updates, and I think F means Forward.
> My guess is the middle instruction does vector aligned copies, but
> that can change per CPU design, so you always need all three?
The question is, "aligned with what"? Everything you could align to is implementation-specific. Widest vector register? That differs between SVE2 implementations that handle the minimum 128 bits natively and others that handle 256, 512 or more natively. Maximum width the load/store unit can sustain? Width of a cache line, and if so, at which level? Even page alignment wouldn't be good enough, since page size is again implementation-specific.
As dmcq points out, you don't want to create a dependency on cache line width, or on the width of anything else an implementation could possibly use. The only alternative would be to dumb it down and say "128 bits", which would unnecessarily limit implementations that already can, or will want to, do better.
The easiest way to avoid all this is to REQUIRE that all three instructions be used in all cases. The implementation can treat the first and/or third instruction as a no-op if things are already aligned, but that way the code doesn't need to special-case anything, and it doesn't limit the ability of future implementations to make different alignment decisions.
Worst case, it costs you 8 bytes of unnecessary instructions (two 4-byte instructions) that get treated as no-ops.
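To make the argument concrete, here is a rough C model of the idea, not the actual instruction semantics. Everything in it is my own assumption for illustration: the `copy_state` struct, the function names, and the `impl_width` parameter, which stands in for whatever alignment a given implementation privately prefers. The point is only that the caller always issues the same three steps and never needs to know that width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the state the three steps share:
   destination, source and bytes remaining, all updated in place. */
typedef struct {
    unsigned char *dst;
    const unsigned char *src;
    size_t n;
} copy_state;

/* Step 1: copy bytes until dst is aligned to `width` (a power of two).
   Copies nothing if dst is already aligned. Returns bytes copied. */
static size_t copy_prologue(copy_state *s, size_t width) {
    size_t mis = (size_t)((uintptr_t)s->dst & (width - 1));
    size_t head = mis ? width - mis : 0;
    if (head > s->n) head = s->n;
    memcpy(s->dst, s->src, head);
    s->dst += head; s->src += head; s->n -= head;
    return head;
}

/* Step 2: copy as many whole `width`-sized chunks as remain. */
static size_t copy_main(copy_state *s, size_t width) {
    size_t body = s->n & ~(width - 1);
    memcpy(s->dst, s->src, body);
    s->dst += body; s->src += body; s->n -= body;
    return body;
}

/* Step 3: copy whatever is left. Copies nothing if the length
   happened to end on a chunk boundary. */
static size_t copy_epilogue(copy_state *s) {
    size_t tail = s->n;
    memcpy(s->dst, s->src, tail);
    s->dst += tail; s->src += tail; s->n = 0;
    return tail;
}

/* The portable caller: always the same three steps. `impl_width` is
   passed in here only so the model can simulate different hardware;
   real code would not see it at all. */
static void copy_all(void *dst, const void *src, size_t n, size_t impl_width) {
    copy_state s = { dst, src, n };
    copy_prologue(&s, impl_width);
    copy_main(&s, impl_width);
    copy_epilogue(&s);
    assert(s.n == 0); /* all bytes accounted for, whatever the width */
}
```

Calling `copy_all` with `impl_width` set to 16 or to 64 gives the same bytes in the destination; only the split between the three steps changes, which is exactly the part the code should not have to care about.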