By: dmcq (dmcq.delete@this.fano.co.uk), September 28, 2021 9:54 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on September 19, 2021 6:07 pm wrote:
> Brett (ggtgp.delete@this.yahoo.com) on September 19, 2021 1:06 pm wrote:
> > The Exclamation means the register updates, and I think F means Forward.
> > My guess is the middle instruction does vector aligned copies, but
> > that can change per CPU design, so you always need all three?
>
>
> The question is, "aligned with what"? Everything you could use is implementation specific. Widest vector register?
> Different for SVE2 implementations that handle the minimum 128 bits natively versus others that handle 256,
> 512 or more natively. Maximum width the load/store unit can sustain? Width of a cache line - and if so, what
> level? Even page alignment wouldn't be good enough, since again that's implementation specific.
>
> As dmcq points out, you don't want to create a dependency on cache line width. Or the width
> of anything they could possibly use, unless they dumb it down and say "128 bits", which would
> unnecessarily limit implementations that already could or will want to do better.
>
> The easiest way to avoid all this is to REQUIRE all three instructions be used in all
> cases. The implementation can treat the first and/or third instruction as a no-op if
> things are aligned but that way the code doesn't need to special case anything or limit
> the ability of future implementations to make different alignment decisions.
>
> Worst case it costs you 8 bytes of unnecessary instructions being treated as a no-op.
Having looked at this again, I think they should not have separate pre and post operations at all, just a single instruction for, say, a move. The instruction should be interruptible and restartable if it has not completed. To get the effect of the three different operations they could use some bits in the program state: these would normally be zero but could be set to other values depending on an analysis of the move. For instance, if the regions overlap and the move has to be done backwards, a bit could be set to indicate that. The bits would be cleared again when the whole operation finishes. ARM's Thumb instruction set already does something like this for the IT instruction, which predicates up to four following instructions and keeps its progress in state bits. Not the prettiest RISC facility, but much better than sticking in three instructions, I think.
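To make the idea concrete, here is a rough Python model of what I have in mind. It is only a sketch under my own assumptions, not how any real ARM instruction behaves: the names Cpu, step_move, move_active and move_backward are made up for illustration, and "budget" just stands in for an interrupt arriving after so many bytes.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    mem: bytearray
    dst: int = 0    # destination register, updated as the move progresses
    src: int = 0    # source register, updated as the move progresses
    count: int = 0  # bytes still to move
    # Hypothetical program-state bits: both zero when no move is in flight.
    move_active: bool = False
    move_backward: bool = False


def step_move(cpu: Cpu, budget: int) -> bool:
    """Run the single move instruction for at most `budget` bytes.

    Returns True if the move completed, False if it was interrupted and
    has to be restarted by executing the same instruction again.
    """
    if not cpu.move_active:
        # First entry only: analyse the move and record the result in
        # the state bits so a restart does not repeat the analysis.
        cpu.move_backward = cpu.src < cpu.dst < cpu.src + cpu.count
        cpu.move_active = True

    while cpu.count > 0:
        if budget == 0:
            return False  # interrupted; registers and bits hold the progress
        if cpu.move_backward:
            # Copy from the top down; pointers stay put, count shrinks.
            cpu.count -= 1
            cpu.mem[cpu.dst + cpu.count] = cpu.mem[cpu.src + cpu.count]
        else:
            # Copy from the bottom up; pointers advance, count shrinks.
            cpu.mem[cpu.dst] = cpu.mem[cpu.src]
            cpu.dst += 1
            cpu.src += 1
            cpu.count -= 1
        budget -= 1

    # Whole operation finished: clear the state bits again.
    cpu.move_active = False
    cpu.move_backward = False
    return True
```

The caller just re-executes the same instruction until it reports completion, so the code never needs a prologue or epilogue instruction; everything needed to resume is in the three registers and the two state bits.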