By: Doug S (foo.delete@this.bar.bar), September 28, 2021 1:57 pm
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on September 28, 2021 9:54 am wrote:
> Doug S (foo.delete@this.bar.bar) on September 19, 2021 6:07 pm wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on September 19, 2021 1:06 pm wrote:
> > > The Exclamation means the register updates, and I think F means Forward.
> > > My guess is the middle instruction does vector aligned copies, but
> > > that can change per CPU design, so you always need all three?
> >
> >
> > The question is, "aligned with what"? Everything you could
> > use is implementation specific. Widest vector register?
> > Different for SVE2 implementations that handle the minimum 128 bits natively versus others that handle 256,
> > 512 or more natively. Maximum width the load/store unit
> > can sustain? Width of a cache line - and if so, what
> > level? Even page alignment wouldn't be good enough, since again that's implementation specific.
> >
> > As dmcq points out, you don't want to create a dependency on cache line width. Or the width
> > of anything they could possibly use, unless they dumb it down and say "128 bits", which would
> > unnecessarily limit implementations that already could or will want to do better.
> >
> > The easiest way to avoid all this is to REQUIRE all three instructions be used in all
> > cases. The implementation can treat the first and/or third instruction as a no-op if
> > things are aligned but that way the code doesn't need to special case anything or limit
> > the ability of future implementations to make different alignment decisions.
> >
> > Worst case it costs you 8 bytes of unnecessary instructions being treated as a no-op.
>
> Having looked at this again, I think they should not have pre and post operations, just have a single
> operation for move for instance. The instruction should be interruptible and restarted if not complete.
> To achieve the effect of the three different operations they could use up some bits in the program state
> - these should be zero normally but could be set to various other values depending on an analysis of the
> move. For instance if there is overlap and the move should be done backwards then a bit could be set to
> indicate that. The bits should be set to zero again when the whole operation finishes. The ARM Thumb mode
> already does that for the IT instruction which does predication of a couple of following instructions.
> Not the prettiest RISC facility but much better than sticking in three instructions I think.
What downside do you think this avoids, beyond not having to issue two more instructions?
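
To make the "always issue all three" argument concrete, here is a rough C model of how I read the prologue/main/epilogue split. It is only a sketch under my own assumptions: the names copy_state, copy_prologue, copy_main, copy_epilogue and copy3 are made up for illustration, the align parameter stands in for whatever width a given implementation prefers, and it only handles the forward, non-overlapping case. It is not how any real core implements these instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: the three "registers" each step reads and writes back,
   mirroring the "!" writeback in the mnemonics. */
typedef struct {
    unsigned char *dst;
    const unsigned char *src;
    size_t n;               /* bytes still to copy */
} copy_state;

/* Shared helper: copy len bytes and advance the state. */
static void copy_advance(copy_state *s, size_t len) {
    memcpy(s->dst, s->src, len);
    s->dst += len;
    s->src += len;
    s->n -= len;
}

/* Prologue: copy bytes until dst reaches the implementation-chosen alignment.
   align must be nonzero. Copies nothing (a no-op) if dst is already aligned. */
void copy_prologue(copy_state *s, size_t align) {
    size_t head = (align - ((uintptr_t)s->dst % align)) % align;
    if (head > s->n)
        head = s->n;
    copy_advance(s, head);
}

/* Main: copy as many whole align-sized chunks as remain. */
void copy_main(copy_state *s, size_t align) {
    copy_advance(s, s->n - (s->n % align));
}

/* Epilogue: copy the tail. Copies nothing if the length was a multiple of align. */
void copy_epilogue(copy_state *s) {
    copy_advance(s, s->n);
}

/* The caller always issues all three steps and never needs to know align. */
void copy3(void *dst, const void *src, size_t n, size_t align) {
    copy_state s = { (unsigned char *)dst, (const unsigned char *)src, n };
    copy_prologue(&s, align);
    copy_main(&s, align);
    copy_epilogue(&s);
}
```

The point is that the sequence in copy3 is identical whether align is 16, 64 or anything else, so the code never bakes in an implementation detail, and the first and third steps simply do nothing when there is no head or tail to handle.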