By: Michael S (already5chosen.delete@this.yahoo.com), October 1, 2021 5:04 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on October 1, 2021 4:38 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 5:20 pm wrote:
> > Doug S (foo.delete@this.bar.bar) on September 30, 2021 2:56 pm wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 10:39 am wrote:
> > > > Doug S (foo.delete@this.bar.bar) on September 30, 2021 9:48 am wrote:
> > > > > rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:22 pm wrote:
> > > > > > Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting
> > > > > > those conditions is trivial (if the destination address has any low bits set, do "first", if you've fallen
> > > > > > out of the "middle" loop, and the length is not zero, do
> > > > > > a "last"). Internalizing that stuff would probably
> > > > > > make it easier to sneak up on page boundaries as well, at least for simpler implementations.
> > > > >
> > > > >
> > > > > Sure detecting that stuff is trivial, but they would effectively be three separate
> > > > > operations internally as you outline. So why not make that explicit and reduce
> > > > > the amount of state you have to carry when the operation is interrupted?
> > > >
> > > > If it leads to good memcpy() performance, with simple implementations, I'm all for whatever they've done.
> > > >
> > > > That being said, the separation is at least a bit artificial, and that presents at least a
> > > > few potential problem areas. First, the three instruction scheme makes optimizing fairly
> > > > short memcpy()s difficult - you'll have to execute all three instructions no matter what.
> > > >
> > > > At least for simple implementations, requiring the middle and final instruction operate on aligned
> > > > words (at least for the destination) poses some challenges around page boundaries. If nothing
> > > > else, having to store a full aligned word in every cycle will require that crossing a page boundary
> > > > be able to handle three page faults. If the state requirements were looser, the instruction could
> > > > step more delicately over a page boundary, eliminating the need to handle the third page fault.
> > > > A truly high end implementation may care about that less than a "medium" one.
> > > >
> > > > Also separating the start, middle and end instructions requires that they either architect the
> > > > state those store or use, or you'll have trouble migrating running code to cores that might
> > > > have different implementations (say from a big to a little core, to another core in a cluster,
> > > > or a VM migration). And if you architect them, you run the risk of fixing things like the effective
> > > > word size, which may impact future implementations (either by limiting their word sizes, or
> > > > by requiring their "middle" instruction to handle partially aligned operands).
> > >
> > >
> > > If an implementation wants to minimize issues with big core / little core interaction, e.g. have
> > > a larger width in the big core, then both cores will use the larger alignment. Doesn't cost
> > > much for the little cores to use a slightly more restrictive alignment than would otherwise be
> > > necessary for their narrower engine. That way in progress instructions interrupted on a core
> > > can be continued on a core of a different size without any special casing required.
> > >
> > > The alignment that a "middle" instruction expects to result from the completed execution of a "start"
> > > instruction is something the end user doesn't need to know or care about. It doesn't matter to me
> > > if I am doing a memory copy on a core that wants a 64 bit alignment to do copies in 64 bit hunks or
> > > 256 bit alignment to do copies in 256 bit hunks. Or wants a 256 bit alignment to do copies in 64 bit
> > > hunks (i.e. a little core on a machine with big cores that operate on 256 bit hunks). The three instructions
> > > handle that just as automatically as a single instruction would. Same for VM migration, the "start"
> > > instruction will create whatever alignment that the "middle" instruction expects.
> > >
> > > You could (theoretically) run the same ARMv9 binary on a watch SoC that maybe has only a 32
> > > bit wide memory bus or a supercomputer that has a 2048 bit wide memory bus, and the start/middle/finish
> > > instructions will do what is required on that hardware. The three instructions can handle the
> > > implementation details just as well as a single do it all instruction could.
> >
> >
> > Then the smaller implementation needs to be able to handle longer start and end moves
> > than would be natural. Sure you could, but it seems a bit painful - begin (and end)
> > would then basically be the merged instructions anyway. Across a cluster remains an
> > issue, unless any little CPUs could have their minimum alignment adjusted by the OS.
>
> It's a real conundrum okay. It isn't at all obvious what on earth they intend to do.
>
> My latest theory is that all three instructions form a unit and if there is an interrupt the restart
> is at the first instruction.
No chance.
> The reason for having three instructions is because they update three
> registers and want to do all the register allocation work easily in the decode stage. There might
> be some work actually associated with the three operations as they go down the pipeline, if so the
> first would simply analyze the registers to decide what needs to be done, the second would iterate
> doing a move and be interruptible. And when the second ends or is interrupted the third updates the
> registers and sets the interrupt point to the first instruction in the case of an interrupt.
It has to be something very prosaic.
Like, the first instruction brings the destination up to a [coarsely] aligned boundary, etc.
Each instruction updates all three registers, exactly like 'rep movs'.
Very likely, only the middle instruction is interruptible/restartable.
The only interesting question is whether the alignment boundary is architected to be 512-bit or is implementation-defined with a very small set of legal choices.
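
To make the "prosaic" reading concrete, here is a rough C model of it. This is only a sketch of my guess, not the architected behaviour: the copy_state struct, the copy_first/copy_middle/copy_last names, the 64-byte GRANULE and the 'budget' parameter (standing in for an interrupt) are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the three registers that every step updates,
   in the spirit of 'rep movs'. */
typedef struct {
    unsigned char *dst;
    const unsigned char *src;
    size_t n;
} copy_state;

/* Assumed alignment granule in bytes (64 bytes = 512 bits). Whether this
   is architected or implementation-defined is exactly the open question. */
enum { GRANULE = 64 };

/* "First": copy bytes until dst reaches a GRANULE boundary or n runs out. */
static void copy_first(copy_state *s)
{
    size_t head = (size_t)(-(uintptr_t)s->dst & (GRANULE - 1));
    if (head > s->n)
        head = s->n;
    for (size_t i = 0; i < head; i++)
        s->dst[i] = s->src[i];
    s->dst += head;
    s->src += head;
    s->n -= head;
}

/* "Middle": copy whole granules to an aligned destination. 'budget' models
   an interrupt arriving after that many granules; because the state lives
   in the three registers, restarting is just running the step again. */
static void copy_middle(copy_state *s, size_t budget)
{
    /* What "first" is assumed to guarantee. */
    assert(s->n == 0 || (uintptr_t)s->dst % GRANULE == 0);
    while (s->n >= GRANULE && budget > 0) {
        for (size_t i = 0; i < GRANULE; i++)
            s->dst[i] = s->src[i];
        s->dst += GRANULE;
        s->src += GRANULE;
        s->n -= GRANULE;
        budget--;
    }
}

/* "Last": copy whatever tail is left (fewer than GRANULE bytes if the
   middle step ran to completion). */
static void copy_last(copy_state *s)
{
    for (size_t i = 0; i < s->n; i++)
        s->dst[i] = s->src[i];
    s->dst += s->n;
    s->src += s->n;
    s->n = 0;
}
```

Calling copy_first, then copy_middle with a small budget, then copy_middle again with a large one, then copy_last gives the same result as an uninterrupted run, which is the whole point of keeping the progress in the registers.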
Also, I have very little doubt that in the overlapped case the semantics of the middle instruction are *not* those of memmove().
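
For anybody wondering why that matters, a small C illustration. forward_copy below is a hypothetical stand-in for any copy that only ever walks upwards; it copies byte by byte, which a real implementation working in wide granules would not do, so the details of the damage would differ, but the principle is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical forward-only copy: always low addresses first. */
static void forward_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Copies within a 16-byte buffer with dst = src + shift, once with
   forward_copy and once with memmove, and returns 1 if the results agree. */
static int forward_equals_memmove(int shift)
{
    unsigned char a[16], b[16];
    assert(shift > -16 && shift < 16);
    for (int i = 0; i < 16; i++)
        a[i] = b[i] = (unsigned char)i;
    if (shift >= 0) {
        /* dst above src: a forward copy overwrites source bytes before reading them */
        forward_copy(a + shift, a, (size_t)(16 - shift));
        memmove(b + shift, b, (size_t)(16 - shift));
    } else {
        /* dst below src: a forward copy is safe */
        forward_copy(a, a - shift, (size_t)(16 + shift));
        memmove(b, b - shift, (size_t)(16 + shift));
    }
    return memcmp(a, b, 16) == 0;
}
```

With the destination below the source (negative shift) the two agree; with the destination a few bytes above the source they do not, because the forward copy keeps re-reading bytes it has already overwritten. memmove() avoids that by picking the direction, and that is the part I would not expect from the middle instruction.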