By: Doug S (foo.delete@this.bar.bar), September 30, 2021 2:56 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 10:39 am wrote:
> Doug S (foo.delete@this.bar.bar) on September 30, 2021 9:48 am wrote:
> > rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:22 pm wrote:
> > > Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting
> > > those conditions is trivial (if the destination address has any low bits set, do "first", if you've fallen
> > > out of the "middle" loop, and the length is not zero, do
> > > a "last"). Internalizing that stuff would probably
> > > make it easier to sneak up on page boundaries as well, at least for simpler implementations.
> >
> >
> > Sure detecting that stuff is trivial, but they would effectively be three separate
> > operations internally as you outline. So why not make that explicit and reduce
> > the amount of state you have to carry when the operation is interrupted?
>
> If it leads to good memcpy() performance, with simple implementations, I'm all for whatever they've done.
>
> That being said, the separation is at least a bit artificial, and that presents at least a
> few potential problem areas. First, the three instruction scheme makes optimizing fairly
> short memcpy()s difficult - you'll have to execute all three instructions no matter what.
>
> At least for simple implementations, requiring the middle and final instruction operate on aligned
> words (at least for the destination) poses some challenges around page boundaries. If nothing
> else, having to store a full aligned word in every cycle will require that crossing a page boundary
> be able to handle three page faults. If the state requirements were looser, the instruction could
> step more delicately over a page boundary, eliminating the need to handle the third page fault.
> A truly high end implementation may care about that less than a "medium" one.
>
> Also separating the start, middle and end instructions requires that they either architect the
> state those store or use, or you'll have trouble migrating running code to cores that might
> have different implementations (say from a big to a little core, to another core in a cluster,
> or a VM migration). And if you architect them, you run the risk of fixing things like the effective
> word size, which may impact future implementations (either by limiting their word sizes, or
> by requiring their "middle" instruction to handle partially aligned operands).
If an implementation wants to minimize issues with big core / little core interaction, e.g. where the big core has a wider copy engine, then both cores will use the larger alignment. It doesn't cost much for the little cores to use a slightly more restrictive alignment than would otherwise be necessary for their narrower engine. That way an in-progress instruction interrupted on one core can be continued on a core of a different size without any special casing required.
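To make that concrete, here is a rough software model of the idea, not a description of what any real core does. Everything in it is an assumption for illustration: the `CopyState` fields, the 32-byte (256-bit) shared `GRANULE`, the `width` parameter standing in for the engine width, and the rule that the "middle" phase only reports progress at granule boundaries. With those assumptions, the state left behind by a narrow engine is directly usable by a wide one:

```python
from dataclasses import dataclass

GRANULE = 32  # bytes; hypothetical alignment (256 bits) shared by big and little cores


@dataclass
class CopyState:
    """Hypothetical resumable state carried across an interrupt or migration."""
    dst: int        # next destination offset
    src: int        # next source offset
    remaining: int  # bytes still to copy


def _advance(st, k):
    st.dst += k
    st.src += k
    st.remaining -= k


def start(src, dst, st):
    """Copy single bytes until the destination sits on a GRANULE boundary."""
    while st.remaining and st.dst % GRANULE:
        dst[st.dst] = src[st.src]
        _advance(st, 1)


def middle(src, dst, st, width, max_granules):
    """Copy whole granules using `width`-byte moves.

    Stops after `max_granules` granules to imitate an interrupt. Progress is
    only recorded per granule, so the destination stays GRANULE-aligned.
    """
    assert GRANULE % width == 0
    done = 0
    while st.remaining >= GRANULE and done < max_granules:
        for off in range(0, GRANULE, width):
            d, s = st.dst + off, st.src + off
            dst[d:d + width] = src[s:s + width]
        _advance(st, GRANULE)
        done += 1


def finish(src, dst, st):
    """Copy whatever is left after the last whole granule."""
    while st.remaining:
        dst[st.dst] = src[st.src]
        _advance(st, 1)


def copy_with_migration(src, dst, dst_off, n):
    st = CopyState(dst=dst_off, src=0, remaining=n)
    start(src, dst, st)
    middle(src, dst, st, width=8, max_granules=1)        # "little" core, interrupted
    middle(src, dst, st, width=32, max_granules=10**9)   # resumed on a "big" core
    finish(src, dst, st)
    return st
```

The little core here pays for the stricter alignment only in the "start" phase (up to 31 single-byte steps instead of up to 7), which is the "doesn't cost much" part of the argument.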
The alignment that a "middle" instruction expects to result from the completed execution of a "start" instruction is something the end user doesn't need to know or care about. It doesn't matter to me if I am doing a memory copy on a core that wants 64-bit alignment to do copies in 64-bit hunks, or 256-bit alignment to do copies in 256-bit hunks, or 256-bit alignment to do copies in 64-bit hunks (i.e. a little core on a machine with big cores that operate on 256-bit hunks). The three instructions handle that just as automatically as a single instruction would. Same for VM migration: the "start" instruction will create whatever alignment the "middle" instruction expects.
You could (theoretically) run the same ARMv9 binary on a watch SoC that maybe has only a 32-bit wide memory bus or a supercomputer that has a 2048-bit wide memory bus, and the start/middle/finish instructions will do what is required on that hardware. The three instructions can handle the implementation details just as well as a single do-it-all instruction could.
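As a worked example of why the caller never needs to know the width, here is a small sketch of how the same copy request could divide into the three phases for different engine widths. The function name `split` and the exact division rule are assumptions for illustration; a real implementation is free to divide the work differently.

```python
def split(dst_addr, n, width):
    """Return (head, body, tail) byte counts for a hypothetical 3-phase copy.

    head: bytes the "start" phase moves to align the destination to `width`
    body: bytes the "middle" phase moves in whole `width`-byte chunks
    tail: bytes left over for the "finish" phase
    """
    head = min(n, -dst_addr % width)
    body = (n - head) // width * width
    tail = n - head - body
    return head, body, tail


# Same request (1000 bytes to destination 0x1005), four hypothetical widths in bytes:
for width in (4, 8, 32, 256):
    print(width, split(0x1005, 1000, width))
# 4 (3, 996, 1)
# 8 (3, 992, 5)
# 32 (27, 960, 13)
# 256 (251, 512, 237)
```

The three counts always sum to the requested length, so the program issuing start/middle/finish sees the same result regardless of how the hardware chose to divide it. A short copy, say `split(0x1005, 2, 8)`, gives `(2, 0, 0)`: the middle and finish phases simply have nothing to do.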