By: rwessel (rwessel.delete@this.yahoo.com), September 30, 2021 5:20 pm
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on September 30, 2021 2:56 pm wrote:
> rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 10:39 am wrote:
> > Doug S (foo.delete@this.bar.bar) on September 30, 2021 9:48 am wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:22 pm wrote:
> > > > Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting
> > > > those conditions is trivial (if the destination address has any low bits set, do "first", if you've fallen
> > > > out of the "middle" loop, and the length is not zero, do a "last"). Internalizing that stuff would probably
> > > > make it easier to sneak up on page boundaries as well, at least for simpler implementations.
> > >
> > >
> > > Sure detecting that stuff is trivial, but they would effectively be three separate
> > > operations internally as you outline. So why not make that explicit and reduce
> > > the amount of state you have to carry when the operation is interrupted?
> >
> > If it leads to good memcpy() performance, with simple implementations, I'm all for whatever they've done.
> >
> > That being said, the separation is at least a bit artificial, and that presents at least a
> > few potential problem areas. First, the three instruction scheme makes optimizing fairly
> > short memcpy()s difficult - you'll have to execute all three instructions no matter what.
> >
> > At least for simple implementations, requiring the middle and final instruction operate on aligned
> > words (at least for the destination) poses some challenges around page boundaries. If nothing
> > else, having to store a full aligned word in every cycle will require that crossing a page boundary
> > be able to handle three page faults. If the state requirements were looser, the instruction could
> > step more delicately over a page boundary, eliminating the need to handle the third page fault.
> > A truly high end implementation may care about that less than a "medium" one.
> >
> > Also separating the start, middle and end instructions requires that they either architect the
> > state those store or use, or you'll have trouble migrating running code to cores that might
> > have different implementations (say from a big to a little core, to another core in a cluster,
> > or a VM migration). And if you architect them, you run the risk of fixing things like the effective
> > word size, which may impact future implementations (either by limiting their word sizes, or
> > by requiring their "middle" instruction to handle partially aligned operands).
>
>
> If an implementation wants to minimize issues with big core / little core interaction, e.g. have
> a larger width in the big core, then both cores will use the larger alignment. Doesn't cost
> much for the little cores to use a slightly more restrictive alignment than would otherwise be
> necessary for their narrower engine. That way in progress instructions interrupted on a core
> can be continued on a core of a different size without any special casing required.
>
> The alignment that a "middle" instruction expects to result from the completed execution of a "start"
> instruction is something the end user doesn't need to know or care about. It doesn't matter to me
> if I am doing a memory copy on a core that wants a 64 bit alignment to do copies in 64 bit hunks or
> 256 bit alignment to do copies in 256 bit hunks. Or wants a 256 bit alignment to do copies in 64 bit
> hunks (i.e. a little core on a machine with big cores that operate on 256 bit hunks). The three instructions
> handle that just as automatically as a single instruction would. Same for VM migration, the "start"
> instruction will create whatever alignment that the "middle" instruction expects.
>
> You could (theoretically) run the same ARMv9 binary on a watch SoC that maybe has only a 32
> bit wide memory bus or a supercomputer that has a 2048 bit wide memory bus, and the start/middle/finish
> instructions will do what is required on that hardware. The three instructions can handle the
> implementation details just as well as a single do it all instruction could.
Then the smaller implementation needs to be able to handle longer start and end moves than would be natural for it. Sure, you could do that, but it seems a bit painful: the start (and end) instructions would then basically be the merged instruction anyway. Migration across a cluster remains an issue, unless the little CPUs could have their minimum alignment adjusted by the OS.
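
To make the "longer start and end moves" point concrete, here is a rough C model of a three-phase copy. It is only a sketch under assumptions: split_copy() and copy_three_phase() are hypothetical names, the alignment is passed in as a parameter, and nothing here reflects how the actual instructions architect their state.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model only, not the real instructions: how a copy of
   `len` bytes splits into "start", "middle" and "end" pieces for a
   given destination alignment (in bytes, must be a power of two). */
struct copy_split {
    size_t head;  /* bytes moved before the destination is aligned */
    size_t body;  /* bytes moved as whole aligned words */
    size_t tail;  /* bytes left over after the last whole word */
};

static struct copy_split split_copy(uintptr_t dst, size_t len, size_t align)
{
    struct copy_split s;
    size_t misalign = (size_t)(dst & (align - 1));

    s.head = misalign ? align - misalign : 0;
    if (s.head > len)
        s.head = len;                       /* short copy: all "start" */
    s.body = (len - s.head) & ~(align - 1); /* whole words only */
    s.tail = len - s.head - s.body;
    return s;
}

/* Forward copy in three phases; assumes the buffers do not overlap. */
static void *copy_three_phase(void *dst, const void *src, size_t len,
                              size_t align)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    struct copy_split p = split_copy((uintptr_t)d, len, align);
    size_t i;

    for (i = 0; i < p.head; i++)            /* "start": byte at a time */
        *d++ = *s++;
    for (i = 0; i < p.body; i += align) {   /* "middle": aligned words */
        memcpy(d, s, align);
        d += align;
        s += align;
    }
    for (i = 0; i < p.tail; i++)            /* "end": byte at a time */
        *d++ = *s++;
    return dst;
}
```

For a 100 byte copy to a destination ending in 0x3, an 8 byte alignment gives a split of 5/88/7, while a 32 byte alignment gives 29/64/7. So a core whose natural width is 8 bytes, but which is forced to the 32 byte alignment, ends up doing multi-word work in its "start" phase, which is more or less what a merged instruction would be doing anyway.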