By: rwessel (rwessel.delete@this.yahoo.com), October 1, 2021 5:10 am
Room: Moderated Discussions
dmcq (dmcq.delete@this.fano.co.uk) on October 1, 2021 4:38 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 5:20 pm wrote:
> > Doug S (foo.delete@this.bar.bar) on September 30, 2021 2:56 pm wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on September 30, 2021 10:39 am wrote:
> > > > Doug S (foo.delete@this.bar.bar) on September 30, 2021 9:48 am wrote:
> > > > > rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:22 pm wrote:
> > > > > > Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting
> > > > > > those conditions is trivial (if the destination address has any low bits set, do "first", if you've fallen
> > > > > > out of the "middle" loop, and the length is not zero, do
> > > > > > a "last"). Internalizing that stuff would probably
> > > > > > make it easier to sneak up on page boundaries as well, at least for simpler implementations.
> > > > >
> > > > >
> > > > > Sure detecting that stuff is trivial, but they would effectively be three separate
> > > > > operations internally as you outline. So why not make that explicit and reduce
> > > > > the amount of state you have to carry when the operation is interrupted?
> > > >
> > > > If it leads to good memcpy() performance, with simple implementations, I'm all for whatever they've done.
> > > >
> > > > That being said, the separation is at least a bit artificial, and that presents at least a
> > > > few potential problem areas. First, the three instruction scheme makes optimizing fairly
> > > > short memcpy()s difficult - you'll have to execute all three instructions no matter what.
> > > >
> > > > At least for simple implementations, requiring the middle and final instruction operate on aligned
> > > > words (at least for the destination) poses some challenges around page boundaries. If nothing
> > > > else, having to store a full aligned word in every cycle will require that crossing a page boundary
> > > > be able to handle three page faults. If the state requirements were looser, the instruction could
> > > > step more delicately over a page boundary, eliminating the need to handle the third page fault.
> > > > A truly high end implementation may care about that less than a "medium" one.
> > > >
> > > > Also separating the start, middle and end instructions requires that they either architect the
> > > > state those store or use, or you'll have trouble migrating running code to cores that might
> > > > have different implementations (say from a big to a little core, to another core in a cluster,
> > > > or a VM migration). And if you architect them, you run the risk of fixing things like the effective
> > > > word size, which may impact future implementations (either by limiting their word sizes, or
> > > > by requiring their "middle" instruction to handle partially aligned operands).
> > >
> > >
> > > If an implementation wants to minimize issues with big core / little core interaction, e.g. have
> > > a larger width in the big core, then both cores will use the larger alignment. Doesn't cost
> > > much for the little cores to use a slightly more restrictive alignment than would otherwise be
> > > necessary for their narrower engine. That way in progress instructions interrupted on a core
> > > can be continued on a core of a different size without any special casing required.
> > >
> > > The alignment that a "middle" instruction expects to result from the completed execution of a "start"
> > > instruction is something the end user doesn't need to know or care about. It doesn't matter to me
> > > if I am doing a memory copy on a core that wants a 64 bit alignment to do copies in 64 bit hunks or
> > > 256 bit alignment to do copies in 256 bit hunks. Or wants a 256 bit alignment to do copies in 64 bit
> > > hunks (i.e. a little core on a machine with big cores that operate on 256 bit hunks). The three instructions
> > > handle that just as automatically as a single instruction would. Same for VM migration, the "start"
> > > instruction will create whatever alignment that the "middle" instruction expects.
> > >
> > > You could (theoretically) run the same ARMv9 binary on a watch SoC that maybe has only a 32
> > > bit wide memory bus or a supercomputer that has a 2048 bit wide memory bus, and the start/middle/finish
> > > instructions will do what is required on that hardware. The three instructions can handle the
> > > implementation details just as well as a single do it all instruction could.
> >
> >
> > Then the smaller implementation needs to be able to handle longer start and end moves
> > than would be natural. Sure you could, but it seems a bit painful - begin (and end)
> > would then basically be the merged instructions anyway. Across a cluster remains an
> > issue, unless any little CPUs could have their minimum alignment adjusted by the OS.
>
> It's a real conundrum okay. It isn't at all obvious what on earth they intend to do.
>
> My latest theory is that all three instructions form a unit and if there is an interrupt the restart
> is at the first instruction. The reason for having three instructions is because they update three
> registers and want to do all the register allocation work easily in the decode stage. There might
> be some work actually associated with the three operations as they go down the pipeline, if so the
> first would simply analyze the registers to decide what needs to be done, the second would iterate
> doing a move and be interruptable. And when the second ends or is interrupted the third updates the
> registers and sets the interrupt point to the first instruction in the case of an interrupt.
Always going back to the preconditioning instruction after any interrupt would deal with those problems, but it introduces a few of its own. The minor one is the extra overhead of re-executing the preconditioning instruction after every interrupt. Another is that an interrupt between the move and finalize instructions, not just an interruption *of* the move instruction, would also need to go back to precondition (on the assumption that finalize moves the last partial word), since a move to a core with larger alignment requirements at that point would need additional work. Perhaps that could be dealt with by not allowing an interruption between move and finalize. A similar problem exists between precondition and move, but that one could be dealt with by allowing move to generate an interrupt if the incoming registers aren't actually aligned properly.
There is precedent for not allowing interrupts between certain pairs of instructions (x86 moves to segment registers and the following instruction, for example), but this case would be a bit weird: interruption would be allowed during the move instruction, but not between precondition and move or between move and finalize. And even that brings issues - what if the sequence spans a page boundary?
More serious is that you now have to define what the heck "back up a few instructions" actually means, especially if the instructions don't have to be sequential. And if they are required to be sequential, haven't they just made this a variable length ISA by defining a 96-bit instruction? Either that, or they've just re-invented branch-delay slots.
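To make the restart-at-precondition idea concrete, here is a rough Python model of it. Everything in it is my own assumption for illustration, not anything ARM has specified: the names (CopyRegs, precondition, move, finalize, run_from_precondition), the idea that precondition aligns only the destination, and the choice to have move and finalize fault when the registers don't suit the core they are running on. Memory is just a bytearray and "align" stands in for a core's word size in bytes.

```python
from dataclasses import dataclass


@dataclass
class CopyRegs:
    """Hypothetical architectural state: the three registers the sequence updates."""
    dst: int  # index of the next destination byte
    src: int  # index of the next source byte
    n: int    # bytes still to copy


def _copy_byte(mem, r):
    mem[r.dst] = mem[r.src]
    r.dst += 1
    r.src += 1
    r.n -= 1


def precondition(mem, r, align):
    """Copy single bytes until dst is aligned for this core, or nothing is left."""
    while r.n > 0 and r.dst % align != 0:
        _copy_byte(mem, r)


def move(mem, r, align, max_words=None):
    """Copy whole aligned words; stop after max_words to model an interrupt.

    Faults if there is word-sized work to do but dst is not aligned for this
    core (the "move generates an interrupt" option).
    """
    if r.n >= align and r.dst % align != 0:
        raise RuntimeError("move: dst not aligned for this core")
    words = 0
    while r.n >= align and (max_words is None or words < max_words):
        mem[r.dst:r.dst + align] = mem[r.src:r.src + align]
        r.dst += align
        r.src += align
        r.n -= align
        words += 1


def finalize(mem, r, align):
    """Copy the trailing partial word; faults if a full word or more remains."""
    if r.n >= align:
        raise RuntimeError("finalize: more than a partial word left")
    while r.n > 0:
        _copy_byte(mem, r)


def run_from_precondition(mem, r, align, interrupt_after_words=None):
    """Run the whole sequence from the top on a core with the given alignment.

    Returns True if the copy completed, False if move was interrupted. In the
    interrupted case the registers hold the progress made so far.
    """
    precondition(mem, r, align)
    move(mem, r, align, interrupt_after_words)
    if r.n >= align:
        return False
    finalize(mem, r, align)
    return True
```

In this model, if a copy is interrupted inside move on an align=8 core and then picked up on an align=32 core, calling move directly can fault because dst is only 8-byte aligned, whereas calling run_from_precondition again on the new core just works: the registers already say where the copy got to, so precondition simply redoes the alignment for whatever core it lands on. What the model does not capture is the hard part, namely how the hardware knows where "the top" is.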