By: rwessel (rwessel.delete@this.yahoo.com), September 30, 2021 10:39 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on September 30, 2021 9:48 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on September 29, 2021 11:22 pm wrote:
> > Certainly. But I still don't see the point of the separate setup and finalize instructions - detecting
> > those conditions is trivial (if the destination address has any low bits set, do "first", if you've fallen
> > out of the "middle" loop, and the length is not zero, do
> > a "last"). Internalizing that stuff would probably
> > make it easier to sneak up on page boundaries as well, at least for simpler implementations.
>
>
> Sure detecting that stuff is trivial, but they would effectively be three separate
> operations internally as you outline. So why not make that explicit and reduce
> the amount of state you have to carry when the operation is interrupted?
If it leads to good memcpy() performance, with simple implementations, I'm all for whatever they've done.
That being said, the separation is at least a bit artificial, and that presents a few potential problem areas. First, the three-instruction scheme makes optimizing fairly short memcpy()s difficult - you'll have to execute all three instructions no matter what.
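To make the split concrete, here's a rough C sketch of the kind of first/middle/last decomposition being discussed - the function name, word size and phase boundaries are just illustrative, not a description of how the actual instructions behave:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void copy_three_phase(unsigned char *dst, const unsigned char *src, size_t len)
{
    /* "first": copy bytes until the destination is word aligned */
    while (len && ((uintptr_t)dst & (sizeof(uintptr_t) - 1))) {
        *dst++ = *src++;
        len--;
    }

    /* "middle": one full, destination-aligned word store per iteration;
       the source may still be misaligned */
    while (len >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, src, sizeof w);   /* tolerates a misaligned source */
        *(uintptr_t *)dst = w;       /* aligned destination word store */
        dst += sizeof(uintptr_t);
        src += sizeof(uintptr_t);
        len -= sizeof(uintptr_t);
    }

    /* "last": copy whatever bytes remain */
    while (len--)
        *dst++ = *src++;
}

Even a two-byte copy still walks (or at least has to issue) all three phases, which is where the short-copy overhead comes from.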
At least for simple implementations, requiring that the middle and final instructions operate on aligned words (at least for the destination) poses some challenges around page boundaries. If nothing else, having to store a full aligned word in every cycle will require that an operation crossing a page boundary be able to handle up to three page faults (a misaligned source word can span two source pages, plus the page the destination word lands on). If the state requirements were looser, the instruction could step more delicately over a page boundary, eliminating the need to handle the third page fault. A truly high-end implementation may care about that less than a "medium" one.
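As a worked example of the three-fault case, here's one "middle"-style iteration with purely hypothetical addresses (4KB pages, 8-byte word assumed):

#include <stdint.h>
#include <string.h>

/* Suppose src = 0x...1ffd and dst = 0x...3ff8. The misaligned 8-byte
 * source read covers 0x1ffd-0x2004, so it touches the page ending at
 * 0x1fff and the page starting at 0x2000; the aligned destination store
 * sits entirely on a third page. One word copied, up to three faults. */
static void middle_step(unsigned char *dst, const unsigned char *src)
{
    uint64_t w;
    memcpy(&w, src, sizeof w);   /* load may straddle a source page boundary */
    memcpy(dst, &w, sizeof w);   /* one full destination-aligned word store */
}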
Also, separating the start, middle and end instructions requires either architecting the state those instructions store or use, or you'll have trouble migrating running code to cores that might have different implementations (say from a big core to a little core, to another core in a cluster, or across a VM migration). And if you do architect that state, you run the risk of fixing things like the effective word size, which may impact future implementations (either by limiting their word sizes, or by requiring their "middle" instruction to handle partially aligned operands).
> This is still in theory a RISC ISA after all, even though John Cocke might have considered these
> instructions (whether one or three) an immediate disqualification from being considered as such :)
I've certainly never been a RISC purist, whatever the heck that might mean this week. But essentially all code I work on spends a whole lot more time moving storage around than doing (say) FP multiplies, or any vector stuff, and look at how many transistors people have been willing to burn for those (not that I object to those facilities).