By: rwessel (rwessel.delete@this.yahoo.com), February 23, 2021 1:21 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on February 23, 2021 11:44 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on February 23, 2021 9:33 am wrote:
> > No one would object to rep/movsb being slow if the operands overlap.
>
> Note that one problem with rep/movsb is that "overlap" is hard to figure out.
>
> They might not overlap in virtual memory, but still overlap in physical pages.
>
> And while that doesn't matter for memcpy, it does matter for
> movsb, which is technically defined even for that case.
Is that really a problem? Just check for aliasing at startup, and as each operand crosses page boundaries.
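To make concrete why overlap matters for movsb but not for memcpy, here is a minimal C sketch. It is only a software model, not how any CPU implements the instruction: `copy_bytewise_forward` stands in for the architectural definition (one byte at a time, ascending, DF=0), and `copy_chunked_forward` stands in for a hypothetical fast path that loads a whole chunk before storing it. The function names and the 8-byte `CHUNK` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 8 };  /* assumed width of the hypothetical fast path */

/* Model of the architectural rep movsb behaviour with DF=0:
   strictly one byte at a time, in ascending address order. */
static void copy_bytewise_forward(unsigned char *dst,
                                  const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Model of a fast path: each chunk is fully loaded into a
   temporary before any byte of it is stored. */
static void copy_chunked_forward(unsigned char *dst,
                                 const unsigned char *src, size_t n)
{
    unsigned char tmp[CHUNK];
    assert(dst != NULL && src != NULL);
    while (n > 0) {
        size_t k = n < CHUNK ? n : CHUNK;
        memcpy(tmp, src, k);  /* load the whole chunk first */
        memcpy(dst, tmp, k);  /* then store it */
        dst += k;
        src += k;
        n -= k;
    }
}
```

With a 16-byte buffer holding "ABCDEFGHIJKLMNOP" and dst = src + 1, the bytewise model smears the first byte across the whole buffer, while the chunked model does not. A fast implementation therefore has to detect that case and fall back, which is the check being argued about here.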
> Same goes for MMIO memory. If you do memcpy() on MMIO memory, you get whatever random
> end results. But for movsb it's actually architecturally defined, and usually not
> what you want (ie the definition is the "go slow, one byte at a time").
I don't know how you get around that. If you do a byte access to MMIO memory, the hardware *has* to do byte accesses. Using movsq* instead will help some, as would something that supported a bigger word than that (a hypothetical movs128 or movs256 wouldn't require any tricks, since the moved word is never exposed, but something like a stos128 might need to reference a register pair or one of the vector registers).
> So when memcpy can make decisions based on just comparing addresses and can say "screw
> it" to both physical aliasing and MMIO, movsb needs to actually do a TLB probe.
That has to happen anyway, or at least once movsb tries to actually read or write the data being moved. And the CPU already has to back that sort of thing out in the case of an exception. If overlap is detected there, it backs up and restarts in slow mode. Again, that has to be redone as each operand crosses a page boundary. And that should all happen in the load/store unit.
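A rough software model of that scheme, as a sketch under stated assumptions rather than a description of real microcode: the copy proceeds in spans that end wherever either operand reaches a page boundary, the page check is redone for every span, and a span falls back to byte-at-a-time only when both operands sit in the same page. `phys_page_of` is a hypothetical stand-in for the TLB lookup and simply uses the virtual address, so true physical aliasing of distinct virtual pages is not modelled. `MODEL_PAGE` and `movs_model` are likewise names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MODEL_PAGE = 4096 };  /* assumed page size for the model */

/* Hypothetical stand-in for a TLB lookup. Real hardware would
   compare physical page numbers; this model uses virtual ones. */
static uintptr_t phys_page_of(const void *p)
{
    return (uintptr_t)p / MODEL_PAGE;
}

/* Bytes left before p crosses into the next page. */
static size_t bytes_to_page_end(const void *p)
{
    return MODEL_PAGE - (size_t)((uintptr_t)p % MODEL_PAGE);
}

/* Forward copy that rechecks for overlap at every page crossing.
   Returns how many bytes were moved in slow (bytewise) mode. */
static size_t movs_model(unsigned char *dst, const unsigned char *src,
                         size_t n)
{
    size_t slow = 0;
    assert(dst != NULL && src != NULL);
    while (n > 0) {
        /* The span ends where either operand hits a page boundary. */
        size_t span = n;
        if (bytes_to_page_end(src) < span) span = bytes_to_page_end(src);
        if (bytes_to_page_end(dst) < span) span = bytes_to_page_end(dst);

        if (phys_page_of(src) == phys_page_of(dst)) {
            /* Possible overlap: slow mode, one byte at a time. */
            for (size_t i = 0; i < span; i++)
                dst[i] = src[i];
            slow += span;
        } else {
            /* Different pages, so the two spans are disjoint and a
               bulk move gives the same result as the bytewise one. */
            memcpy(dst, src, span);
        }
        dst += span;
        src += span;
        n -= span;
    }
    return slow;
}
```

Copying one page-aligned page onto the next never enters slow mode in this model, while dst = src + 1 inside a single page runs entirely in slow mode. The same-page test is deliberately conservative: it also slows down copies within one page that do not actually overlap, which a real design would presumably want to refine.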
> I think "rep movsb" is really really close to being the perfect hardware interface for "memcpy", but the
> above issues and the DF bit do make for it being much harder to just generate simpler optimal ucode.
>
> So the best option might be to specify a new instruction that looks exactly like "rep movsb"
> but specifies that DF is ignored, and that it might be moving things in bigger chunks (so that
> MMIO and physical aliases get the "memcpy" semantics, not the "byte at a time" ones).
>
> One potential way would be to say "doubled-up rep prefix
> means new semantics", and have a bit in CR4 enable it.
I could see that helping a bit, particularly on short operands, but I don't think it's necessary (except, perhaps, the DF part). Maybe an eeee!vex prefix to specify different registers as well, if we're dreaming.
I'd also like an instruction with two proper addresses, and a limited register- or constant-based length (IOW, S/360 MVC). But then I also want a date with Jennifer Aniston. The latter might be more realistic.
*I am assuming that if movsb got fixed, the bigger variants would be fixed too.