By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), February 23, 2021 12:44 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on February 23, 2021 9:33 am wrote:
> No one would object to rep/movsb being slow if the operands overlap.
Note that one problem with rep/movsb is that "overlap" is hard to figure out.
They might not overlap in virtual memory, but still overlap in physical pages.
And while that doesn't matter for memcpy, it does matter for movsb, which is technically defined even for that case.
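To make the aliasing case concrete, here is a minimal user-space sketch, assuming a system where mapping the same file twice gives two shared mappings backed by the same pages. The helper name `make_aliased_views` is made up for illustration. The two views sit at different virtual addresses, so an address comparison says "no overlap", yet a store through one should be visible through the other.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def make_aliased_views():
    """Map the same one-page temp file twice (hypothetical helper).

    Returns the file object plus two independent mmap objects. The
    mappings are shared by default, so both are backed by the same
    underlying page even though they are distinct virtual ranges.
    """
    f = tempfile.TemporaryFile()
    f.truncate(PAGE)                      # make the file one page long
    view_a = mmap.mmap(f.fileno(), PAGE)  # first virtual mapping
    view_b = mmap.mmap(f.fileno(), PAGE)  # second virtual mapping, same file
    return f, view_a, view_b


f, view_a, view_b = make_aliased_views()
view_a[0:5] = b"hello"       # store through the first mapping
aliased = bytes(view_b[0:5])  # load through the second mapping
print(aliased)
```

A memcpy() between `view_a` and `view_b` would pass any address-based overlap test, which is exactly the case that is undefined for memcpy but still defined for movsb.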
Same goes for MMIO memory. If you do memcpy() on MMIO memory, you get whatever random end results. But for movsb it's actually architecturally defined, and usually not what you want (i.e. the definition is the "go slow, one byte at a time").
So when memcpy can make decisions based on just comparing addresses and can say "screw it" to both physical aliasing and MMIO, movsb needs to actually do a TLB probe.
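As a rough illustration of the difference, here is a simplified model, not real microcode. `ranges_overlap` is the kind of plain address comparison a memcpy implementation can get away with, and `rep_movsb` models the byte-at-a-time definition, including the DF direction. Both function names and the flat `bytearray` standing in for memory are assumptions for the sketch. It ignores faults, segmentation, address-size wraparound and everything else the real instruction has to handle.

```python
def ranges_overlap(dst, src, n):
    """Address-only overlap test: all a memcpy needs to look at."""
    return dst < src + n and src < dst + n


def rep_movsb(mem, dst, src, count, df=0):
    """Model of the architectural semantics: one byte per iteration.

    df=0 walks addresses upward, df=1 walks them downward, the way
    the direction flag does. Each store is visible to later loads.
    """
    step = -1 if df else 1
    while count:
        mem[dst] = mem[src]
        dst += step
        src += step
        count -= 1


# Overlapping forward copy: the byte-at-a-time definition smears the
# first byte across the whole destination.
mem_bytewise = bytearray(b"abcdefgh")
rep_movsb(mem_bytewise, 1, 0, 7)

# A copy that moves the whole range as one chunk (load all, then store
# all) gives a different answer for the same operands.
mem_chunked = bytearray(b"abcdefgh")
mem_chunked[1:8] = bytes(mem_chunked[0:7])

print(ranges_overlap(1, 0, 7), mem_bytewise, mem_chunked)
```

The two results differ, and only the first is what the architecture promises. The address comparison catches this particular case, but it cannot see physical aliases or MMIO, which is why the hardware ends up needing more than a compare.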
I think "rep movsb" is really really close to being the perfect hardware interface for "memcpy", but the above issues and the DF bit do make for it being much harder to just generate simpler optimal ucode.
So the best option might be to specify a new instruction that looks exactly like "rep movsb" but specifies that DF is ignored, and that it might be moving things in bigger chunks (so that MMIO and physical aliases get the "memcpy" semantics, not the "byte at a time" ones).
One potential way would be to say "doubled-up rep prefix means new semantics", and have a bit in CR4 enable it.
Linus