By: dmcq (dmcq.delete@this.fano.co.uk), September 19, 2021 1:36 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on September 19, 2021 1:06 pm wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on September 17, 2021 8:59 am wrote:
> > Doug S (foo.delete@this.bar.bar) on September 16, 2021 10:57 pm wrote:
> > >
> > > It is easy to declare instructions for memcpy(). The devil is in the details of
> > > the implementation. How long did it take Intel to get it even halfway right?
> >
> > I think the ARM instructions look very reasonable, and are probably
> > not too bad to get right with just a few test cases.
> >
> > Famous last words.
> >
> > The memset case is most definitely the easier of the two,
> > since it doesn't have any issues with two different
> > pointer alignments. And splitting it up into three separate instructions ("initial align, loop over bulk,
> > final partial case") makes a lot of sense. It really should be fairly hard to screw up too badly.
> >
> > memcpy is more complicated, but it really shouldn't be horrible either.
> >
> > The most complicated case I see does come from the "three instructions" thing: making
> > sure it's doing everything properly if they are done individually. And "done individually"
> > happens for the "restart for exceptions or interrupts" case, even if the instructions
> > are right next to each other in the right order in the instruction stream.
> >
> > That's particularly true for the memcpy case, because it would be conceptually
> > sensible to always
> > do that first instruction (even for the "destination is already aligned" case) just to start the "have
> > previous source buffer ready for shifting with the next one" for the mutually unaligned case.
> >
> > Maybe when you take a trap on the middle instruction, the saved state will point to the first
> > instruction so that you always restart there (so that it looks like one atomic sequence)?
> >
> > IOW, the "three instruction" model really makes sense when you flow from one state to the
> > next, but it also adds its own excitement for the "(re)start in the middle of the sequence"
> > case. As per above, I think you can make that case an invalid situation, though.
>
> Presentation and video of the new instructions:
> https://connect.linaro.org/resources/lvc21f/lvc21f-113/
>
> The Exclamation means the register updates, and I think F means Forward.
> My guess is the middle instruction does vector aligned copies, but
> that can change per CPU design, so you always need all three?
>
> > And yes, I like "rep stos/movs" too, but I've also talked here about at least part of what makes
> > that a "good, but not perfect" interface. It has a lot of good things going for it (that whole
> > interruptibility is quite natural), but it does have some real complicating issues too from its
> > historical semantics (uncached and overlapping range semantics are the two big ones, I feel).
> >
> > So the intel implementation has to jump through some hoops due
> > to compatibility concerns, that the ARM model doesn't need to.
> >
> > But it's going to be some time before we see any implementation
> > of the ARM thing, so I guess we'll have to wait and see.
> >
> > I'm obviously happy to see this, and it looks sane to me. But you're right, implementations
> > aren't here yet, and maybe it won't look as rosy in a few years.
> >
> > Linus
It'd be nice if all the cache operations were like this too. It's bad to make cache line lengths more visible than they have to be. But it is definitely going to make the hardware more complicated.
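
To make the "three instructions" split from the quoted discussion concrete, here is a rough software model of a memset done as "initial align, loop over bulk, final partial case". It is only a sketch: it does not claim to match the actual ARM instruction semantics, and `BLOCK`, `struct set_state` and the `set_prologue`/`set_main`/`set_epilogue` names are made up for illustration. The point it tries to show is that the caller only ever sees a pointer and a remaining length, while the chunk size stays an internal detail, and that re-running a step after an interruption is harmless because each step just consumes whatever state is left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model only: BLOCK stands in for whatever chunk size an
 * implementation prefers. Callers never see it. */
#define BLOCK 16u

/* State a caller would keep in registers between the three steps. */
struct set_state {
    unsigned char *dst;   /* next byte to write */
    size_t n;             /* bytes still to write */
    unsigned char value;  /* byte value to store */
};

/* Step 1: write single bytes until dst is BLOCK-aligned or nothing is left. */
static void set_prologue(struct set_state *s)
{
    while (s->n > 0 && ((uintptr_t)s->dst % BLOCK) != 0) {
        *s->dst++ = s->value;
        s->n--;
    }
}

/* Step 2: write whole blocks for as long as dst is aligned and a full
 * block remains. Does nothing if dst is unaligned. */
static void set_main(struct set_state *s)
{
    while (s->n >= BLOCK && ((uintptr_t)s->dst % BLOCK) == 0) {
        memset(s->dst, s->value, BLOCK);
        s->dst += BLOCK;
        s->n -= BLOCK;
    }
}

/* Step 3: write whatever remains, byte by byte. */
static void set_epilogue(struct set_state *s)
{
    while (s->n > 0) {
        *s->dst++ = s->value;
        s->n--;
    }
}
```

Calling the three functions in order on the same `struct set_state` fills the range. Calling `set_main` a second time in the middle (a crude stand-in for "restart after an interrupt") changes nothing, since the updated `dst` and `n` already describe what is left to do.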