By: Adrian (a.delete@this.acm.org), October 2, 2021 2:45 am
Room: Moderated Discussions
rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> Linus Torvalds (torvalds.delete@this.linux-foundation.org) on October 1, 2021 11:01 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on October 1, 2021 5:04 am wrote:
> > >
> > > Like, first instruction brings destination to [coarse] aligned boundary etc...
> >
> > It could be even simpler.
> >
> > The first instruction might not do anything about the actual copy at all.
> >
> > It might just do pure bookkeeping functionality, like "check overlapping ranges" or "check
> > if it's large enough and mutually aligned so that you can do cacheline level optimizations".
> > Things like setting flags to say how to copy (kind of like how x86 uses the DF flag).
> >
> > That would make the first instruction fairly uninteresting, and the second instruction
> > would be the one that does all the repeating work (with the third instruction doing
> > what? Maybe the final tail, maybe just some internal state cleanup?)
> >
> > But if the restart happens on the second instructions, I
> > don't know where the first instruction would squirrel
> > away any state information it has determined, though. It
> > would have to be in some architected register state,
> > so that nested memory copies work (ie taking a page fault, doing another memory copy in the kernel or VMM).
> >
> > So I personally think it would be best to always cause restarts to restart at the first instruction,
> > exactly so that you could have magic micro-architectural hidden state. If you always restart
> > at the first instruction, you could literally have hidden "previous read" buffers for the
> > mutually unaligned case, hidden "do it with cache transfers" flags, or direction flags etc,
> > and never expose your random microarchitectural choices anywhere else.
> >
> > And so it would allow you to migrate cleanly between different microarchitectures
> > (either BIG.little or just VM migration) without any odd special cases.
> >
> > VM migration is an interesting case, and having it happen in the middle of a big memory copy is not
> > at all some kind of exceptionally unusual situation. So any model that does something special in
> > the first instruction - and then exposes restarts on the second one - sounds a bit iffy to me.
> >
> > IOW, restart at the first instruction really seems like the technically correct solution.
> >
> > This is something the x86 "rep movs" got right. No odd partial instruction restart cases.
> >
> > Of course, "rep movs" has other problems, so..
> >
> > Linus
>
> Why handle memcpy with microcoded instructions/cracked uOPs?
>
> Wouldn't a simple DMA unit be able to handle this? IE, if you can setup DMA from various IO controllers
> to RAM, then maybe this is all you need. (Atleast the non-overlapping case should be fine).
>
> AFAICS, it will simplify the CPU implementation a bit as well. So, why
It would be better if the copy/set operations were done not in the core but by a special unit in the cache controller or in the memory controller. However, the farther from the core that unit is, the more difficult it becomes to handle page faults correctly.
For better performance, the memcpy/memset should let all other instructions continue executing without delays, but then there must be a way to check whether the copy/set has finished, maybe by checking whether the count register has reached zero.
That would also work with an asynchronous DMA unit, which would update the core registers only at the end of the operation to signal completion. There are, however, various weird cases with page faults, e.g. what happens if the page tables are modified while the copy/set is still in progress.
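
To make the "check whether the count register has reached zero" idea concrete, here is a toy software model, not a description of any real hardware. The class CopyEngine, its chunk parameter and the step()/done() methods are all hypothetical names invented for this sketch. It assumes non-overlapping ranges and ignores page faults entirely, which is exactly the hard part mentioned above. In this variant the count is updated after every chunk; a DMA-style unit that writes the registers back only at the end would look the same to the polling code.

```python
class CopyEngine:
    """Toy model of an asynchronous copy unit (hypothetical, not real hardware)."""

    def __init__(self, memory, chunk=4):
        self.mem = memory          # a bytearray standing in for RAM
        self.chunk = chunk         # bytes moved per background step (assumption)
        # Stand-ins for the architected source, destination and count registers.
        self.src = 0
        self.dst = 0
        self.count = 0

    def start(self, dst, src, count):
        """Program the registers; the copy itself proceeds in later steps."""
        self.dst, self.src, self.count = dst, src, count

    def step(self):
        """Advance the copy by at most one chunk, as if running in the background."""
        n = min(self.chunk, self.count)
        self.mem[self.dst:self.dst + n] = self.mem[self.src:self.src + n]
        self.src += n
        self.dst += n
        self.count -= n

    def done(self):
        """The completion check: the count register has reached zero."""
        return self.count == 0


def demo():
    mem = bytearray(b"hello world!" + bytes(12))
    engine = CopyEngine(mem)
    engine.start(dst=12, src=0, count=12)
    other_work = 0
    while not engine.done():
        other_work += 1            # the core keeps executing unrelated instructions
        engine.step()              # in hardware this would happen concurrently
    return bytes(mem[12:24]), other_work
```

Calling demo() copies the first 12 bytes to offset 12 while the loop counts how many units of unrelated work were interleaved before the count hit zero. Because src, dst and count are the only state, a copy interrupted between steps could be resumed from those three values alone, which is the property the restart discussion above is concerned with.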