By: Adrian (a.delete@this.acm.org), October 2, 2021 10:15 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on October 2, 2021 9:47 am wrote:
> Adrian (a.delete@this.acm.org) on October 2, 2021 3:45 am wrote:
> > rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > > Why handle memcpy with microcoded instructions/cracked uOPs?
> > >
> > > Wouldn't a simple DMA unit be able to handle this? I.e., if you can set up DMA from various IO controllers
> > > to RAM, then maybe this is all you need. (At least the non-overlapping case should be fine.)
> > >
> > > AFAICS, it will simplify the CPU implementation a bit as well. So, why
> >
> >
> > It would be better if the copy/set operations were done not in the core, but by
> > a special unit in the cache controller or in the memory controller; however, the farther
> > from the core it is, the more difficult the correct handling of page faults becomes.
> >
> > For better performance, the memcpy/memset should allow the execution of all other instructions
> > to proceed without delays, but then there must be a way to check whether the copy/set has
> > finished, perhaps by checking whether the count register has reached zero.
> >
> > That would also work with an asynchronous DMA unit, which would update the core registers only at
> > the end of the operations, signalling completion, but there are various weird cases with page faults,
> > e.g. what happens if the page tables are modified while the copy/set is still in progress?
>
>
> ARM's mem* ISA may leave this possibility open to the implementation - maybe the guesses we've made
> about how the three instructions work are wrong, and one instruction is a synchronous operation and
> the others are a pair to start an asynchronous operation and check if it is complete? (Though the letters
> in the opcodes don't really lend themselves to that, at least as far as I could come up with)
>
> I think the value of allowing a core to execute other instructions while a memory operation
> is ongoing is reduced the more cores we have available to us. When we had only one or two cores,
> this would have been a big deal. Now that we have a half dozen, dozen, or more, having one
> core effectively unavailable while it is managing a large copy or zero operation is probably
> not worth the complications that may arise from making the operation asynchronous.
>
> Even if the instructions work as we've surmised and are sort of a pre/main/post triplet, there's nothing
> stopping implementations from using clever tricks like DMA, cache magic, or whatever to make the
> main part of the operation happen. If that core can't be used for other tasks until it is complete
> (or an interrupt occurs), the actual work could in some implementations occur outside the core, in
> a DMA unit, or in the cache, with the core more or less in a halt state during that time.
I agree.
In such an implementation, the advantage of doing the copy/set operation in a dedicated unit outside the core, with the core waiting for it to finish, would not be improved performance but reduced power consumption.
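
For what it's worth, here is a minimal sketch of how the synchronous, in-core interpretation might look from software, assuming the pre/main/epilogue reading of the three copy instructions we have been guessing at. The CPYP/CPYM/CPYE mnemonics and operand order are taken from the Armv8.8-A FEAT_MOPS description as I understand it, and a toolchain accepting -march=armv8.8-a (or the +mops extension) would be needed; whether a given implementation runs them in-core, in the cache, or in a DMA-like engine is left to the microarchitecture:

#include <stddef.h>

/* Sketch only: the prologue instruction may handle alignment and small
 * copies, the main instruction does the bulk of the work, and the epilogue
 * finishes the tail. All three update the destination, source and count
 * registers, so the sequence can be interrupted and restarted. */
static void mops_memcpy(void *dst, const void *src, size_t n)
{
    asm volatile(
        "cpyp [%0]!, [%1]!, %2!\n\t"
        "cpym [%0]!, [%1]!, %2!\n\t"
        "cpye [%0]!, [%1]!, %2!"
        : "+r"(dst), "+r"(src), "+r"(n)
        :
        : "memory");
}

From the core's point of view this is fully synchronous: when the epilogue retires, the copy is architecturally done, regardless of which unit actually moved the bytes.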
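
For contrast, a rough sketch of the asynchronous alternative from the quoted text above: a copy engine outside the core is started, the core keeps executing other instructions, and completion is detected by polling a remaining-count register until it reaches zero. Everything here is hypothetical; the register names, the layout, and the engine itself are invented for illustration and do not come from any real device or from the ARM spec:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory-mapped copy engine, for illustration only. */
struct copy_engine {
    volatile uint64_t src;     /* source address                  */
    volatile uint64_t dst;     /* destination address             */
    volatile uint64_t count;   /* bytes remaining; 0 == finished  */
    volatile uint64_t start;   /* write 1 to kick off the copy    */
};

static void async_copy_start(struct copy_engine *eng,
                             uint64_t dst, uint64_t src, size_t n)
{
    eng->src = src;
    eng->dst = dst;
    eng->count = n;
    eng->start = 1;            /* returns immediately; copy runs on its own */
}

static int async_copy_done(const struct copy_engine *eng)
{
    /* The "has it finished?" check: the engine counts down to zero. */
    return eng->count == 0;
}

The weird cases mentioned above are exactly what this simple model sweeps under the rug: who takes the page fault if the engine hits an unmapped page, and what happens if the page tables change while the copy is still in flight.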