By: Doug S (foo.delete@this.bar.bar), October 2, 2021 9:47 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on October 2, 2021 3:45 am wrote:
> rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > Why handle memcpy with microcoded instructions/cracked uOPs?
> >
> > Wouldn't a simple DMA unit be able to handle this? IE, if you can setup DMA from various IO controllers
> > to RAM, then maybe this is all you need. (At least the non-overlapping case should be fine).
> >
> > AFAICS, it will simplify the CPU implementation a bit as well. So, why
>
>
> It would be better if the copy/set operations would be done not in the core, but by
> a special unit in the cache controller or in the memory controller, but the farther
> from the core it is, the more difficult the correct handling of page faults becomes.
>
> For better performance, the memcpy/memset should allow the execution of all other instructions
> to proceed without delays, but then there must be a way to check if the copy/set has
> finished, maybe by checking whether the count register has become null.
>
> That would also work with an asynchronous DMA unit, which would update the core registers only at
> the end of the operations, signalling the end, but there are various weird cases with the page faults,
> e.g. what happens if the page tables are modified while the copy/set is still in progress
ARM's mem* ISA may leave this possibility open to the implementation - maybe the guesses we've made about how the three instructions work are wrong, and one is a synchronous operation while the other two are a pair that starts an asynchronous operation and checks whether it has completed? (Though the letters in the opcodes don't really lend themselves to that, at least as far as I could tell.)
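
To make that start/check split concrete, here's a rough software model of it in plain C. This is purely illustrative - nothing here is from the actual ISA or any toolchain; the function names and the "remaining" field standing in for a count register are my own invention for the sketch.

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical model of "start an async copy, then poll for completion". */
struct async_copy {
    unsigned char       *dst;
    const unsigned char *src;
    size_t               remaining;   /* stands in for the count register */
};

/* "Start" the operation: just capture the arguments. */
static void copy_start(struct async_copy *op, void *dst, const void *src, size_t n)
{
    op->dst = dst;
    op->src = src;
    op->remaining = n;
}

/* "Check" the operation: here we also advance it by one chunk, standing in
 * for whatever the hardware would have finished in the background.
 * Returns nonzero once the remaining count reaches zero. */
static int copy_poll(struct async_copy *op, size_t chunk)
{
    size_t step = op->remaining < chunk ? op->remaining : chunk;
    memcpy(op->dst, op->src, step);
    op->dst += step;
    op->src += step;
    op->remaining -= step;
    return op->remaining == 0;
}

int main(void)
{
    char src[64] = "the quick brown fox jumps over the lazy dog";
    char dst[64] = {0};
    struct async_copy op;

    copy_start(&op, dst, src, sizeof src);
    while (!copy_poll(&op, 16)) {
        /* other, independent work would run here */
    }
    puts(dst);
    return 0;
}
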
I think the value of allowing a core to execute other instructions while a memory operation is ongoing shrinks as the number of cores grows. When we had only one or two cores, this would have been a big deal. Now that we have a half dozen, a dozen, or more, avoiding having one core effectively unavailable while it manages a large copy or zero operation is probably not worth the complications that come with making the operation asynchronous.
Even if the instructions work as we've surmised and are sort of a pre/main/post triplet, there's nothing stopping implementations from using clever tricks like DMA, cache magic, or whatever to make the main part of the operation happen. Even if that core can't be used for other tasks until the operation completes (or an interrupt occurs), the actual work could, in some implementations, take place outside the core - in a DMA unit, or in the cache - with the core sitting in more or less a halt state during that time.
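
For what it's worth, here's roughly what I mean by a pre/main/post split, written as ordinary C rather than as any claim about the real instruction semantics (which we're still guessing at). The 16-byte alignment boundary and 64-byte block size are arbitrary choices for the sketch, and it assumes the non-overlapping case.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative pre/main/post memcpy split; non-overlapping buffers assumed. */
void split_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char       *d = dst;
    const unsigned char *s = src;

    /* "Pre": copy bytes until the destination reaches a 16-byte boundary. */
    while (n && ((uintptr_t)d & 15)) {
        *d++ = *s++;
        n--;
    }

    /* "Main": the aligned bulk - in hardware this is the part that could
     * plausibly be handed off to a DMA engine or done in the cache. */
    size_t bulk = n & ~(size_t)63;
    memcpy(d, s, bulk);
    d += bulk;
    s += bulk;
    n -= bulk;

    /* "Post": whatever tail is left over. */
    while (n--) {
        *d++ = *s++;
    }
}

The point of the sketch is just that only the "main" phase is a candidate for offload; the head and tail fixups are small and stay cheap to do in the core either way.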