By: Doug S (foo.delete@this.bar.bar), October 2, 2021 6:49 pm
Room: Moderated Discussions
rwessel (rwessel.delete@this.yahoo.com) on October 2, 2021 11:37 am wrote:
> Doug S (foo.delete@this.bar.bar) on October 2, 2021 9:47 am wrote:
> > Adrian (a.delete@this.acm.org) on October 2, 2021 3:45 am wrote:
> > > rpg (a.delete@this.b.com) on October 2, 2021 2:51 am wrote:
> > > > Why handle memcpy with microcoded instructions/cracked uOPs?
> > > >
> > > > Wouldn't a simple DMA unit be able to handle this? I.e., if you can set up DMA from various IO controllers
> > > > to RAM, then maybe this is all you need. (At least the non-overlapping case should be fine.)
> > > >
> > > > AFAICS, it will simplify the CPU implementation a bit as well. So, why
> > >
> > >
> > > It would be better if the copy/set operations would be done not in the core, but by
> > > a special unit in the cache controller or in the memory controller, but the farther
> > > from the core it is, the more difficult the correct handling of page faults becomes.
> > >
> > > For better performance, the memcpy/memset should allow the execution of all other instructions
> > > to proceed without delays, but then there must be a way to check if the copy/set has
> > > finished, maybe by checking whether the count register has become null.
> > >
> > > That would also work with an asynchronous DMA unit, which would update the core registers only at
> > > the end of the operations, signalling the end, but there are various weird cases with the page faults,
> > > e.g. what happens if the page tables are modified while the copy/set is still in progress
> >
> >
> > ARM's mem* ISA may leave this possibility open to the implementation - maybe the guesses we've made
> > about how the three instructions work are wrong, and one instruction is a synchronous operation and
> > the others are a pair to start an asynchronous operation and check if it is complete? (Though the letters
> > in the opcodes don't really lend themselves to that, at least as far as I could come up with)
> >
> > I think the value of allowing a core to execute other instructions while a memory operation
> > is ongoing is reduced the more cores we have available to us. When we had only one or two cores,
> > this would have been a big deal. Now that we have a half dozen, dozen, or more, having one
> > core effectively unavailable while it is managing a large copy or zero operation is probably
> > not worth the complications that may arise from making the operation asynchronous.
> >
> > Even if the instructions work as we've surmised and are sort of a pre/main/post triplet there's nothing
> > stopping implementations from using clever tricks like DMA, cache magic, or whatever to make the
> > main part of the operation happen. If that core can't be used for other tasks until it is complete
> > (or an interrupt occurs) the actual work could in some implementations occur outside the core, in
> > a DMA unit, or in the cache; with the core more or less in a halt state during that time.
>
>
> But we really do want this to be useful for normal sized copies - on the order of a hundred
> bytes, or even less. And certainly it should be sane for rather smaller copies than that
> (if not necessarily the most efficient possible if the size is known in advance). Punting
> this to DMA units is likely to have fairly horrific startup overhead for that sized copy.
Just because it CAN use DMA doesn't mean it would be required to. The hardware in a particular implementation would be in the best position to know the cost/benefit tradeoffs. It is more likely to make sense for operations sized in one or more pages than your run-of-the-mill strcpy() or bzero() in application C code.
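To make that tradeoff concrete, here is a minimal sketch in C of the kind of size-based dispatch an implementation (or a runtime library) might do. The PAGE_SIZE threshold and the dma_copy() function are assumptions for illustration, not a real hardware interface - the point is only that small copies stay in the core while page-sized ones might justify offload setup costs.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096  /* assumed page size for the threshold */

/* Hypothetical stand-in for an asynchronous DMA engine. In real
 * hardware this would program the engine and wait (or poll) for
 * completion; here it just falls back to a plain memcpy. */
static void dma_copy(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}

static void smart_copy(void *dst, const void *src, size_t n) {
    if (n >= PAGE_SIZE) {
        /* Page-sized or larger: offload may amortize its setup cost. */
        dma_copy(dst, src, n);
    } else {
        /* Run-of-the-mill small copy: offload startup overhead would
         * dominate, so just do the work in the core. */
        memcpy(dst, src, n);
    }
}
```

Nothing here requires the hardware to expose the choice to software; the same decision could be made internally by the microcoded instruction based on the count register.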