By: rwessel (rwessel.delete@this.yahoo.com), October 2, 2021 4:57 am
Room: Moderated Discussions
Doug S (foo.delete@this.bar.bar) on October 1, 2021 11:17 am wrote:
> rwessel (rwessel.delete@this.yahoo.com) on October 1, 2021 8:14 am wrote:
> > Etienne Lorrain (etienne_lorrain.delete@this.yahoo.fr) on October 1, 2021 7:55 am wrote:
> > > rwessel (rwessel.delete@this.yahoo.com) on October 1, 2021 5:10 am wrote:
> > > > > My latest theory is that all three instructions form a unit and if there is an interrupt the restart
> > > > > is at the first instruction.
> > >
> > > Alternatively they are using a real DMA which goes through the memory caches.
> > > The first instruction setup the DMA.
> > > The second instruction wait for the DMA to finish, and if interrupted pause that DMA so that
> > > the memory bus is not completely saturated while executing the interrupt treatment.
> > > The third instruction check if the DMA was finished, if not the second instruction
> > > was interrupted and needs to be re-executed (so jump backward one instruction).
> > > If the DMA is finished, then free/power down the DMA machinery.
> > >
> > > That has the advantage of not using registers, not copying in smaller unit than the cache line, so
> > > not needing to read the *destination* memory into the cache line - just in case the copy is interrupted.
> > > To repeat, if processor is interrupted, the end of the cache line has to be coherent, so has to be
> > > read from memory (processor doesn't know yet if it will be interrupted) before being written.
> > >
> > > The problem of interrupting a memcopy and then try to re-execute another memcopy from inside the interrupt
> > > treatment has to be solved by saving the exact state of the DMA in available registers.
> > >
> > > It is out of the question to re-read part of the source, or write twice
> > > the destination (after interrupt) in case of I/O or uncached memcopy.
> >
> >
> > You still either need to architect the (saved) state of the DMA engine, or you have
> > trouble moving the running code to different cores after an interrupt (big/little, intra-cluster,
> > VM migrations). And you now need the OS to save that additional state.
>
>
> The '!' notation on ARM means the registers for these new opcodes are modified, right? There
> are more bits in the registers than are needed to represent an address, at least until we
> get hardware with 64 bit VA though tags and PAC may conflict. The 'length' argument does
> not have this problem. It is probably bounded in some way - if 2^32-1 is the largest size
> it supports that leaves 32 unused bits, plenty of room for other state to be saved.
Those sorts of things do tend to come back to bite you later, though. But it's fine so long as the limit is far enough off that they can do a "fixed" implementation before it becomes an issue.
One immediate downside of limiting copy length is that it makes it harder for compilers to inline memcpys. They need to prove that the length being copied is within the instruction's limit, or generate the required loop. The loop is no problem for actual long copies, but becomes painful if the actual copies are short (but you can't prove it).
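To make the inlining cost concrete, here is a rough C sketch of what a compiler would have to emit when it can't prove the length fits. Everything in it is an assumption for illustration: `hw_copy_bounded` is a hypothetical stand-in for the length-limited instruction (it just calls `memcpy` so the sketch runs), and `limit` is passed as a parameter only so small values can be tried; in practice it would be a constant fixed by the architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a copy instruction that only accepts
   lengths up to `limit`. Not a real intrinsic; memcpy does the work. */
static void hw_copy_bounded(void *dst, const void *src,
                            size_t len, size_t limit)
{
    assert(len <= limit);
    memcpy(dst, src, len);
}

/* What gets emitted when len cannot be proven <= limit: a chunking
   loop around the bounded copy. Returns how many bounded copies ran. */
static size_t copy_with_limit(void *dst, const void *src,
                              size_t len, size_t limit)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t calls = 0;

    assert(limit > 0);
    while (len > limit) {          /* only iterates for long copies */
        hw_copy_bounded(d, s, limit, limit);
        d += limit;
        s += limit;
        len -= limit;
        calls++;
    }
    hw_copy_bounded(d, s, len, limit);  /* final (or only) piece */
    return calls + 1;
}
```

For a short copy the loop body never runs, but the compare and branch are still there at every inlined site, which is the painful part when the compiler can't prove the length is small.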
S/370 MVCL had just that problem, specifying 24-bit lengths* (the upper 8 bits being used to specify a pad character). They fixed that with the newer Move Long Extended.
*Covering the entire address space at the time.
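For anyone who hasn't seen the MVCL layout, the sharing of one register between a 24-bit length and an 8-bit pad character can be modelled roughly like this. It's a simplified illustration of a single 32-bit register, with helper names made up for the example, not actual S/370 code.

```c
#include <assert.h>
#include <stdint.h>

/* Low 24 bits hold the length, the upper 8 bits hold the pad byte. */
#define LEN_MASK 0x00FFFFFFu

/* Pack a length and pad byte into one 32-bit register image.
   Lengths above 2^24-1 simply don't fit, hence the limit. */
static uint32_t pack_len_pad(uint32_t len, uint8_t pad)
{
    assert(len <= LEN_MASK);
    return ((uint32_t)pad << 24) | len;
}

static uint32_t unpack_len(uint32_t reg)
{
    return reg & LEN_MASK;
}

static uint8_t unpack_pad(uint32_t reg)
{
    return (uint8_t)(reg >> 24);
}
```

The same trade-off applies to stashing extra state in the upper bits of a length register: it works until someone needs those bits for the length itself.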