By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), July 16, 2015 6:16 pm
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on July 16, 2015 1:00 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 16, 2015 9:33 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on July 15, 2015 8:29 am wrote:
> > > dmcq (dmcq.delete@this.fano.co.uk) on July 15, 2015 8:04 am wrote:
> > > > NoSpammer (no.delete@this.spam.com) on July 15, 2015 7:05 am wrote:
> > [snip]
> > > > > As for string instructions (rep stos/rep movs) I've relied on the
> > > > > observation that if you see the first bytes correctly and the last bytes correctly, you also probably
> > > > > see everything inbetween. Running CRC checks on such messages for about 3 years has shown no errors
> > > > > so I'm confident this works (why would the CPU push out string writes in random order).
> > > >
> > > > I doubt a CPU would want to push it out in random order, but
> > > > consider the problem in general with weak consistency.
> > > > If there are two memory controllers and the string occupies
> > > > three lines then the second memory controller might
> > > > be satisfying some requests by another CPU delaying the write of the middle section of the string.
> > > >
> > >
> > > As long as we are talking about ordering in WB/WT regions, order of arrival to external memory does
> > > not matter. What matters is *observed* order governed by cache lines ownership. And the later, in
> > > case of x86 strings instructions, can't deviate too far from the continuous, because on x86 strings
> > > instructions are interruptible/restartable and the only saved states are direction, two pointers and
> > > counter. With such minimal state saved only continuous operation can be correctly restarted.
> >
> > Does the state need to be architectural? For example, is there a guarantee that the count on interrupt
> > indicates the stopping point? If microarchitectural storage could be used to track what has been
> > done (and perhaps even offloading copying to a DMA engine on interrupt/thread switch), perhaps using
> > -1 as a magic value to check for a partial operation in this microarchitectural storage (the overhead
> > of such a check would be tiny for an actual maximum-sized operation and might be acceptable to allow
> > faster progress in the operation in the common case of no interruption).
>
>
> As defined, the state of the registers after an interrupted x86 string instructions
> do contain what's needed to restart the instruction, and are the values you'd expect
> (updated addresses and length based on the number of items already moved).
The way the definition is phrased does seem to imply more than mere restartability; it implies that restarting must work for a straightforward implementation ("The source and destination registers point to the next string elements to be operated on, the EIP register points to the string instruction, and the ECX register has the value it held following the last successful iteration of the instruction."). Even so, a mostly compatible version could be imagined where out-of-order copying could be performed. It might even be possible to support suspend to disk with wake-up compatibility with processors not providing such microarchitectural state.
The hard part seems to be determining when an out-of-order copy has been restarted and which copy is being restarted. Using the maximum count value (which is apparently zero, not -1, duh) as a hint that work has been done would violate the current definition, but probably not in a way that would be problematic. Determining that such a hint is true, and which of the "cached" partially completed copies is being restarted, seems more difficult. I don't think one can use the address of the context save area to uniquely tag the copy. Even the addresses, count, and page table base address might not be sufficient to definitively tag a copy. (The OS could change the page table base address or, worse, reassign it to a different address space. Hardware could monitor the page table base page and ignore the out-of-order work if it is modified, which would seem to solve the reassignment problem. However, I don't think that is enough to avoid problems under bizarre conditions which are technically legal.)
If the copy can be restarted from the farthest contiguously copied element, it would be possible to be mostly compatible (the difference being that memory which would eventually be overwritten may be overwritten early) and restartable in a compatible manner.
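To make the "farthest contiguous" idea concrete, here is a toy software model, not a description of any real hardware. The function names, the `order` list standing in for whatever order the hardware happens to complete elements in, and the `interrupt_after` parameter are all made up for illustration. It also assumes the source and destination do not overlap, since early overwriting would otherwise change the result.

```python
def interrupted_copy(src, dst, count, order, interrupt_after):
    """Toy model: copy elements in an arbitrary (hypothetical) order and
    stop after `interrupt_after` element copies.

    Returns the state an interrupt handler would see under the scheme in
    the text: (index of next element, remaining count), derived only from
    the farthest contiguously completed prefix.
    """
    done = [False] * count
    for step, i in enumerate(order):
        if step == interrupt_after:
            break  # interrupt arrives here
        dst[i] = src[i]
        done[i] = True

    # Work beyond the contiguous prefix is not reported; those elements
    # have merely been overwritten early.
    prefix = 0
    while prefix < count and done[prefix]:
        prefix += 1
    return prefix, count - prefix


def resume_copy(src, dst, next_index, remaining):
    """Plain in-order restart from the saved state, as a processor with no
    knowledge of the out-of-order work would do it."""
    for i in range(next_index, next_index + remaining):
        dst[i] = src[i]


src = list(range(8))
dst = [None] * 8
# Elements 0, 1, 4, 5, 2 complete before the interrupt.
next_index, remaining = interrupted_copy(
    src, dst, 8, order=[0, 1, 4, 5, 2, 6, 3, 7], interrupt_after=5)
resume_copy(src, dst, next_index, remaining)
```

In this run the saved state points at element 3 with 5 elements remaining, even though elements 4 and 5 were already written. The restart redoes those two, which is the "at worst redone" cost; retaining them would need the extra tagging discussed above.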
While it would be theoretically possible to mark the thread state in an entirely microarchitectural manner such that the out-of-order work could usually be retained (and at worst redone), I think more than just the ability to overwrite early would have to be added to the definition of x86. (For restarting after a simple interrupt where the page table base is not changed (and the page tables are not altered), it might be practical to avoid redoing out-of-order work even for an x86 extended merely to allow early overwriting.)