By: Michael S (already5chosen.delete@this.yahoo.com), July 17, 2015 4:39 am
Room: Moderated Discussions
rwessel (robertwessel.delete@this.yahoo.com) on July 16, 2015 9:34 pm wrote:
> Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 16, 2015 6:16 pm wrote:
> > rwessel (robertwessel.delete@this.yahoo.com) on July 16, 2015 1:00 pm wrote:
> > > Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 16, 2015 9:33 am wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on July 15, 2015 8:29 am wrote:
> > > > > dmcq (dmcq.delete@this.fano.co.uk) on July 15, 2015 8:04 am wrote:
> > > > > > NoSpammer (no.delete@this.spam.com) on July 15, 2015 7:05 am wrote:
> > > > [snip]
> > > > > > > As for string instructions (rep stos/rep movs) I've relied on the
> > > > > > > observation that if you see the first bytes correctly and the last bytes correctly, you also probably
> > > > > > > see everything in between. Running CRC checks on such messages for about 3 years has shown no errors
> > > > > > > so I'm confident this works (why would the CPU push out string writes in random order).
> > > > > >
> > > > > > I doubt a CPU would want to push it out in random order, but
> > > > > > consider the problem in general with weak consistency.
> > > > > > If there are two memory controllers and the string occupies
> > > > > > three lines then the second memory controller might
> > > > > > be satisfying some requests by another CPU delaying the write of the middle section of the string.
> > > > > >
> > > > >
> > > > > As long as we are talking about ordering in WB/WT regions, order of arrival to external memory does
> > > > > not matter. What matters is *observed* order governed by cache-line ownership. And the latter, in
> > > > > the case of x86 string instructions, can't deviate too far from contiguous, because on x86 string
> > > > > instructions are interruptible/restartable and the only saved state is the direction, two pointers and
> > > > > a counter. With such minimal state saved, only a contiguous operation can be correctly restarted.
> > > >
> > > > Does the state need to be architectural? For example, is there a guarantee that the count on interrupt
> > > > indicates the stopping point? If microarchitectural storage could be used to track what has been
> > > > done (and perhaps even offloading copying to a DMA engine on interrupt/thread switch), perhaps using
> > > > -1 as a magic value to check for a partial operation in this microarchitectural storage (the overhead
> > > > of such a check would be tiny for an actual maximum-sized operation and might be acceptable to allow
> > > > faster progress in the operation in the common case of no interruption).
> > >
> > >
> > > As defined, the state of the registers after an interrupted x86 string instruction
> > > does contain what's needed to restart the instruction, and the values are what you'd expect
> > > (updated addresses and length based on the number of items already moved).
> >
> > While the way the definition is phrased seems to imply that the instruction is not merely restartable
> > and that restarting must be possible for a straightforward implementation ("The source and destination
> > registers point to the next string elements to be operated on, the EIP
> > register points to the string instruction, and the ECX register has the value it held
> > following the last successful iteration of the instruction."), a mostly compatible version could
> > be imagined where out-of-order copying could be performed. It might even be possible to support suspend
> > to disk with wake-up compatibility with processors not providing such microarchitectural state.
> >
> > The hard part seems to be in determining when an out-of-order copy has been restarted and which copy
> > is being restarted. Using the maximum value (which is apparently zero not -1, duh) to hint that work
> > has been done would violate the current definition but probably not in a way that would be problematic.
> > Determining that such a hint is true and which of the "cached" partially completed copies is being
> > restarted seems more difficult. I don't think one can use the address of the context save area to uniquely
> > tag the copy. Even the addresses, count, and page table
> > base address might not be sufficient to definitively
> > tag a copy. (The OS could change the page table base address or worse reassign it to a different address
> > space. Hardware could monitor the page table base page and ignore the out-of-order work if modified,
> > which would seem to solve the reassignment problem. However, I don't think that is enough to avoid
> > problems under bizarre conditions which are technically legal.)
> >
> > If the copy can be restarted from the farthest contiguous value, it would be
> > possible to be mostly compatible (allowing memory which would eventually be
> > overwritten to be overwritten early) and restartable in a compatible manner.
> >
> > While it would be theoretically possible to mark the thread state in an entirely microarchitectural manner
> > such that the out-of-order work could usually be retained (and at worst redone), I think more than just the
> > ability to overwrite early would have to be added to the definition of x86. (For restarting after a simple
> > interrupt where the page table base is not changed (and the
> > page tables are not altered), it might be practical
> > to avoid redoing out-of-order work even for an x86 extended merely to allow early overwriting.)
>
>
> Certainly processors where internal state has been saved for interrupts, or even interruptable instructions,
> exist. A problem with that is that restoring internal state, especially after an exception, can
> be a real PITA. The early 68Ks are a good example - they'd dump a bunch of internal state onto
> the stack on an exception. Attempting a restart (say after a paging event), could be very interesting.
> S/370 has a number of instructions that do things like that, for example Test Page, although it's
> defined so that the entire (non-architected) state is stored in a register (and you actually need
> to zero the register before executing the instruction the first time).
>
> The latter approach would not be hard to adapt to x86, you just need the string
> instructions to save their state in a couple of additional registers.
>
> The question is whether or not this is actually worth it. You can clearly do a fair chunk of work between
> *allowing* interrupts. Assuming both pages (for the source and destination areas) are accessible (and
> they'd have to be), you can simply refuse to take an interrupt until the move progresses to the end of
> one of those pages. That would be an average of 2KB of work you can do in unusual orders between possible
> interrupts. It seems likely that would get you almost all of the benefit of out-of-order moving.
Not when a single thread has the memory subsystem of a Xeon E5-16xx v3 all to itself. The bandwidth-delay product of this thing is easily in the 4-5 KB range. And that's even before we take on multi-socket systems and high-latency fully-buffered memory links.
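A rough back-of-the-envelope check of that figure, as a sketch only. The channel count, transfer rate and latency below are my assumptions for a quad-channel DDR4-2133 configuration, not measurements, and sustained bandwidth is lower than the theoretical peak used here:

```python
def bandwidth_delay_product(bandwidth_bytes_per_s, latency_s):
    """Bytes that must be in flight to keep the memory pipe full."""
    return bandwidth_bytes_per_s * latency_s

# Assumed configuration (not from any measurement):
CHANNELS = 4                 # assumed quad-channel memory
TRANSFERS_PER_S = 2133e6     # assumed DDR4-2133
BYTES_PER_TRANSFER = 8       # 64-bit wide channel
ASSUMED_LATENCY_S = 70e-9    # assumed load-to-use DRAM latency

# Theoretical peak bandwidth, roughly 68 GB/s with the numbers above.
peak_bandwidth = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER

bdp_bytes = bandwidth_delay_product(peak_bandwidth, ASSUMED_LATENCY_S)
cache_lines_in_flight = bdp_bytes / 64  # 64-byte cache lines

print(f"BDP ~ {bdp_bytes:.0f} bytes, ~ {cache_lines_in_flight:.0f} cache lines")
```

With these assumed numbers the product lands between 4 KB and 5 KB, i.e. more than double the 2 KB average unit of work suggested above.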
> And
> you're not even limited to that, certainly the CPU *could* check if several consecutive pages were available
> for both the source and destination, and just treat that as the unit of work.
>
> A problem is that you want to limit the amount of work you do or you start to impact
> interrupt response time. Some of the zArch string instructions are defined to do a
> "CPU determined" amount of work (although most of those are capped at a page).
>
> You also don't need to synchronize the units of work and the interrupt testing too much.
> For example, you can check for the next unit of work while doing the prior unit of work, and
> get started on that if no interrupt is pending, even before finishing the current unit of
> work (you just can't take an interrupt until all started units of work are done).
>
That does not sound like a problem. For better or worse, x86 users have been accustomed to high interrupt latencies from the very beginning. The whole PC hardware/driver infrastructure is already built on the assumption that interrupts are slow.
> So quite large amounts of work are possible without having
> to resort to storing internal state at an interrupt.
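For illustration, the page-bounded unit of work described in the quoted text could be sketched roughly as below. This is only a software model of the idea, not how any real CPU implements string instructions; the 4 KiB page size and the names `bytes_to_page_end` and `work_units` are assumptions made up for the example:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

def bytes_to_page_end(addr):
    """Number of bytes from addr up to the end of its page."""
    return PAGE_SIZE - (addr % PAGE_SIZE)

def work_units(src, dst, count):
    """Split a forward copy into units that never cross a page boundary
    in either the source or the destination.

    Yields (src, dst, length) tuples. Inside one unit the bytes could be
    moved in any order; an interrupt would only be accepted between
    units, where the two pointers and the remaining count fully
    describe the state needed to restart.
    """
    while count > 0:
        n = min(count, bytes_to_page_end(src), bytes_to_page_end(dst))
        yield (src, dst, n)
        src += n
        dst += n
        count -= n

units = list(work_units(0x1F00, 0x3000, 0x300))
for s, d, n in units:
    print(f"src={s:#x} dst={d:#x} len={n:#x}")
```

In this example the source hits a page boundary after 0x100 bytes, so the 0x300-byte copy is split into two units. When source and destination are not aligned to each other, the units get shorter than the distance to the end of either page alone.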