By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), July 16, 2015 9:33 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on July 15, 2015 8:29 am wrote:
> dmcq (dmcq.delete@this.fano.co.uk) on July 15, 2015 8:04 am wrote:
> > NoSpammer (no.delete@this.spam.com) on July 15, 2015 7:05 am wrote:
[snip]
> > > As for string instructions (rep stos/rep movs) I've relied on the
> > > observation that if you see the first bytes correctly and the last bytes correctly, you also probably
> > > see everything inbetween. Running CRC checks on such messages for about 3 years has shown no errors
> > > so I'm confident this works (why would the CPU push out string writes in random order).
> >
> > I doubt a CPU would want to push it out in random order, but
> > consider the problem in general with weak consistency.
> > If there are two memory controllers and the string occupies
> > three lines then the second memory controller might
> > be satisfying some requests by another CPU delaying the write of the middle section of the string.
> >
>
> As long as we are talking about ordering in WB/WT regions, order of arrival to external memory does
> not matter. What matters is *observed* order governed by cache lines ownership. And the later, in
> case of x86 strings instructions, can't deviate too far from the continuous, because on x86 strings
> instructions are interruptible/restartable and the only saved states are direction, two pointers and
> counter. With such minimal state saved only continuous operation can be correctly restarted.
Does the state need to be architectural? For example, is there a guarantee that the count on interrupt indicates the stopping point? Microarchitectural storage could perhaps be used to track what has been done (perhaps even offloading the copying to a DMA engine on an interrupt or thread switch), with a count of -1 serving as a magic value that triggers a check of this storage for a partial operation. The overhead of such a check would be tiny for an actual maximum-sized operation and might be acceptable if it allows faster progress in the common case of no interruption.
While it is easy to imagine complex schemes to handle string operations (even "completing" a copy before the source has been fully read, using shared coherence state marking for regions, possibly even with versioned memory), it is more difficult to imagine current uses that would justify extreme complexity for handling string operations.
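
To make the restart argument quoted above concrete, here is a minimal C sketch of a string copy whose only saved state is the direction, two pointers, and a counter. It is a toy model under stated assumptions, not a description of how any real processor implements rep movs: the struct, the function names, and the "budget" standing in for an interrupt are all hypothetical, and the pointers are modeled as offsets into a flat byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the architectural state of a byte-wise string
   copy: two pointers (here offsets into mem), a count, and a direction. */
struct rep_movs_state {
    size_t si;  /* models the source pointer */
    size_t di;  /* models the destination pointer */
    size_t cx;  /* models the remaining count */
    int df;     /* models the direction flag: 0 = ascending, 1 = descending */
};

/* Copy at most `budget` bytes, then stop as if an interrupt had arrived.
   Returns 1 when the whole operation is complete, 0 if it must be
   restarted. The state always describes exactly the contiguous part
   that is still to be done, which is what makes restarting safe. */
static int rep_movsb_run(uint8_t *mem, struct rep_movs_state *s, size_t budget)
{
    /* Unsigned wraparound makes adding (size_t)-1 a decrement by one. */
    size_t step = s->df ? (size_t)-1 : 1;

    while (s->cx != 0 && budget != 0) {
        mem[s->di] = mem[s->si];
        s->si += step;
        s->di += step;
        s->cx--;
        budget--;
    }
    return s->cx == 0;
}

/* Copies 20 bytes with an "interrupt" after every 7 bytes.
   Returns 1 if the copy was restarted twice and finished correctly. */
int demo_interrupted_copy(void)
{
    uint8_t mem[64];
    for (size_t i = 0; i < sizeof mem; i++)
        mem[i] = (uint8_t)i;

    struct rep_movs_state s = { .si = 0, .di = 32, .cx = 20, .df = 0 };
    int interrupts = 0;
    while (!rep_movsb_run(mem, &s, 7))
        interrupts++;  /* the saved state alone is enough to resume */

    assert(s.cx == 0 && s.si == 20 && s.di == 52);
    return interrupts == 2 && memcmp(mem, mem + 32, 20) == 0;
}
```

In this model, completing bytes out of order would leave no way to express "what remains" in the saved state. That is the constraint the question about microarchitectural storage is probing: extra hidden state would be the only way around it.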