By: hobold (hobold.delete@this.vectorizer.org), November 13, 2012 5:13 pm
Room: Moderated Discussions
Eric (eric.kjellen.delete@this.gmail.com) on November 13, 2012 4:10 pm wrote:
[...]
> I seem to recall that a scatter instruction was included in LRBni (and
> this article that I found from Michael Abrash makes mention of one); is there any particular reason
> why it could not be included in AVX2 or could not foreseeably be added in future iterations?
I would guess that in the context of a fully cache coherent manycore machine, when you try to optimize the scatter operation to store more than a single vector element at a time, you run into problems. Memory transactions probably don't make that any easier.
For example, if a scatter operation has to be aborted and restarted for some reason, does it (semantically) execute all or nothing? Or can it be in a partly completed state? Larrabee made partly completed state information architecturally visible (in a mandatory boolean mask register), but did not support transactional memory. As far as I know, AVX* does not expose such internal state.
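To make the "partly completed state" idea concrete, here is a minimal scalar sketch in C of a scatter whose progress lives in a mask: each lane clears its mask bit once its store is done, so an interrupted scatter can be resumed from the remaining bits. This is only a model under stated assumptions, not actual LRBni or AVX semantics. The names `scatter_step`, `scatter_all`, `LANES`, and the `budget` parameter (which simulates an interruption after a few stores) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* assumed vector width for this model */

/* Store the lanes whose mask bit is set, clearing each bit as its store
   completes. `budget` caps the number of stores to simulate an interruption.
   Returns the mask of lanes that are still pending. */
static uint16_t scatter_step(float *base, const int32_t idx[LANES],
                             const float val[LANES], uint16_t mask, int budget)
{
    for (int lane = 0; lane < LANES && budget > 0; ++lane) {
        uint16_t bit = (uint16_t)(1u << lane);
        if (mask & bit) {
            base[idx[lane]] = val[lane];   /* one element store */
            mask &= (uint16_t)~bit;        /* progress becomes visible state */
            --budget;
        }
    }
    return mask;
}

/* Software retry loop: re-issue the scatter until no lane is pending. */
static void scatter_all(float *base, const int32_t idx[LANES],
                        const float val[LANES], uint16_t mask, int budget)
{
    assert(budget > 0);  /* otherwise the loop could never make progress */
    while (mask != 0)
        mask = scatter_step(base, idx, val, mask, budget);
}
```

Because the pending lanes are explicit, a restart never repeats a store that already completed. Without such visible state, the hardware would have to either make the whole scatter appear atomic or hide the bookkeeping somewhere else.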
Or when two "simultaneous" scatter operations from two different cores fight over overlapping memory addresses, and then one or both operations have to be undone and later rerun ... is it even possible to decide on a consistent specification of what the memory contents ought to be?
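The overlap problem can be illustrated without real threads. The sketch below, in C, replays two scatters element by element under a fixed schedule; everything in it (`scatter_t`, `scatter_one`, `run_schedule`, the schedule string of `'a'`/`'b'` characters) is a hypothetical model, and it assumes each element store is individually atomic but the scatter as a whole is not.

```c
#include <assert.h>
#include <stddef.h>

#define N 2  /* elements per scatter in this toy model */

typedef struct {
    const int *idx;  /* target indices */
    const int *val;  /* values to store */
    size_t next;     /* next element still to be stored */
} scatter_t;

/* Perform one pending element store; returns 0 if the scatter is finished. */
static int scatter_one(int *mem, scatter_t *s)
{
    if (s->next >= N)
        return 0;
    mem[s->idx[s->next]] = s->val[s->next];
    s->next++;
    return 1;
}

/* Interleave two scatters: each 'a' or 'b' in `schedule` lets that
   scatter perform one element store. */
static void run_schedule(int *mem, const char *schedule, scatter_t a, scatter_t b)
{
    for (const char *p = schedule; *p; ++p) {
        assert(*p == 'a' || *p == 'b');
        scatter_one(mem, *p == 'a' ? &a : &b);
    }
}
```

If both scatters target indices {0, 1}, with A storing 1 and B storing 2, the schedule "aabb" leaves {2, 2} (A entirely before B), while "abba" leaves {2, 1}. The second result matches neither serial order, which is exactly the kind of outcome a specification would have to either allow or rule out.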
Does the coherency protocol support groups of in-flight memory accesses that are semantically related to one another? With respect to one or more memory transactions?
I could be blowing this issue out of proportion due to personal cluelessness. But it does seem rather complicated to me.



