By: Patrick Chase (patrickjchase.delete@this.gmail.com), August 25, 2014 10:02 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on August 24, 2014 7:21 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on August 24, 2014 11:11 am wrote:
> I will give him more credit than that! I attribute it to poor documentation. Without existing
> experience, it is difficult to understand section 8.2. If you also miss the clarification in the
> mfence instruction's description, which is in a different part of the document and uses different wording, it's easy
> to be confused.
*very* true. I recall spending hours on that section the first time I tried to fully understand x86 ordering semantics. Sorry, Michael!
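For anyone who hasn't waded through that section yet, here's a minimal C11 sketch (the names and variables are mine, not anything from the manual) of the reordering that 8.2 famously permits on ordinary write-back memory: each thread's store can sit in its store buffer past the later load to the other location, so without the fences both r1 and r2 can come out 0. atomic_thread_fence(memory_order_seq_cst) is what compilers lower to mfence (or a locked RMW) on x86, which is exactly the clarification buried in the mfence description.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x = 0, y = 0;
int r1, r2;

void *t0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* mfence on x86; drop it and r1 == r2 == 0 becomes possible */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r1=%d r2=%d\n", r1, r2);  /* with the fences, (0,0) cannot be observed */
    return 0;
}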
> > One other remark: Keep in mind that a speculative core has to "hold" all stores in local
> > buffers until the corresponding uop retires. If loading from the same address did indeed
> > impose visibility/ordering constraints on the store, then that would require the OoO
> > backend to be flushed up to at least the store. In other words, it would have basically
> > the same cost as a fence.
>
> Very true, although you could also have a non-speculative local store buffer after instruction
> completion. Not sure if anybody actually does that.
I intended my comment more as a statement of the minimum cost (i.e. in an OoO engine you must *at least* flush the backend up to the store) than a comprehensive list.
I've seen write-combining implementations that work as you describe, in speculative cores with a read-allocate L1 D$. The OoO pipeline retires stores in-order, so if you want meaningful coalescing of write-only lines it generally pays to do it on the non-speculative side. Needless to say, in such a core you would also have to flush those buffers before the load can proceed.
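To make the same-address case concrete, a single-threaded sketch (slot and observe are invented names; volatile is only there so the compiler doesn't fold the pair away): the load right after the store is satisfied by store-to-load forwarding out of the store buffer, so nothing about it forces the store to become globally visible or the backend to flush.

#include <stdio.h>

volatile int slot;

int observe(int v) {
    slot = v;      /* the store sits in the store buffer until its uop retires */
    return slot;   /* the same-address load is forwarded from that buffered store;
                      no backend flush and no fence semantics are implied */
}

int main(void) {
    printf("%d\n", observe(42));
    return 0;
}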
> But you're right that the ordering cost
> of RAW could defeat most of the benefit of store forwarding on a deep OoO pipeline.
That's going to be particularly true in an ISA with a limited # of architectural regs, as stack spills/fills create RAW hazards galore. We were talking about x86 here, right? :-)
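A hedged illustration of what I mean (the function and names are invented for this sketch): on 32-bit x86 a loop like this has more live values than architectural registers, so the compiler typically spills some of them to the stack, and every reload of a spilled value is a memory RAW hazard that the store-forwarding path has to absorb.

#include <stdio.h>

/* Four accumulators plus two pointers, the counter and the bound easily
   exceed the eight 32-bit x86 GPRs, so some of these typically live on
   the stack: a spill store followed shortly by its reload, i.e. exactly
   the store->load RAW pairs being discussed. */
static long dot4(const long *a, const long *b, long n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void) {
    long a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    long b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    printf("%ld\n", dot4(a, b, 8));
    return 0;
}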