By: dmcq (dmcq.delete@this.fano.co.uk), August 30, 2014 2:15 am
Room: Moderated Discussions
Patrick Chase (patrickjchase.delete@this.gmail.com) on August 25, 2014 11:02 am wrote:
> anon (anon.delete@this.anon.com) on August 24, 2014 7:21 pm wrote:
> > Patrick Chase (patrickjchase.delete@this.gmail.com) on August 24, 2014 11:11 am wrote:
> > I will give him more credit than that! I attribute it to poor documentation. Without existing
> > experience, it is difficult to understand section 8.2. If you also miss the clarification under the
> > mfence instruction in a different part of the document, which uses different wording, it's easy
> > to be confused.
>
> *very* true. I recall spending hours on that section the first time
> I tried to fully understand x86 ordering semantics. Sorry, Michael!
>
> > > One other remark: Keep in mind that a speculative core has to "hold" all stores in local
> > > buffers until the corresponding uop retires. If loading from the same address did indeed
> > > impose visibility/ordering constraints on the store then that would have required the OoO
> > > backend to be flushed up to at least the store. In other words, it would have basically
> > > the same cost as a fence.
> >
> > Very true, although you could also have a non-speculative local store buffer after instruction
> > completion. Not sure if anybody actually does that.
>
> I intended my comment more as a statement of the minimum cost (i.e. in an OoO engine
> you must *at least* flush the backend up to the store) than a comprehensive list.
>
> I've seen write-combining implementations that work as you describe in speculative cores with
> read-allocate L1 D$. The OoO pipeline retires stores in-order, so if you want meaningful coalescing
> of write-only lines it generally pays to do it in the non-speculative side. Needless to say
> in such a core you would also have to flush those before the load can proceed.
>
> > But you're right that the ordering cost
> > of RAW could defeat most of the benefit of store forwarding on a deep OoO pipeline.
>
> That's going to be particularly true in an ISA with a limited # of architectural regs, as
> stack spills/fills create RAW hazards galore. We were talking about x86 here, right? :-)
>
AMD presented slides a few years ago at AFDS/Fusion 11 about an open platform and heterogeneous computing, and ARM contributed slides about complexity, abstraction and standards as far as the memory model is concerned. The basic message was that without a simple model developers can reason about, developers will ignore the facilities, or misuse them and make mistakes, regardless of how good the implementation is.
My own thought is that perhaps in the future we could treat locks as being implemented by a dedicated lock chip, which would handle all the bookkeeping: knowing which locks are in use and by which processes, and helping with problems like a high-priority task waiting on a low-priority one, or tasks crashing while holding a lock. Sending an increment task to a cache or memory is just a very simple form of that.