By: Maynard Handley (name99.delete@this.name99.org), July 17, 2015 12:52 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on July 16, 2015 1:35 pm wrote:
> Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on July 16, 2015 1:55 am wrote:
> [snip interesting Xenon reordering]
> > The second data-point is the POWER8, from the user manual:
> >
> > 10.1.22 Store Queue and Store Forwarding
> >
> > The LSU contains a 40-entry store reorder queue (SRQ) that holds real addresses
> > and a 40-entry store data queue (SDQ) that holds a quadword of data.
> >
> > Stores are removed from the SRQ and SDQ and written to the cache in
> > program order after all the previous instructions are committed.
> >
> > So while the core itself is executing stores in any order internally they're written out to
> > the point of coherency in strict program order. Note that this is not dependent on the mode the
> > processor or memory page is on. So it seems that IBM has opted for a stronger ordering for POWER8
> > (not sure how it compares to x86 as I didn't bother to check how loads are handled).
> >
> > These are only two data-points but it suggests that the low-hanging fruit
> > offered by the weak ordering model is interesting only on simpler designs.
>
> With speculative execution, strict program order committing of store instructions avoids
> having to distinguish between speculative stores and non-speculative stores (i.e., no preceding
> exceptions or branch mispredictions that would make the store not commit).
>
> While true out-of-order commit (i.e., irreversible changing of state, not something like
> Adrian Cristal et al.'s "Out-of-Order Commit Processors", which uses checkpointing) is
> possible, the modest benefits presumably do not justify the increase in complexity.
I don't know. If you commit to this all-in (rather than just tweaking one variable) you can accumulate quite a few benefits (once you have the checkpoint machinery in place). You can also go down the value-speculation road (Seznec, on 2014-class machinery, sees about a 5% average win there, with up to around 30% on some workloads). Once you have value speculation you can execute a subset of instructions in-order, meaning you can get the performance of, say, a 6-wide machine with a 2-wide in-order ALU and a 4-wide OoO. [Which you can see as an extreme version of "handle moves or zeros at register rename time".]
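[To make "value speculation" concrete, here is a rough Python sketch of about the simplest scheme there is: a last-value predictor with a saturating confidence counter. This is NOT Seznec's predictor, just a toy to show the idea. The names LastValuePredictor and run, and the threshold of 3, are made up for illustration.]

```python
class LastValuePredictor:
    """Toy last-value predictor: guess that an instruction (keyed by PC)
    produces the same result as last time, once confidence is high enough."""

    def __init__(self, threshold=3, max_conf=3):
        self.threshold = threshold   # hypothetical confidence needed to predict
        self.max_conf = max_conf     # saturation point of the counter
        self.table = {}              # pc -> (last_value, confidence)

    def predict(self, pc):
        """Return a predicted value, or None when confidence is too low."""
        value, conf = self.table.get(pc, (None, 0))
        return value if conf >= self.threshold else None

    def train(self, pc, actual):
        """Update with the real result once the instruction has executed."""
        value, conf = self.table.get(pc, (None, 0))
        if value == actual:
            conf = min(conf + 1, self.max_conf)
        else:
            value, conf = actual, 0  # new value seen: start over
        self.table[pc] = (value, conf)


def run(trace, predictor):
    """Count correct / wrong / no-prediction outcomes over (pc, value) pairs."""
    stats = {"correct": 0, "wrong": 0, "none": 0}
    for pc, actual in trace:
        guess = predictor.predict(pc)
        if guess is None:
            stats["none"] += 1
        elif guess == actual:
            stats["correct"] += 1
        else:
            stats["wrong"] += 1      # a real core would roll back to a checkpoint here
        predictor.train(pc, actual)
    return stats
```

[The "wrong" case is where the checkpoint machinery earns its keep: a misprediction has to be undone, which is why the predictor only speaks up once its confidence counter saturates.]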
You can also (once you have checkpoint machinery) be much more aggressive about re-using registers and assuming average (rather than worst case) ability to pick up data off the bypass bus, so you can have a register file with fewer read/write ports and either fewer registers or the same number of registers and a longer ROB (or "ROB equivalent" given that you're now doing commit rather differently).
[All this is orthogonal to the question of how to order the stores. Maybe being allowed by the architecture to do so out-of-order makes the other pieces easier or allows them to be more aggressive? But I have to admit it does seem like you will have to have a queue in there somewhere to hold stores until you are ABSOLUTELY certain that every prior instruction is legit, at which point why not write to the uncore in-order?]
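[A rough Python sketch of the queue I have in mind, assuming stores get a slot in program order at dispatch, execute in any order, and only leave the queue once every earlier instruction is known good. This is a toy model, not how POWER8's SRQ/SDQ is actually built; StoreQueue and its method names are made up for illustration.]

```python
class StoreQueue:
    """Toy store queue: stores execute in any order but drain in program order."""

    def __init__(self):
        self.entries = {}      # seq -> (addr, data) once executed, None before
        self.order = []        # sequence numbers in program (allocation) order
        self.commit_seq = -1   # every instruction with seq <= this is known good
        self.written = []      # (seq, addr, data) in the order they reach the cache

    def allocate(self, seq):
        """Reserve a slot at dispatch; dispatch is assumed to be in program order."""
        self.order.append(seq)
        self.entries[seq] = None

    def execute(self, seq, addr, data):
        """Fill in address and data; this may happen in any order."""
        self.entries[seq] = (addr, data)

    def commit_up_to(self, seq):
        """Everything up to and including seq can no longer be squashed."""
        self.commit_seq = max(self.commit_seq, seq)
        self._drain()

    def squash_from(self, seq):
        """Drop speculative stores at or after seq (e.g. after a mispredict).
        Assumes seq is younger than anything already committed."""
        for s in [s for s in self.order if s >= seq]:
            del self.entries[s]
        self.order = [s for s in self.order if s < seq]

    def _drain(self):
        """Write committed stores out strictly from the head of the queue."""
        while self.order:
            head = self.order[0]
            # Stop at the first store that is still speculative or not yet executed.
            if head > self.commit_seq or self.entries[head] is None:
                break
            addr, data = self.entries.pop(head)
            self.order.pop(0)
            self.written.append((head, addr, data))
```

[Note that once the queue exists, draining from the head costs nothing extra, which is the "why not write to the uncore in-order?" point: the in-order write falls out of the structure you needed anyway to hold speculative stores.]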