By: David Kanter (dkanter.delete@this.realworldtech.com), July 16, 2015 7:34 am
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on July 16, 2015 1:55 am wrote:
> I think an interesting data-point would be to check how store-queues are implemented in weakly
> ordered processors vs x86 as that's one area where hardware designers might take advantage of the
> memory model for certain optimizations. I've got two data-points to throw in the discussion.
>
> The first comes from Microsoft documentation regarding lock-free programming on the Xbox 360 CPU (aka
> Xenon) which is an in-order, very long pipeline PowerPC core. Here's an excerpt from MSDN:
>
> > Writes on Xbox 360 do not go directly to the L2 cache. Instead, in order to improve L2 cache
> > write bandwidth, they go through store queues and then to store-gather buffers. The store-gather
> > buffers allow 64-byte blocks to be written to the L2 cache in one operation. There are eight
> > store-gather buffers, which allow efficient writing to several different areas of memory.
> >
> > Even when the store-gather buffers are written to the L2 cache in strict FIFO order, this does not guarantee
> > that individual writes are written to the L2 cache in order. For instance, imagine that the CPU writes to location
> > 0x1000, then to location 0x2000, and then to location 0x1004. The first write allocates a store-gather buffer
> > and puts it at the front of the queue. The second write allocates another store-gather buffer and puts it next
> > in the queue. The third write adds its data to the first store-gather buffer, which remains at the front of
> > the queue. Thus, the third write ends up going to the L2 cache before the second write.
>
> So while this core doesn't have a sophisticated mechanism for reordering writes, what it does do is leverage
> the weak model to be able to gather stores into cache-line sized buffers and write them out one cacheline at
> a time, thus making external observers see the writes in a potentially different order than the writer.
>
> (it's interesting to note that the Xenon had plenty of glass jaws, with heavy-weight
> sync instructions and store-to-load forwarding from the store queue being two
> of the more infamous; it was a very fragile design performance-wise)
>
> The second data-point is the POWER8, from the user manual:
>
> > 10.1.22 Store Queue and Store Forwarding
> >
> > The LSU contains a 40-entry store reorder queue (SRQ) that holds real addresses
> > and a 40-entry store data queue (SDQ) that holds a quadword of data.
> >
> > Stores are removed from the SRQ and SDQ and written to the cache in
> > program order after all the previous instructions are committed.
>
> So while the core itself is executing stores in any order internally, they're written out to
> the point of coherency in strict program order. Note that this is not dependent on the mode the
> processor or memory page is in. So it seems that IBM has opted for a stronger ordering for POWER8
> (not sure how it compares to x86 as I didn't bother to check how loads are handled).
>
> These are only two data-points, but they suggest that the low-hanging fruit
> offered by the weak ordering model is interesting only on simpler designs.
I don't mean to undermine a point that I made, but I don't think this data necessarily supports the argument.
One of the design goals of the POWER8 was a compatibility mode that has the same memory ordering and endianness as x86. This was requested specifically by Google, to enable them to port their software. So the fact that IBM has a structure very similar to Intel's store buffering is not surprising.
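To make the quoted POWER8 behavior concrete, here is a rough toy model of a store queue that drains in program order. It is only a sketch of the manual text quoted above, not a description of the actual hardware: the function name, the per-store sequence number `seq`, and the `last_committed` cutoff are all hypothetical, and the "after all the previous instructions are committed" condition is simplified to a single commit point.

```python
def drain_store_queue(executed, last_committed):
    """Toy model of a store queue that drains in program order.

    `executed` is a list of (seq, address) pairs in the order the core
    executed the stores; `seq` is a hypothetical program-order sequence
    number. Only stores with seq <= last_committed are eligible to drain.
    Returns (written, still_queued), both in program order.
    """
    in_program_order = sorted(executed)  # order by seq, not by execution time
    written = [addr for seq, addr in in_program_order if seq <= last_committed]
    queued = [addr for seq, addr in in_program_order if seq > last_committed]
    return written, queued


# Program order is 0x1000, 0x2000, 0x1004, but the core executed them shuffled.
executed = [(2, 0x1004), (0, 0x1000), (1, 0x2000)]
written, queued = drain_store_queue(executed, last_committed=1)
print([hex(a) for a in written])  # ['0x1000', '0x2000']
print([hex(a) for a in queued])   # ['0x1004']
```

However the stores were executed internally, an outside observer of this model only ever sees them in program order.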
Moreover, the POWER series has always had a write-through L1D cache. That makes cache-line-level buffering, which coalesces writes together, even more important than it is for Intel's write-back L1D (which buffers at the address level).
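For anyone who wants to play with the store-gather example from the MSDN excerpt (writes to 0x1000, 0x2000, then 0x1004), here is a minimal sketch of line-granular gathering. The 64-byte buffer size and FIFO drain come from the excerpt; everything else is an assumption for illustration: the function name is made up, the limit of eight buffers is ignored, and buffers are assumed to drain only after all stores have been gathered.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per store-gather buffer, per the MSDN excerpt


def gather_and_drain(stores, line_size=LINE_SIZE):
    """Toy model: return the order in which stores become visible at the L2.

    `stores` is a list of addresses in program order. A store to a new line
    allocates a buffer at the back of a FIFO; a store to a line that already
    has a buffer merges into it. Buffers are then drained strictly FIFO.
    """
    buffers = OrderedDict()  # line base address -> addresses gathered so far
    for addr in stores:
        line = addr - (addr % line_size)
        buffers.setdefault(line, []).append(addr)
    # Drain the oldest buffer first; each buffer goes to the L2 as one block.
    visible = []
    for gathered in buffers.values():
        visible.extend(gathered)
    return visible


program_order = [0x1000, 0x2000, 0x1004]
l2_order = gather_and_drain(program_order)
print([hex(a) for a in l2_order])  # ['0x1000', '0x1004', '0x2000']
```

The buffers themselves leave in strict FIFO order, yet the third store still overtakes the second, because it merged into the buffer at the front of the queue.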
Does anyone know how store buffering was handled in POWER4, POWER5, and POWER7?
David