By: Gabriele Svelto (gabriele.svelto.delete@this.gmail.com), July 16, 2015 1:55 am
Room: Moderated Discussions
I think an interesting data-point would be how store queues are implemented in weakly ordered processors vs x86, since that's one area where hardware designers might take advantage of the memory model for certain optimizations. I've got two data-points to throw into the discussion.
The first comes from Microsoft documentation regarding lock-free programming on the Xbox 360 CPU (aka Xenon) which is an in-order, very long pipeline PowerPC core. Here's an excerpt from MSDN:
Writes on Xbox 360 do not go directly to the L2 cache. Instead, in order to improve L2 cache write bandwidth, they go through store queues and then to store-gather buffers. The store-gather buffers allow 64-byte blocks to be written to the L2 cache in one operation. There are eight store-gather buffers, which allow efficient writing to several different areas of memory.
Even when the store-gather buffers are written to the L2 cache in strict FIFO order, this does not guarantee that individual writes are written to the L2 cache in order. For instance, imagine that the CPU writes to location 0x1000, then to location 0x2000, and then to location 0x1004. The first write allocates a store-gather buffer and puts it at the front of the queue. The second write allocates another store-gather buffer and puts it next in the queue. The third write adds its data to the first store-gather buffer, which remains at the front of the queue. Thus, the third write ends up going to the L2 cache before the second write.
So while this core doesn't have a sophisticated mechanism for reordering writes, what it does do is leverage the weak model to gather stores into cache-line-sized buffers and write them out one cache line at a time, so external observers may see the writes in a different order than the writer issued them.
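To make the MSDN example concrete (writes to 0x1000, 0x2000, then 0x1004), here is a rough toy model in Python. It is only a sketch under my own assumptions: the function name `l2_arrival_order` is made up, it ignores the limit of eight buffers, and it assumes every buffer is drained only after all the stores have been gathered. Real hardware will flush buffers on its own schedule.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per store-gather buffer, per the MSDN excerpt


def l2_arrival_order(store_addresses, line_size=LINE_SIZE):
    """Return the store addresses in the order they reach the L2 in this toy model.

    Hypothetical helper: one buffer per 64-byte block, buffers drained in
    FIFO order of allocation, no limit on the number of buffers.
    """
    # Maps block base address -> stores gathered into that buffer.
    # OrderedDict keeps allocation order, which stands in for the FIFO queue.
    buffers = OrderedDict()
    for addr in store_addresses:
        base = addr - (addr % line_size)
        # A store to an already-buffered block joins that buffer and does
        # not change the buffer's position in the queue.
        buffers.setdefault(base, []).append(addr)

    arrival = []
    for gathered in buffers.values():  # drain front of the queue first
        arrival.extend(gathered)
    return arrival


program_order = [0x1000, 0x2000, 0x1004]
print([hex(a) for a in l2_arrival_order(program_order)])
# prints ['0x1000', '0x1004', '0x2000']
```

The third store rides along with the first buffer, so it becomes visible before the second store even though the buffers themselves leave in FIFO order.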
(it's interesting to note that the Xenon had plenty of glass jaws, heavy-weight sync instructions and store-to-load forwarding from the store queue being two of the more infamous; it was a very fragile design performance-wise)
The second data-point is the POWER8, from the user manual:
10.1.22 Store Queue and Store Forwarding
The LSU contains a 40-entry store reorder queue (SRQ) that holds real addresses and a 40-entry store data queue (SDQ) that holds a quadword of data.
Stores are removed from the SRQ and SDQ and written to the cache in program order after all the previous instructions are committed.
So while the core itself executes stores in any order internally, they're written out to the point of coherency in strict program order. Note that this does not depend on the mode the processor or memory page is in. So it seems that IBM has opted for a stronger ordering for POWER8 (not sure how it compares to x86 as I didn't bother to check how loads are handled).
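For contrast with the Xenon case, here is a similarly rough sketch of the behaviour the POWER8 manual describes: stores may execute in any order, but entries leave the queue only from the head, and only once committed. The `StoreEntry` and `StoreQueue` names are hypothetical, the model does not separate the SRQ from the SDQ, and it ignores the 40-entry limit.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    addr: Optional[int] = None  # filled in when the store executes
    committed: bool = False     # set when the store commits


class StoreQueue:
    """Toy store queue: allocated in program order, drained from the head only."""

    def __init__(self):
        self._entries = deque()

    def allocate(self):
        # Called in program order, so queue position encodes program order.
        entry = StoreEntry()
        self._entries.append(entry)
        return entry

    def drain(self):
        # Write out stores from the head while the head is committed.
        # An uncommitted head blocks everything behind it.
        written = []
        while self._entries and self._entries[0].committed:
            written.append(self._entries.popleft().addr)
        return written


q = StoreQueue()
a, b, c = q.allocate(), q.allocate(), q.allocate()
c.addr = 0x1004      # the youngest store executes first
a.addr = 0x1000
a.committed = True
first = q.drain()    # only the oldest store can leave
b.addr = 0x2000
b.committed = True
c.committed = True
second = q.drain()
print([hex(x) for x in first], [hex(x) for x in second])
# prints ['0x1000'] ['0x2000', '0x1004']
```

Even though the store to 0x1004 executed first, it reaches the cache last, which is the opposite of what the store-gather buffers allow on the Xenon.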
These are only two data-points, but they suggest that the low-hanging fruit offered by the weak ordering model is interesting only on simpler designs.