By: --- (---.delete@this.redheron.com), September 20, 2021 7:11 am
Room: Moderated Discussions
Jörn Engel (joern.delete@this.purestorage.com) on September 19, 2021 8:46 pm wrote:
> Anon (no.delete@this.spam.com) on September 19, 2021 5:32 pm wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on September 19, 2021 4:46 pm wrote:
> > > > It's a night here now.
> > >
> > > So, I measured time, in microsecond, of summation of 8,000,000 L1D-resident 64-bit numbers
> > > (16,000 B buffer, summation repeated 4,000 times) at different alignments and using different
> > > access/arithmetic width. CPU - Skylake Client (Xeon E-2176G) downclocked to 4.25 GHz.
> > >
> > > Here are results:
> > > 8-byte (64b) accesses:
> > > 0 1064
> > > 1 1170
> > >
> > > 16-byte (128b) accesses:
> > > 0 483
> > > 1 701
> > >
> > > 32-byte (256b) accesses:
> > > 0 256
> > > 1 468
> > >
> > > Misalignment penalty [of streaming add]:
> > > 8-byte - 1.10x
> > > 16-byte - 1.45x
> > > 32-byte - 1.83x
>
> Thank you!
>
> > I think you should point out when the access cross a cache line or not.
>
> Interesting idea. With 8-byte access, roughly 12.5% of accesses should cross a cacheline. The ratio goes
> up with access size, as does the misalignment penalty. The numbers don't quite match up, but a lot of
> the measurements could be explained if performance were limited by the number of cachelines read.
>
> 1064 - 1.77 cachelines / cycle
> 1170 - 1.81 cachelines / cycle
> 483 - 1.95 cachelines / cycle
> 701 - 1.67 cachelines / cycle
> 256 - 1.83 cachelines / cycle
> 468 - 1.51 cachelines / cycle
>
> Not sure. I'll have to play around with the code a bit.
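(For what it's worth, those cachelines/cycle figures do hang together arithmetically. Here is a minimal sketch of the computation, under my assumptions about how they were derived: 4.25 GHz clock, 64B lines, 8,000,000 64-bit elements summed in total, each load counted as touching one line plus one more when it crosses a 64B boundary, and a crossing fraction of width/64 for the offset-1 runs.)

#include <stdio.h>

int main(void) {
    const double cycles_per_us = 4250.0;   /* 4.25 GHz */
    const double elems = 8e6;              /* 64-bit elements summed in total */
    const double runs[3][3] = {            /* {access width in bytes, aligned us, offset-1 us} */
        { 8.0, 1064.0, 1170.0},
        {16.0,  483.0,  701.0},
        {32.0,  256.0,  468.0},
    };
    for (int i = 0; i < 3; i++) {
        double width = runs[i][0];
        double loads = elems * 8.0 / width;   /* loads per measurement */
        double cross = width / 64.0;          /* fraction of offset-1 loads that split a 64B line */
        printf("%2.0fB: %.2f aligned, %.2f misaligned cachelines/cycle\n",
               width,
               loads / (runs[i][1] * cycles_per_us),
               loads * (1.0 + cross) / (runs[i][2] * cycles_per_us));
    }
    return 0;
}

This reproduces the numbers above to within rounding (1.77/1.81, 1.95/1.68, 1.84/1.51).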
As always, true understanding requires knowing exactly what you want to test.
One dimension is crossing "units", from bank (probably 8B aligned with Intel) to cache line to page.
A second dimension is latency vs throughput. Latency may remain the same while throughput is halved (ie a mis-aligned load uses two "units" of some constrained resource). Of course the extent to which this affects real code depends on how aggressively you are approaching load bandwidth.
If you really want to go all in, you then have to cover the issue of load/store dependencies, and what happens when those dependencies are partial...
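By way of illustration, here is a minimal sketch (my code, nothing to do with Michael's actual test) of the kind of sweep I mean, and it only probes the throughput dimension. The point it makes is that with sequential stride-8 loads, any offset that isn't a multiple of 8 gives you a mixture: every load crosses an 8B boundary, 1/8 of them cross a 64B line, and 1/512 cross a 4KB page (assuming those sizes). Isolating just one of those, measuring latency rather than throughput, or probing store-to-load forwarding each needs its own loop.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum n 64-bit values starting at buf+offset; memcpy keeps the unaligned
 * access legal in portable C and compiles to a plain (possibly unaligned) load. */
static uint64_t sum_at(const uint8_t *buf, size_t offset, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v;
        memcpy(&v, buf + offset + 8 * i, sizeof v);
        s += v;
    }
    return s;
}

int main(void) {
    enum { N = 2000, REPS = 4000, BUF = 16384 };   /* ~16,000B working set, as above */
    uint8_t *buf = aligned_alloc(4096, BUF);       /* page-aligned so offsets mean what we think */
    if (!buf) return 1;
    memset(buf, 1, BUF);

    for (size_t off = 0; off < 8; off++) {
        /* Time this inner loop externally (clock_gettime, rdtsc, performance counters...). */
        uint64_t total = 0;
        for (int r = 0; r < REPS; r++)
            total += sum_at(buf, off, N);
        printf("offset %zu: checksum %llu\n", off, (unsigned long long)total);
    }
    free(buf);
    return 0;
}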
----------------
It’s natural to believe that the basic unit of a cache is the cache line (say 64B) and that reading from that is like reading from an array. But in fact most designs split a cache line into multiple smaller pieces that are somewhat independently addressable, either for power reasons or for banking reasons (ie to allow multiple independent accesses into the same line in a cycle).
Intel, up till (I think) Sandy Bridge, banked by 8B units (ie imagine the stack of cache lines, 64B wide; split them at 8B boundaries, and those 8 sub-arrays form 8 independently addressable banks). This was subject to frequent (ie noticeable) bank collisions (two accesses want to hit the same bank, perhaps in the same row, perhaps in different rows). This was somehow improved in later generations, though apparently it’s still based on 8B-wide units, and the details remain uncertain/unknown.
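To make that concrete, here is a toy model of the addressing (my illustration of the general idea, not Intel's actual logic; the function names are mine): bits [5:3] of the address pick one of the eight 8B banks, and two same-cycle accesses collide when they want the same bank for different 8B chunks.

#include <stdint.h>
#include <stdbool.h>

/* Toy model of an 8B-banked 64B-line data cache (illustration only). */
static unsigned bank_of(uint64_t addr)  { return (addr >> 3) & 0x7; }   /* which of the 8 sub-arrays */
static uint64_t chunk_of(uint64_t addr) { return addr >> 3; }           /* which 8B chunk overall    */

/* Two accesses issued in the same cycle collide when they need the same bank
 * for different chunks; whether same-chunk accesses can be merged is one of
 * the design details that remains murky. */
static bool bank_conflict(uint64_t a, uint64_t b) {
    return bank_of(a) == bank_of(b) && chunk_of(a) != chunk_of(b);
}

In this model, loads at 0x10 and 0x50 both want bank 2 but different chunks, so they collide; loads at 0x10 and 0x18 want different banks and can proceed together.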
It’s not clear, in such a design, that a load that crosses two banks would automatically cost no extra cycles (perhaps the logic to shift and stitch two loads together might require an extra cycle?).
This is even more true in really old (by our standards) designs that can sustain only one access (or perhaps one load and one store) per cycle, and so may have to run the load twice to get the two pieces before stitching them.
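Functionally, the shift-and-stitch is trivial; the open question is only whether the byte muxing fits into the load's normal cycle budget. A sketch of the functional behaviour (assuming a little-endian layout; hardware does this with muxes on the read path, not shifts in an ALU):

#include <stdint.h>
#include <string.h>

/* Functional model of a misaligned 8-byte load built from the two aligned
 * 8B chunks it straddles. */
static uint64_t load8_stitched(const uint8_t *mem, uint64_t addr) {
    uint64_t lo, hi;
    unsigned shift = (unsigned)(addr & 7) * 8;     /* bit offset of the data within the low chunk */
    memcpy(&lo, mem + (addr & ~7ULL), 8);          /* aligned read, low chunk */
    if (shift == 0)
        return lo;                                 /* aligned after all: one chunk suffices */
    memcpy(&hi, mem + (addr & ~7ULL) + 8, 8);      /* aligned read, high chunk */
    return (lo >> shift) | (hi << (64 - shift));   /* little-endian stitch */
}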
In the case of Apple (this is probably just as true of other manufacturers), we can see from the patent record how they do this.
Apple has made two attempts.
The first is from 2005: https://patents.google.com/patent/US8117404B2.
Notable features are:
– use of a predictor to track loads or stores that will cross cache line boundaries
– if the predictor fires, crack the instruction into three parts: load low, load high, and stitch (a sketch of such a predictor follows below).
In other words:
– already at this stage (2005) the problem is crossing a line; staying within a line is not a problem (we’ll see why)
– crossing a line is handled slowly, but by reusing existing CPU functionality.
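My sketch of what such a predictor might look like (the patent describes the idea; this particular structure and the names are my guess): a small PC-indexed table of saturating counters, consulted early enough that the instruction can be cracked when the predictor fires.

#include <stdint.h>
#include <stdbool.h>

#define PRED_ENTRIES 256

static uint8_t cross_ctr[PRED_ENTRIES];            /* 2-bit saturating counters, all start at 0 */

static unsigned pred_idx(uint64_t pc) { return (unsigned)(pc >> 2) & (PRED_ENTRIES - 1); }

/* Consulted at decode: if it fires, crack into load-low / load-high / stitch;
 * if it doesn't fire and the access does cross, pay a replay and retrain. */
static bool predict_cross(uint64_t pc) {
    return cross_ctr[pred_idx(pc)] >= 2;
}

static void train_cross(uint64_t pc, bool did_cross) {
    uint8_t *c = &cross_ctr[pred_idx(pc)];
    if (did_cross) { if (*c < 3) (*c)++; }
    else           { if (*c > 0) (*c)--; }
}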
The 2011 patent, https://patents.google.com/patent/US9131899B2, is much slicker. Here the important parts are:
– the cache is defined as two banks, one holding even lines, one holding odd lines. Regardless of their interior sub-banking, these two banks can be accessed in parallel.
– the array holding addresses for the store queue has a structure that can also describe both an even line and an odd line, with the store data held in a separate array.
These mean that:
– a load that straddles lines (in the usual case, with both lines in cache) can access both banks in parallel, pick up the data, and stitch it within the cycle time, ie you don’t even pay a latency penalty for crossing cache lines (see the sketch below)
– a load that hits in the store queue can, likewise, do the equivalent of probing for address matches in the even and odd halves of line-address space, to detect the presence of matching non-aligned stores
– there is also a piece of auxiliary storage (the “sidecar”) to hold parts of the load for the true horror-story cases, like part of the load coming from a line in cache and part from DRAM, or a short store that crosses a cache-line boundary sitting within a larger load that also crosses a cache-line boundary (so some data comes from the store queue, some from the L1D).
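My functional reading of the even/odd-bank idea (a sketch of the concept, not of the patent's circuits; names are mine): bit 6 of the address, ie the low bit of the line number, says which bank a line lives in, so the two lines touched by a straddling access always land in opposite banks and can be read in the same cycle, then stitched as before.

#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES 64

/* Which of the two banks (even lines vs odd lines) does an access of
 * 'size' bytes at 'addr' need? Adjacent lines always sit in opposite
 * banks, so a line-straddling access never needs the same bank twice
 * and both halves can be read in parallel. */
static void banks_needed(uint64_t addr, unsigned size, bool *need_even, bool *need_odd) {
    uint64_t first_line = addr / LINE_BYTES;
    uint64_t last_line  = (addr + size - 1) / LINE_BYTES;
    *need_even = (first_line % 2 == 0) || (last_line % 2 == 0);
    *need_odd  = (first_line % 2 == 1) || (last_line % 2 == 1);
}

An 8-byte load at line offset 60, for instance, needs both banks; an aligned one needs only the bank its line sits in.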
So almost all cases pay no latency cost, though they will pay a bandwidth cost (since the loads will use two “units” of the limited bandwidth to the cache). I believe that a transaction that crosses a page will pay a one-cycle latency cost, because the TLB will have to be hit twice, but I have not validated that.
The reason I started (for Apple) with cache line crossings is that the Apple sub-arrays within the two banks are remarkably narrow!
A 2014 patent, https://patents.google.com/patent/US9448936B2, describes them as being one byte wide. This means (among other things) that stores can happen concurrently with loads to the same line, as long as they touch different bytes, and that is the focus of the patent.
As far as I can tell experimentally (though the full picture still remains murky) this is true as of the M1, except that the minimal width is two bytes rather than one.
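A sketch of why that width matters (my illustration, assuming a 64B line and per-sub-array enables; on the M1 the granule would apparently be two bytes rather than one): each access activates only the sub-arrays under its byte mask, so a store and a load to the same line can share a cycle whenever the masks don't overlap.

#include <stdint.h>
#include <stdbool.h>

/* Byte-enable mask within a 64B line for an access at 'addr' of 'size'
 * bytes (assumed not to cross the line). */
static uint64_t byte_mask(uint64_t addr, unsigned size) {
    uint64_t mask = (size >= 64) ? ~0ULL : ((1ULL << size) - 1);
    return mask << (addr & 63);
}

/* A same-line store and load can share a cycle when their byte masks are
 * disjoint, because they enable different (byte-wide) sub-arrays. */
static bool can_share_cycle(uint64_t load_addr, unsigned load_size,
                            uint64_t store_addr, unsigned store_size) {
    if ((load_addr >> 6) != (store_addr >> 6))
        return true;            /* different lines: not the case this sketch is about */
    return (byte_mask(load_addr, load_size) &
            byte_mask(store_addr, store_size)) == 0;
}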
In other words, whereas the Intel history was something like
– extract all the data from an 8B unit — that will cover an aligned 8B load and many misaligned shorter loads, then
– expand that to, in one cycle, extract from two 8B units and stitch
the Apple history was more like
– from the beginning, figure out which sub-banks (ie which bytes within a line) to activate and
– how to route the bytes collected from those sub-banks to the load/store unit
What seems like a reasonable compromise, or an easy extension, in each case depends on your starting point — in this case “aggregating bytes” vs “de-aggregating oct-bytes”.