By: --- (---.delete@this.redheron.com), November 10, 2021 6:43 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on November 10, 2021 9:26 am wrote:
> Thanks for the feedback. The bandwidth stuff is coming together nicely, almost done, and soon I'll have
> latency (where again I'll validate against your graphs, but again details and goals will differ so sure,
> one should not expect that our results are identical; just that they show the same sort of thing).
Andrei, I discovered something interesting that you may want to check out within your code!
If you've been reading along my evolving explanations of what's going on, the following might occur to you:
In a basic reduction (so a stream of loads), think something like
for (i = 0; i < N; i++) { sum += array[i]; }
The address stream being presented to the CPU is a linear stream.
In the LSU this is split into two sub-streams (directed at even and odd cache lines), but those streams are still sequential. So even though we now understand that the L1 conceptually has three ways to extract data from it, using all three requires feeding the L1 addresses from three distinct lines. Right now we only exercise that third path at the end of a hot line, where one load comes from the hot line and the next load in the stream goes to the cache proper.
What if, instead, we generate a stream of somewhat interleaved addresses so that, on almost every cycle, of the three addresses presented to the cache one hits in the hot line and two hit in the cache proper? Would such a scheme work?
Oh yes it would! In a perfect world this could give us a full 48B loaded per cycle. We can't hit exactly that, but we do get as high as 44B/cycle, 140GB/s!
So essentially change any reduction code to something like
for (i = 0; i < N; i++) { sumA += a[i]; sumB += b[i]; }
where a and b are widely separated arrays. In a real reduction, the obvious implementation would be to split the array in half, set b[] to the halfway point, and perform the final sumA+sumB (or whatever) after the loop. Very easy code-mod, gets you an extra 40% performance.
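To make that concrete, here's a minimal sketch of the split-reduction code-mod as I mean it (my own illustration; the function name, element type, and odd-length handling are mine, not anything from Andrei's or Apple's code):

#include <cstddef>

// Sketch only: sum an array by walking its two halves in one interleaved
// loop, so the cache sees addresses from two widely separated lines each
// iteration rather than one sequential stream.
double split_sum(const double *array, std::size_t n)
{
    std::size_t half = n / 2;
    const double *a = array;          // lower half
    const double *b = array + half;   // upper half, the "widely separated" stream
    double sumA = 0.0, sumB = 0.0;
    for (std::size_t i = 0; i < half; ++i) {
        sumA += a[i];
        sumB += b[i];
    }
    if (n & 1)                        // pick up an odd leftover element
        sumA += array[n - 1];
    return sumA + sumB;               // final combine after the loop
}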
There remains some weirdness in that the effect is only present for some fraction of the cache. I'm not sure quite what the gating factor is, but in my particular code I see it up to a test depth of 32..64kB and not beyond that. Still haven't figured that out.
Do you see something like this on M1 or M1P if you mod your code like this? I suspect it's very much an Apple-specific thing, that it will buy you nothing anywhere else. I don't know if you still have access to an M1 or M1M to test, but you at least want to add a mod like this to your code base.
This also seems sufficiently obscure (though remarkably powerful) that even most of Apple seems unaware of it. In particular, I would expect that the STL semantics for reduction operations (like accumulate(), which I used in one of my tests) allow arbitrary re-ordering of the operations [and one can argue that -ffast-math should allow it even for a basic for loop!], so you would hope that Apple's STL would split the reduction into two half-arrays as I described, and [more ambitiously] that the compiler would split the loop in two and run the upper and lower halves in parallel. But no such luck.
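For what it's worth, the transformation I'd hope the library (or the compiler under -ffast-math) would perform looks roughly like this; a sketch of my own, assuming random-access iterators and a value type where re-association is acceptable, not anything Apple actually ships:

#include <iterator>

// Sketch: an accumulate-style reduction rewritten as the interleaved
// two-stream loop described above.
template <typename It, typename T>
T accumulate_split(It first, It last, T init)
{
    auto n    = std::distance(first, last);
    auto half = n / 2;
    It   mid  = first + half;
    T sumA = init, sumB = T{};
    for (decltype(n) i = 0; i < half; ++i) {
        sumA += first[i];   // lower-half stream
        sumB += mid[i];     // upper-half stream
    }
    if (n & 1)
        sumA += first[n - 1];
    return sumA + sumB;
}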
(Also BTW I implemented my equivalent of CLFlip and see almost exactly the same curve as you see out through L2 -- of course beyond that M1M behaves very differently, but I see exactly what I would expect for M1's SLC and DRAM.)