By: RichardC (tich.delete@this.pobox.com), April 17, 2017 5:25 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on April 17, 2017 3:37 pm wrote:
> > The general tradeoff in this kind of code is that a fine-grain decomposition allows
> > the inner kernel to work with data (almost) all in L1 cache, but introduces more
> > branchiness and short-vector overhead. Figuring out the right granularity is
> > complicated (and even more complicated when parallelizing across multicores).
> >
> >
>
> As i said, i didn't even touch multicore issues. And I tried various strategies in automated way.
I suspect the best way to decompose it to match the memory hierarchy is to leapfrog
back and forth across the hierarchy's levels, e.g. for AVX2, with
256-bit SIMD registers = 8 x 32-bit:
Inner loop keeps a 4 x 8 submatrix of A in 4 AVX2 registers
A small-enough-for-L1 submatrix of B is in L1
Columns of the B-submatrix stream through, accumulating a partial
result in AVX registers
And then you work your way outwards from that kernel - which does a
considerable number of multiply-adds for each SIMD access from L1
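To make that concrete, here's a rough sketch of how such an inner kernel could look in C
with AVX2 intrinsics. The function names, and the assumption that the L1-resident B panel
is packed so each 8-float column is contiguous, are mine; this is one literal reading of
the layout above, not a tuned production kernel:

#include <immintrin.h>

/* Horizontal sum of the 8 floats in an AVX register. */
static inline float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    return _mm_cvtss_f32(s);
}

/* C[0..3][0..n-1] += A(4x8) * B(8xn).
   The 4x8 block of A stays in four AVX2 registers for the whole loop;
   Bcols is the L1-resident sub-panel of B, packed column by column, so
   each 32-byte load of a B column feeds 4 x 8 = 32 multiply-adds. */
static void kernel_4x8(const float *A, int lda,
                       const float *Bcols, int n,
                       float *C, int ldc)
{
    __m256 a0 = _mm256_loadu_ps(A + 0*lda);
    __m256 a1 = _mm256_loadu_ps(A + 1*lda);
    __m256 a2 = _mm256_loadu_ps(A + 2*lda);
    __m256 a3 = _mm256_loadu_ps(A + 3*lda);

    for (int j = 0; j < n; j++) {
        __m256 b = _mm256_loadu_ps(Bcols + 8*j);  /* one B column, from L1 */
        C[0*ldc + j] += hsum256_ps(_mm256_mul_ps(a0, b));
        C[1*ldc + j] += hsum256_ps(_mm256_mul_ps(a1, b));
        C[2*ldc + j] += hsum256_ps(_mm256_mul_ps(a2, b));
        C[3*ldc + j] += hsum256_ps(_mm256_mul_ps(a3, b));
    }
}

The outer levels then decide how large the B panel is (sized to fit L1) and how the
4x8 blocks of A march across the matrices, which is where the granularity decisions
above come in.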
But it's probably a challenge to persuade a compiler to do it right.
And I suspect it's an even bigger challenge to do linear algebra on the Mill,
where there's an underlying assumption that fetched values and results are mostly
used only once (or maybe zero times). That is greatly at odds with fast linear
algebra, which has to be arranged so that each fetch is reused as many times as
necessary to match the FLOPS rate to the memory-hierarchy bandwidth constraints.
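For a sense of how strong that constraint is, a back-of-the-envelope calculation
(the figures below are made up for illustration, not taken from any particular chip):

#include <stdio.h>

/* If the core can retire more flops per cycle than a given level of the
   hierarchy can deliver operands, each fetched value must be reused at
   least flops_per_cycle / floats_per_cycle times at that level.
   All numbers here are assumed, purely for illustration. */
int main(void)
{
    double flops_per_cycle = 32.0;                   /* e.g. 2 x 8-wide FMAs per cycle */
    double floats_per_cycle[3] = { 16.0, 8.0, 2.0 }; /* L1, L2, memory */
    const char *level[3] = { "L1", "L2", "memory" };

    for (int i = 0; i < 3; i++)
        printf("%s: each fetched float must feed >= %.0f flops\n",
               level[i], flops_per_cycle / floats_per_cycle[i]);
    return 0;
}

That reuse is exactly what the blocking described above is there to create.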
Getting the details right is a bear, of course, and not something I'd care to
do without a prospect of being paid for it :-)