By: RichardC (tich.delete@this.pobox.com), April 21, 2017 11:27 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on April 21, 2017 9:45 am wrote:
> The fundamental problem here is the balance of compute vs memory accesses. The underlying
> hardware bandwidths give you a vector L1 access on each cycle, and a vector multiply-add
> on each cycle. So as long as you keep the access/compute ratio below one access per
> multiply-add, you have a fighting chance to get good utilization.
>
> Now with an (M,K) matrix multiplied by a (K,N) matrix to give a (M,N) result,
> you've got K multiply-adds for each of M*N results (or K+1 if you're mul-adding
> the original C matrix, but ignore that for now).
>
> This means the compute (M*N*K) scales up faster than the total data size
> M*K + K*N + M*N. So yeah, big matrices are easy. As long as 1 operand is in registers,
> you've got a pile of FLOPS to do with it and the other operand can stream through.
> A not-very-smart decomposition works fine because the access/compute ratio is good.
>
> It's when you get down to small/moderate matrix sizes that you need to find some
> better decomposition which uses the registers more effectively - viz using the
> registers to hold a 2-dimensional submatrix, not just a 1-dimensional vector, so
> that the subvectors streaming through get used in multiple FMA's to give a better
> access/compute ratio.
>
> Now there are also some micro-scheduling issues to worry about (though OoO machines
> often take care of those quite well), but the fundamental issue is to exploit the
> register/memory hierarchy to keep a manageably low access rate to L1 and L2.
I'm a little behind the times: Haswell can actually do 2 8-wide FMAs per cycle,
for up to 32 SP FLOPs per cycle. But since it can also do up to 2 32B vector loads
per cycle, the issue of the access/compute ratio is pretty much as I described it.
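To put rough numbers on that: 2 FMAs/cycle on 8-wide SP vectors is 16 multiply-adds (32 FLOPs) per cycle, while 2 32B loads/cycle deliver at most 16 single-precision operands per cycle. So the break-even point is still about one loaded operand per multiply-add, and any inner loop that needs more than that per FMA is load-bound no matter how well it's scheduled.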
But there's other stuff which can screw up the L1-access throughput, e.g.
http://www.agner.org/optimize/blog/read.php?i=415
"Contrary to my previous understanding, alignment makes a big difference on the speed at which vectors are read from L1 to register. If your data is 16B aligned rather than 32B aligned, a sequential read from L1 is no faster with 256-bit YMM reads than it is with 128-bit XMM reads. VMOVAPS and VMOVUPS have the same speed, but you cannot achieve 2 32B loads per cycle if the underlying data is not 32B aligned. If the data is 32B aligned, you still can't quite sustain 64 B/cycle of load with either, but you can get to about 54 B/cycle with both."
So the details matter.
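The alignment part, at least, is easy to control in your own code. A minimal sketch (my own illustration, using C11 aligned_alloc, which requires the size to be a multiple of the alignment):

#include <stdlib.h>

/* Hand back a float buffer whose base address is 32B-aligned, so that
   sequential 256-bit loads from it never straddle a 32B boundary.
   Round the byte count up because C11 aligned_alloc requires the size
   to be a multiple of the alignment; release with free() as usual. */
static float *alloc_floats_32B_aligned(size_t n)
{
    size_t bytes = (n * sizeof(float) + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);
}

That doesn't buy you anything by itself, but it lets the inner loops use full-width 32B loads at the rates Agner measured for the aligned case.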
This more detailed information reinforces my point that you need to design your
software carefully to get a favorable access/compute ratio. And that having explicit
registers is very handy for that.
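To make the 2-D register-blocking point concrete, here's a minimal sketch of the kind of micro-kernel I mean (my own illustration, not code from any particular BLAS; it assumes AVX2/FMA, row-major storage, and problem dimensions that are multiples of the 4x16 tile):

#include <immintrin.h>

/* Hypothetical 4x16 register-tiled SGEMM micro-kernel.  A is MxK row-major
   (row stride lda), B is KxN row-major (row stride ldb), C is MxN row-major
   (row stride ldc); this updates one 4-row by 16-column block of C in place. */
static void kernel_4x16(int K, const float *A, int lda,
                        const float *B, int ldb,
                        float *C, int ldc)
{
    /* 8 YMM accumulators hold the 4x16 sub-block of C in registers. */
    __m256 c00 = _mm256_loadu_ps(C + 0*ldc), c01 = _mm256_loadu_ps(C + 0*ldc + 8);
    __m256 c10 = _mm256_loadu_ps(C + 1*ldc), c11 = _mm256_loadu_ps(C + 1*ldc + 8);
    __m256 c20 = _mm256_loadu_ps(C + 2*ldc), c21 = _mm256_loadu_ps(C + 2*ldc + 8);
    __m256 c30 = _mm256_loadu_ps(C + 3*ldc), c31 = _mm256_loadu_ps(C + 3*ldc + 8);

    for (int k = 0; k < K; k++) {
        /* Two 32B loads stream 16 elements of one row of B ... */
        __m256 b0 = _mm256_loadu_ps(B + k*ldb);
        __m256 b1 = _mm256_loadu_ps(B + k*ldb + 8);

        /* ... and each is reused in 4 FMAs, one per row of the C tile,
           so the tile needs 6 accesses for 8 FMAs, versus 2 accesses
           per FMA for a plain 1x8 single-accumulator loop. */
        __m256 a0 = _mm256_broadcast_ss(A + 0*lda + k);
        __m256 a1 = _mm256_broadcast_ss(A + 1*lda + k);
        __m256 a2 = _mm256_broadcast_ss(A + 2*lda + k);
        __m256 a3 = _mm256_broadcast_ss(A + 3*lda + k);

        c00 = _mm256_fmadd_ps(a0, b0, c00);  c01 = _mm256_fmadd_ps(a0, b1, c01);
        c10 = _mm256_fmadd_ps(a1, b0, c10);  c11 = _mm256_fmadd_ps(a1, b1, c11);
        c20 = _mm256_fmadd_ps(a2, b0, c20);  c21 = _mm256_fmadd_ps(a2, b1, c21);
        c30 = _mm256_fmadd_ps(a3, b0, c30);  c31 = _mm256_fmadd_ps(a3, b1, c31);
    }

    _mm256_storeu_ps(C + 0*ldc, c00);  _mm256_storeu_ps(C + 0*ldc + 8, c01);
    _mm256_storeu_ps(C + 1*ldc, c10);  _mm256_storeu_ps(C + 1*ldc + 8, c11);
    _mm256_storeu_ps(C + 2*ldc, c20);  _mm256_storeu_ps(C + 2*ldc + 8, c21);
    _mm256_storeu_ps(C + 3*ldc, c30);  _mm256_storeu_ps(C + 3*ldc + 8, c31);
}

Per k iteration that's 6 L1 accesses (two loads of B plus four broadcasts of A) feeding 8 FMAs, versus 2 accesses per FMA if you keep only one row of C in registers, which is exactly the access/compute improvement the 2-D tile buys.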