By: anon (spam.delete.delete@this.this.spam.com), April 23, 2017 6:51 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on April 23, 2017 6:18 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on April 22, 2017 10:43 am wrote:
>
> > You are just starting to understand the problems.
> > But still there are more things that you didn't start to understand.
> > 2 biggest points that you didn't start to understand yet are:
> > 1. FMAC latency. That's a big problem on *wells, slightly less so on *lakes, but only slightly.
> > 1.1. impact of (1) on register allocation in the inner loop
> > 2. Decode/rename/retire bottleneck of 4 instruction per clock
>
> You're right, my sketch was screwy. We need subcolumns in vector registers, and
> then have to multiply each subcolumn by a scalar picked out of another subcolumn. And AVX2 doesn't
> have multiple-vector-by-scalar, and has some awkward dispatch-port constraints on the
> way you can broadcast a scalar into a 256-bit vector. But those constraints have been
> improving from one implementation to the next.
>
> On 1.1, I'm not altogether convinced that the limit on the number of architectural
> register names is critical - doesn't the OoO magic rename those onto a larger number
> of physical registers, which would be large enough to deal with the high FMA latency?
> And I understand the number of AVX registers goes up to 32 in AVX-512, so x86 is
> a rapidly-moving target. But since they've changed the names from XMM to YMM to ZMM,
> they've now hit the final architectural limit because they're at the end of the alphabet :-)
>
>
I don't know about SKL-X yet, and I'd be lying if I said I know for sure, but I think the Intel FP/vector PRF uses 128-bit entries, not 256-bit, so 168x128-bit would mean 168 XMM registers but "only" 84 YMM registers.
Weird things happen with partial register access, that's for sure.