By: --- (---.delete@this.redheron.com), May 19, 2022 1:20 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on May 18, 2022 11:20 pm wrote:
> Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 18, 2022 11:03 pm wrote:
> > The data for the algorithm I’m using (Monte Carlo simulation) fits in L3 cache so DRAM
> > bandwidth will not be my bottleneck. Maybe cache bandwidth will be the bottleneck.
> >
> > I don’t understand the comment that “visible registers
> > are almost irrelevant”. If a sequence of code has
> > more intermediate results than fit in the visible (architecture)
> > registers, the compiler will have to use loads
> > and stores to make up the difference. I thought the larger
> > size of the physical register file compared to the
> > visible (architecture) registers is to support out-of-order
> > and speculative execution. The compiled code has
> > to produce the same result on an in-order machine and an
> > out-of-order machine so the larger physical register
> > file on an out-of-order machine can not be used to reduce the number of loads and stores in the code.
>
> X86 processors have had store-load bypassing for decades to deal with compiler brain damage
> back when x86 had only 8 registers. So reloading registers is not as terrible as you think.
> As a programmer it does make me cringe. You are probably still using up addressing slots,
> and so that can hurt performance, but not latency of the ops due to the bypass.
And having to actually handle it as a load+store going through the LSQ is really the worst-case scenario. I would expect a modern x86 to handle it via the Stack Engine as, essentially, a Rename. Is that too optimistic?
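To make the pattern under discussion concrete, here is a minimal sketch in C. The function names are made up for illustration, and `volatile` is only a stand-in for what the compiler does when it runs out of architectural registers: it forces the accumulator into a stack slot, so every iteration stores to that slot and then reloads from it.

```c
#include <stddef.h>

/* Hypothetical illustration: the accumulator is forced into a stack slot.
   `volatile` makes the compiler emit a load and a store of `spill` on every
   iteration, imitating a register spill and reload. */
double dot_with_spill(const double *a, const double *b, size_t n)
{
    volatile double spill = 0.0;   /* kept in memory, not in a register */
    for (size_t i = 0; i < n; i++) {
        double t = a[i] * b[i];
        spill = spill + t;         /* reload slot, add, store back to slot */
    }
    return spill;
}

/* Same computation, with the accumulator free to stay in a register. */
double dot_in_register(const double *a, const double *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```

In the first version the loop-carried dependency runs through memory (store, then a dependent load of the same address). Store-load bypassing is what keeps that from costing a full cache round trip, and a rename-style mechanism, where one exists and applies, would aim to remove most of the remaining latency. How much of the gap actually closes on a given core is something to measure, not assume.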
On M1 there are a bunch of caveats, special cases, and situations that are described in patents but don't seem to be hooked up (yet?) in the M1; but the bottom line is that at least some of these cases are handled by the Stack Engine version of Rename.
ON THE OTHER HAND...
M1 is very much designed under the assumption that NEON is about throughput, not latency. Consequently, and relevant in this case, far fewer of the Rename tricks are used for NEON registers than for FP. In principle x86 could make the same set of tradeoffs, but given that x86 designers seem less concerned with energy, maybe not?