By: Brett (ggtgp.delete@this.yahoo.com), May 18, 2022 11:20 pm
Room: Moderated Discussions
Charlie Burnes (charlie.burnes.delete@this.no-spam.com) on May 18, 2022 11:03 pm wrote:
> The data for the algorithm I’m using (Monte Carlo simulation) fits in L3 cache so DRAM
> bandwidth will not be my bottleneck. Maybe cache bandwidth will be the bottleneck.
>
> I don’t understand the comment that “visible registers are almost irrelevant”. If a sequence of code has
> more intermediate results than fit in the visible (architecture) registers, the compiler will have to use loads
> and stores to make up the difference. I thought the larger size of the physical register file compared to the
> visible (architecture) registers is to support out-of-order and speculative execution. The compiled code has
> to produce the same result on an in-order machine and an out-of-order machine so the larger physical register
> file on an out-of-order machine can not be used to reduce the number of loads and stores in the code.
x86 processors have had store-load bypassing (store-to-load forwarding) for decades, to deal with compiler brain damage from back when x86 had only 8 registers. So spilling a value to the stack and reloading it is not as terrible as you think, though as a programmer it does make me cringe. The spills and reloads probably still use up address-generation slots, and that can hurt throughput, but the bypass keeps the latency of the ops from suffering much.
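Here is a rough C sketch of the spill/reload pattern I mean. It is an illustration only: the function names are made up, and `volatile` is just a trick to force the compiler to emit a real store and a real load every iteration, standing in for a register spill. On a core with store-to-load forwarding, the reload can typically be fed from the pending store instead of waiting on the cache, but every iteration still issues one load and one store.

```c
#include <stdint.h>

/* Hypothetical example: the accumulator lives in memory, so each
 * iteration reloads it, adds to it, and stores it back, much like
 * a value the compiler had to spill to the stack. */
static uint64_t sum_with_spill(uint64_t n) {
    volatile uint64_t slot = 0;      /* volatile forces the memory traffic */
    for (uint64_t i = 0; i < n; i++) {
        uint64_t t = slot;           /* reload (candidate for forwarding) */
        slot = t + i;                /* spill: store the new value back */
    }
    return slot;
}

/* Same arithmetic with the accumulator free to stay in a register. */
static uint64_t sum_in_register(uint64_t n) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++) {
        acc += i;
    }
    return acc;
}
```

Both functions return the same result. Timing them against each other with a large n on your own machine is one way to see what the extra loads and stores cost. I would not assume any particular ratio, since it depends on the core and on what the compiler does with the second loop.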