By: Charlie Burnes (charlie.burnes.delete@this.no-spam.com), May 18, 2022 11:03 pm
Room: Moderated Discussions
The data for the algorithm I’m using (a Monte Carlo simulation) fits in L3 cache, so DRAM bandwidth will not be my bottleneck; cache bandwidth might be instead.
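Whether L3 bandwidth then becomes the limit is just bytes touched per step against the cache’s sustainable bandwidth. A rough check (all numbers below are placeholders, not measurements of my kernel or CPU):

#include <stdio.h>

int main(void)
{
    /* Placeholder numbers -- measure the real kernel and CPU before trusting this. */
    const double bytes_per_step   = 512.0;    /* data touched per MC step            */
    const double steps_per_second = 2.0e8;    /* achieved MC step rate               */
    const double l3_bw_bytes      = 400.0e9;  /* sustainable L3 bandwidth, ~400 GB/s */

    const double demand = bytes_per_step * steps_per_second;
    printf("demand %.0f GB/s of %.0f GB/s L3 bandwidth (%.0f%%)\n",
           demand / 1e9, l3_bw_bytes / 1e9, 100.0 * demand / l3_bw_bytes);
    return 0;
}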
I don’t understand the comment that “visible registers are almost irrelevant”. If a sequence of code has more live intermediate results than fit in the visible (architectural) registers, the compiler has to spill, using loads and stores to make up the difference. I thought the larger physical register file, compared to the architectural register set, exists to support register renaming for out-of-order and speculative execution. The compiled code has to produce the same result on an in-order machine and an out-of-order machine, so the larger physical register file of an out-of-order machine cannot be used to reduce the number of loads and stores in the code.
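To make the spill point concrete, here is a minimal C sketch (a hypothetical kernel, not my code, and the register counts assume x86-64’s 16 general-purpose and 16 XMM architectural registers): the loop keeps more accumulators live than the ISA exposes, so unless the compiler vectorizes them into wider registers it has to park some of them in stack slots, and no amount of physical registers in the hardware removes those loads and stores from the instruction stream.

/* Minimal sketch (hypothetical kernel). 24 accumulators stay live across the
 * loop -- more than the 16 XMM architectural registers x86-64 exposes -- so,
 * unless the compiler packs them into wider vector registers, some of them
 * live in stack slots and every iteration pays spill loads/stores.
 * Inspect with e.g.:  cc -O2 -fno-tree-vectorize -S spill.c  */
#include <stddef.h>

double many_live_sums(const double *x, size_t n)
{
    double a00 = 0, a01 = 0, a02 = 0, a03 = 0, a04 = 0, a05 = 0,
           a06 = 0, a07 = 0, a08 = 0, a09 = 0, a10 = 0, a11 = 0,
           a12 = 0, a13 = 0, a14 = 0, a15 = 0, a16 = 0, a17 = 0,
           a18 = 0, a19 = 0, a20 = 0, a21 = 0, a22 = 0, a23 = 0;

    for (size_t i = 0; i + 24 <= n; i += 24) {
        a00 += x[i +  0];  a01 += x[i +  1];  a02 += x[i +  2];
        a03 += x[i +  3];  a04 += x[i +  4];  a05 += x[i +  5];
        a06 += x[i +  6];  a07 += x[i +  7];  a08 += x[i +  8];
        a09 += x[i +  9];  a10 += x[i + 10];  a11 += x[i + 11];
        a12 += x[i + 12];  a13 += x[i + 13];  a14 += x[i + 14];
        a15 += x[i + 15];  a16 += x[i + 16];  a17 += x[i + 17];
        a18 += x[i + 18];  a19 += x[i + 19];  a20 += x[i + 20];
        a21 += x[i + 21];  a22 += x[i + 22];  a23 += x[i + 23];
    }

    return a00 + a01 + a02 + a03 + a04 + a05 + a06 + a07
         + a08 + a09 + a10 + a11 + a12 + a13 + a14 + a15
         + a16 + a17 + a18 + a19 + a20 + a21 + a22 + a23;
}

On an out-of-order core the rename registers let those spill loads and stores overlap with the arithmetic, but they still appear as instructions and stack traffic, which is exactly the point I’m making above.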