By: Jan Vlietinck (jvlietinck.delete@this.yahoo.com),
Room: Moderated Discussions
NoSpammer (no.delete@this.spam.com) on November 13, 2020 3:04 am wrote:
> anonymou5 (no.delete@this.spam.com) on November 12, 2020 10:31 pm wrote:
> > > > Does anyone have any stats that would support or disprove this assertion of 32 registers
> > > > being much better than 16? Where does the law of diminishing returns come in?
> > >
> > > There were some papers published in the early RISC era that put the "ideal" number of architectural
> > > registers in the high 20s. Since a machine with 27 registers is awkward, you get 32.
> > >
> > > Personal experience (I've done rather large amounts of S/360
> > > assembler over the years) is that 16, especially
> > > after you lose a few to various semi-dedicated purposes,
> > > is annoyingly tight. I think I'd have trouble using
> > > 32 well (just too much to keep track of for a human brain),
> > > but there were plenty of times I'd have killed for
> > > a few extra regs. Vax (especially if any FP was involved) was worse, x86... well, we won't go there.
> > >
> > > Of course not much of that is likely relevant any more, the papers were written mostly before OoO
> > > was common, and modern compiler register allocation tends to be very different than what humans do.
> >
> > Supposedly, when AMD did x86-64, the simulations suggested 24 as the sweet spot
> > but doing 32 wasn't quite within reach at the time. So 16 it became back then.
> >
> > [source: oral Q&A at some old talk by AMD's Kevin McGrath, back in those days]
> >
> > VEX permits R0-R15.
> > EVEX permits R0-R31.
> >
> > So far Intel and AMD simply have not "applied" that to the classic integer ops,
> > with the exception of a handful of recent instructions, e.g. BMI, BMI2, AMX.
>
> There was a paper out there attempting to analyze the number of registers required per compiled function.
> I remember that the huge majority required up to 16 registers, only a few percent required more.
>
> In my experience with assembly optimizations I never needed more than 12 registers for any
> inner integer loop, while for floating point code that number would be just over 20.
>
> However, looking at x64 compiler output, we often need to break the functions to smaller functions,
> because the compiler is not able to properly allocate registers to different loops and starts spilling,
> whereas individual parts will optimize properly to registers and run much faster.
>
> So more registers than strictly required is still good as it will make the compiler's job easier.
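32 registers is definitely better than 16. 32-bit ARM had 16 registers, and 64-bit ARM improved that to 32 register encodings (31 general-purpose registers plus a zero/stack-pointer encoding). It paid for the wider register fields by sacrificing the 4-bit condition (predication) field in the instruction encoding.
AVX-512 also uses 32 registers, BTW.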
ARM has an additional advantage in naming 3 registers in one instruction:
dest = op(src1, src2)
Classic x86 is:
src1 = op(src1, src2), which destroys the value of src1. If that value is still needed it has to be copied first, leading to more register pressure and memory loads.
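To make the 2-operand cost concrete, here is a rough Python sketch that lowers 3-operand ops to the destructive 2-operand form and shows where the extra copies come from. It is a toy model, not how any real compiler does it: the function name `lower_to_two_operand`, the tuple format and the `mov`/`add`/`mul` mnemonics are made up for illustration, and it assumes every name is defined only once (SSA-like), so dest never collides with a source.

```python
from typing import FrozenSet, List, Tuple

# One three-operand op: (dest, op, src1, src2), meaning dest = op(src1, src2).
# Hypothetical toy format, not a real IR.
Instr = Tuple[str, str, str, str]


def lower_to_two_operand(code: List[Instr],
                         live_out: FrozenSet[str] = frozenset()) -> List[str]:
    """Lower three-operand ops to destructive two-operand form.

    Assumes each name is defined exactly once (SSA-like). A 'mov' is
    emitted only when src1 is still needed after the instruction,
    because 'op r1, r2' overwrites r1.
    """
    reg = {}   # value name -> register currently holding it
    out = []
    for i, (dest, op, src1, src2) in enumerate(code):
        r1 = reg.get(src1, src1)
        r2 = reg.get(src2, src2)
        # src1 is still live if a later op reads it or it is live at exit.
        used_later = src1 in live_out or any(
            src1 in (a, b) for _, _, a, b in code[i + 1:])
        if used_later:
            # Copy first so the original src1 survives: one extra
            # instruction and one extra register.
            out.append(f"mov {dest}, {r1}")
            out.append(f"{op} {dest}, {r2}")
        else:
            # src1 is dead, so the result can simply reuse its register.
            out.append(f"{op} {r1}, {r2}")
            reg[dest] = r1
    return out


# t = a + b; u = a * t  (a is read again after the add)
example = [("t", "add", "a", "b"), ("u", "mul", "a", "t")]
for line in lower_to_two_operand(example):
    print(line)
```

The 3-operand version of this example is 2 instructions; the 2-operand lowering needs 3, because `a` is still live when the add executes. With only 16 registers, that extra copy is also one more value competing for a register.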