By: Freddie (freddie.delete@this.witherden.org), June 2, 2022 7:22 am
Room: Moderated Discussions
Peter Lewis (peter.delete@this.notyahoo.com) on June 2, 2022 1:22 am wrote:
> Both x86 and ARM will increase the number of instructions decoded per clock, but the increase will be faster for ARM.
A few points of note. Firstly, decode is less important than it used to be due to micro-op caches. Both Intel and AMD employ them, and they have been growing substantially in size from generation to generation. (Interestingly, they are also used in some non-Apple ARM designs.) This mitigates many of the issues associated with variable-length decoding.
Secondly, just counting instructions is not sensible. Consider:
vfnmadd231pd zmm21,zmm23,ZMMWORD PTR [rsi+0xb280]
which is one x86 instruction but on ARM would translate to four (two adds due to the size of the immediate, a load, and an FMA). Plus, it would cost you a vector register to hold the loaded value and a GPR if you wanted to preserve the value of "rsi". On x86 that instruction is 10 bytes (and there is still room for a bigger immediate), whereas the four ARM instructions are 16 bytes.
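For concreteness, here is a rough AArch64 sketch of what that four-instruction sequence might look like, assuming SVE at a 512-bit vector length; the register choices (x1 standing in for "rsi", x9 as a scratch GPR, z0-z2, and the governing predicate p0) are purely illustrative:

add   x9, x1, #0xb, lsl #12    // build the 0xb280 offset: first add (x1 stands in for rsi)
add   x9, x9, #0x280           // second add for the low bits of the offset
ld1d  {z0.d}, p0/z, [x9]       // load the memory operand into a vector register
fmls  z1.d, p0/m, z2.d, z0.d   // z1 -= z2 * z0, i.e. the negated multiply-add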
> Because of the difficulty of decoding variable length instructions in parallel, x86 will favor wider vectors than ARM. This is something we already see today with 512-bit vector operations (two of them per clock) on x86 and 128-bit vector operations (four of them per clock) on Apple’s M1.
SVE is likely to change this, at least in higher-performance (non-mobile) SKUs. Graviton3 on AWS has 256-bit SVE, for example (though it appears able to issue only 1.5 FMAs per cycle).
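To illustrate why vector width becomes less of a software-visible choice with SVE, here is a minimal, hypothetical vector-length-agnostic loop computing c[i] += a[i] * b[i] over doubles (x0 = element count, x1/x2/x3 = pointers to a/b/c; register and label names are illustrative). The same binary would run with 128-bit vectors on one part and 256-bit vectors on something like Graviton3:

        mov     x4, #0                          // element index
        whilelt p0.d, x4, x0                    // predicate covering the remaining elements
.loop:
        ld1d    {z0.d}, p0/z, [x1, x4, lsl #3]  // load a[i..]
        ld1d    {z1.d}, p0/z, [x2, x4, lsl #3]  // load b[i..]
        ld1d    {z2.d}, p0/z, [x3, x4, lsl #3]  // load c[i..]
        fmla    z2.d, p0/m, z0.d, z1.d          // c += a * b
        st1d    {z2.d}, p0, [x3, x4, lsl #3]    // store c[i..]
        incd    x4                              // advance by the hardware vector length
        whilelt p0.d, x4, x0                    // refresh the predicate
        b.first .loop                           // loop while any elements remain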
Regards, Freddie.