By: --- (---.delete@this.redheron.com), June 2, 2022 1:25 pm
Room: Moderated Discussions
Freddie (freddie.delete@this.witherden.org) on June 2, 2022 7:22 am wrote:
> Peter Lewis (peter.delete@this.notyahoo.com) on June 2, 2022 1:22 am wrote:
> > Both x86 and ARM will increase the number of instructions decoded per clock, but the increase will be faster for ARM.
>
> A few points of note. Firstly, decode is less important than it used to be due to micro-op
> caches. Both Intel and AMD employ them and they have been growing in size substantially on
> a generation-to-generation basis. (Interestingly, they are also used in some non-Apple ARM
> designs too.) This mitigates a lot of the issues associated with variable length decoding.
Not just non-Apple ARM.
To go by patents (this is something hard to test experimentally), Apple have at least two additional tiers of "instruction micro-cache-like things" between the standard instruction cache on one side and the loop buffer on the other side.
The *real win* (IMHO) in the current x86 and newest ARM designs, and apparently not yet implemented by Apple, is the elastic queue between Decode (or the feed of decoded instructions from the micro-op cache) and Rename. This provides one more balancing point between temporary bursts of excess supply and excess demand.
Apple's current scheme (tight coupling between Decode and Rename) may shave one cycle off recovery from any sort of flush/resteer, but it means that downstream gets no bandwidth win from instructions discarded at Decode (either fully discarded, like NOPs and simple unconditional branches, or "effectively" discarded, like fused pairs).
What *I* suspect (and sure, what do I know?!) would be Apple's best path going forward is to:
- widen fetch even further and allow Decode to be around 10 wide (plus feed 10 or even 12 from the various micro-cache-like entities)
- add a queue between Decode and Rename
- keep Rename at 8-wide (for now)
That means the hot, expensive, complex Rename stage can stay unchanged for now, but it will frequently run at the full 8 wide rather than, as in many cases today, running substantially narrower because Decode threw away some instructions.
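To see the size of that effect, here's a toy model of the Decode-to-Rename hand-off. All the numbers (widths, and a 20% discard rate at Decode) are my own illustrative assumptions, not Apple's real figures:

```python
# Toy model of the Decode->Rename hand-off (illustrative only; the widths and
# the 20% discard rate are assumptions, not Apple's real numbers).

def run(decode_width, rename_width, queued, cycles=10_000):
    """Return average instructions renamed per cycle.

    Every 5th instruction is 'discarded' at Decode (think NOPs, unconditional
    branches, or the second half of a fused pair).
    """
    pc = 0          # index of the next instruction to decode
    queue = 0       # decoded instructions waiting for Rename (unused if not queued)
    renamed = 0
    for _ in range(cycles):
        survivors = sum(1 for i in range(pc, pc + decode_width) if i % 5 != 0)
        pc += decode_width
        if queued:
            queue += survivors
            take = min(rename_width, queue)
            queue -= take
        else:
            # tight coupling: Rename sees only this cycle's survivors
            take = min(rename_width, survivors)
        renamed += take
    return renamed / cycles

print(run(8, 8, queued=False))   # tight coupling: 6.4 renamed/cycle
print(run(10, 8, queued=True))   # 10-wide Decode + queue: 8.0 renamed/cycle
```

Even this crude model shows the point: with a queue and a slightly over-provisioned Decode, the unchanged 8-wide Rename runs at its full width essentially every cycle.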
A next step would be to make Rename more flexible: rather than a fixed 8-wide limit, there would be various specific per-resource limits (6 integer physical registers allocated, 4 fp physical registers allocated, ...), and to some extent these could be mixed and matched so that, for the right mix of instructions, perhaps 10 or 11 instructions could be "Renamed" (i.e. have their allocations performed).
My guess is this is feasible (though tough, and it may require Rename to be split over two cycles, though some of the work could be done at Decode), but I'm unaware of any CPU that actually implements it, so there may be trickiness/impossibility I am not seeing.
However the first step -- running your 8-wide Rename at 8 wide for many more cycles -- is, I believe, perfectly feasible and not even that tough.
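The mix-and-match Rename limits I described can be sketched as a greedy in-order packer. The specific limits and the instruction mix below are my own illustrative choices:

```python
# Sketch of per-resource Rename limits (all numbers are illustrative assumptions).
# Instead of a flat 8-wide limit, each cycle allocates until some one resource
# (int PRF writes, fp PRF writes, total slots) is exhausted. Rename is in-order,
# so we stop at the first instruction that doesn't fit.

LIMITS = {"int": 6, "fp": 4, "slots": 12}

def rename_one_cycle(window):
    """window: list of dicts like {"int": 1} or {"fp": 1} or {} (branch/store).
    Returns how many leading instructions can be renamed this cycle."""
    used = {k: 0 for k in LIMITS}
    for n, insn in enumerate(window):
        needed = dict(insn, slots=1)
        if any(used[r] + c > LIMITS[r] for r, c in needed.items()):
            return n
        for r, c in needed.items():
            used[r] += c
    return len(window)

mixed = [{"int": 1}] * 5 + [{"fp": 1}] * 4 + [{}] * 2   # 5 int, 4 fp, 2 branches
print(rename_one_cycle(mixed))        # 11: the right mix beats a flat 8-wide
int_heavy = [{"int": 1}] * 8
print(rename_one_cycle(int_heavy))    # 6: the int-PRF limit binds first
```

The same hardware budget can thus rename 10 or 11 instructions when the mix cooperates, at the cost of renaming fewer when one resource class dominates.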
> Secondly, just counting instructions is not sensible. Consider:
>
> vfnmadd231pd zmm21,zmm23,ZMMWORD PTR [rsi+0xb280]
>
> which is one x86 instruction but in ARM it would translate to four (two adds due to the size of
> the immediate, a load, and an FMA). Plus it would cost you a vector register to store the load
> and a GPR if you wanted to preserve the value of "rsi". On x86 that instruction is 10 bytes (and
> there is still room for a bigger immediate) whereas four ARM instructions are 16 bytes.
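For concreteness, Freddie's four-instruction ARM rendering might look like this in SVE (a sketch only: a 512-bit SVE implementation, the scratch registers x16/z0, and an all-true predicate already in p0 are my assumptions):

```asm
add  x16, x6, #0xb000           // x6 standing in for rsi; 0xb280 needs two ADDs
add  x16, x16, #0x280           //   (ADD's immediate is 12 bits, optionally shifted)
ld1d {z0.d}, p0/z, [x16]        // the load costs a vector register (z0)
fmls z21.d, p0/m, z23.d, z0.d   // z21 -= z23 * z0: the vfnmadd231pd semantics
```

Four fixed-width instructions, 16 bytes, versus the single 10-byte x86 instruction, exactly as the byte counts above say.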
>
> > Because of the difficulty of decoding variable length instructions in parallel, x86 will favor wider
> > vectors than ARM. This is something we already see today with 512-bit vector operations (two of them
> > per clock) on x86 and 128-bit vector operations (four of them per clock) on Apple’s M1.
>
> SVE is likely to change this, at least in higher-performance (non-mobile) SKUs. Graviton3
> on AWS has SVE-256, for example (but seems to only be able to issue 1.5 FMAs per cycle).
>
> Regards, Freddie.