By: ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com), June 5, 2022 12:32 am
Room: Moderated Discussions
Peter Lewis (peter.delete@this.notyahoo.com) on June 4, 2022 6:56 pm wrote:
> The performance of filling the µop cache still matters, which is why Intel recently increased the
> number of instructions decoded per clock from 4 to 6. If the µop cache was big enough to hold all
> the performance critical code, there would be no need to do that. For the same reason you can’t
> increase the bandwidth out of a data cache without also increasing DRAM bandwidth, you can’t increase
> the bandwidth out of a µop cache without increasing the bandwidth into it. Maybe you are thinking
> the µop cache will eventually become so big that the hit rate will approach 100%. I don’t know
> if that is possible. The code size of every type of software seems to grow without limit, but maybe
> the amount of code that needs to be in cache at one time does have some limit.
Once the number of x86 instructions that have to be decoded directly exceeds some threshold (=N), it becomes efficient to store pre-decoded x86 instructions in an L2 µop cache. The problem is that I don't know what the value of N is.
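To make the break-even idea concrete, here is a toy cost model (the pJ numbers are my own assumptions for illustration, not measurements of any real core):

```python
# Toy energy model: decoding an x86 instruction directly is expensive;
# reading a pre-decoded copy from an L2 uop cache is cheap, but filling
# the entry once costs something too. All numbers below are assumptions.
DECODE_PJ = 10.0   # assumed energy to decode one x86 instruction (pJ)
FILL_PJ   = 2.0    # assumed energy to write one pre-decoded entry (pJ)
READ_PJ   = 1.0    # assumed energy to read one pre-decoded entry (pJ)

def breakeven_reuses() -> int:
    """Smallest reuse count n for which caching wins:
    FILL_PJ + n * READ_PJ < n * DECODE_PJ."""
    n = 1
    while FILL_PJ + n * READ_PJ >= n * DECODE_PJ:
        n += 1
    return n

print(breakeven_reuses())  # -> 1 with these toy numbers
```

With these (generous) numbers a single reuse already pays off; the real threshold N additionally depends on decode width, cache geometry, and how often pre-decoded entries are evicted before reuse, which is exactly why its value is hard to pin down.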
It is probable that the number of "x86 decode cores" filling the L1 or L2 µop caches in a CPU will be smaller than the number of CPU cores. One reason is that x86 decoders are utilized less than 20% of the time, so a single "x86 decode core" could serve multiple µop caches via time-division multiplexing. A problem is that "x86 decode core" is not an established term in the year 2022, which makes the concept hard to communicate to other people.
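The time-division-multiplexing claim can be sanity-checked with a toy queueing simulation (the 20% utilization figure is from the paragraph above; the one-request-per-cycle service rate and independence of cores are my assumptions):

```python
import random

# Toy model: each core wants its decoder on an independent 20% of
# cycles, and one shared "x86 decode core" retires one decode request
# per cycle. With 4 cores the shared decoder runs at 4 * 20% = 80% load.
random.seed(0)
UTIL, CORES, CYCLES = 0.20, 4, 100_000

backlog, worst = 0, 0
for _ in range(CYCLES):
    # new decode requests arriving this cycle, one possible per core
    backlog += sum(random.random() < UTIL for _ in range(CORES))
    if backlog:
        backlog -= 1          # shared decoder serves one request per cycle
    worst = max(worst, backlog)

# As long as CORES * UTIL <= 1 the backlog stays bounded, so a single
# decode core could plausibly feed up to floor(1 / 0.20) = 5 cores.
print(worst)
```

The backlog never grows beyond a handful of entries at 80% load; at exactly 5 cores the shared decoder would be critically loaded, so in practice the sharing ratio would be chosen below that ceiling.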
> > If a RISC CPU has a µop cache, the space-efficiency of the µop cache
> > can be better than the space-efficiency of RISC instructions
>
> If it is possible to make the encoding in the µop cache more space efficient than the encoding
> of RISC instructions, why didn’t the RISC processor use the µops as its instruction set?
Because it is mathematically impossible to compress 1 TiB of random binary data into 32 KiB of space and be able to load/store 64-byte randomly selected chunks from/to the 32 KiB space with a throughput of 1 chunk per CPU cycle.
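For concreteness, the counting (pigeonhole) argument behind that reply fits in a few lines of arithmetic; the sizes are the ones from the sentence above:

```python
# Pigeonhole principle: a lossless scheme cannot map all 2**bits_in
# distinct inputs onto only 2**bits_out distinct storage states.
bits_in  = 2**40 * 8      # 1 TiB of random data = 2**43 bits
bits_out = 32 * 1024 * 8  # 32 KiB = 2**18 bits

# The input carries 2**25 times more bits than the storage can hold,
# so almost all 1 TiB inputs have no lossless 32 KiB representation,
# regardless of how clever the encoding (or its read port) is.
print(bits_in // bits_out)  # -> 33554432
```

The same argument is why a µop cache's compact encoding cannot simply replace the architectural instruction set: the cache only ever has to represent the small working set currently resident in it, not every possible program.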
> > It is possible that in the near future it will become clear that CPU performance is
> > determined by the number of [conditional] branches successfully predicted per cycle
>
> I agree accurately predicting multiple branches per cycle is one of the most important
> factors for CPU performance as the number of instructions processed per cycle increases.
-atom