By: Peter Lewis (peter.delete@this.notyahoo.com), June 6, 2022 4:22 am
Room: Moderated Discussions
> It is probable that the number of "x86 decode cores" filling the L1-or-L2 µop caches in a CPU will be smaller than the number of
> CPU cores. A reason for this is that x86 decoders are being utilized less than 20% of the time and thus a single "x86 decoder core"
> can serve multiple µop caches (time-division multiplexing).
One way to implement your idea would be to have an L2 µop cache that is shared between multiple CPU cores. With OpenMP, different CPU cores are often executing the same code, so a shared cache would see reuse across cores. The main issue I see with this idea is that µops are much bigger than the original x86 instructions (72 bits to 118 bits per µop). Caching µops therefore takes more cache space and more cache power than caching the same number of x86 instructions. Instead of making an L2 cache for µops, you could spend the same area on caching more x86 instructions and get a better hit rate. It comes down to a tradeoff between increasing the instruction cache hit rate and adding more hardware for instruction decoding. I guess the best choice depends on the amount of power used by RAM vs the power used by instruction decode logic, and on the number of instructions being processed per clock.
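
To put rough numbers on the size tradeoff, here is a back-of-envelope sketch in Python. Only the 72 to 118 bits per µop and the 20% decoder utilization come from this thread. The 32 KiB budget, the 4-byte average x86 instruction length, and the function names are assumptions picked for illustration, and the sketch ignores tags, multi-µop instructions and contention between cores.

```python
# Back-of-envelope sketch. Only the µop width range (72 to 118 bits) and the
# 20% decoder utilization come from the discussion; everything else below is
# an assumption chosen for illustration.

BUDGET_BYTES = 32 * 1024      # assumed SRAM data-array budget (32 KiB)
AVG_X86_INSN_BITS = 4 * 8     # assumed average x86 instruction length (4 bytes)
UOP_BITS_MIN = 72             # from the post
UOP_BITS_MAX = 118            # from the post


def entries_in_budget(budget_bytes: int, bits_per_entry: int) -> int:
    """Whole entries that fit in the budget, ignoring tags and alignment."""
    return (budget_bytes * 8) // bits_per_entry


def max_cores_per_decoder(utilization_percent: int) -> int:
    """Ideal upper bound on cores one time-multiplexed decoder could serve.

    Ignores contention, burstiness and arbitration latency.
    """
    return 100 // utilization_percent


if __name__ == "__main__":
    insns = entries_in_budget(BUDGET_BYTES, AVG_X86_INSN_BITS)
    uops_best = entries_in_budget(BUDGET_BYTES, UOP_BITS_MIN)
    uops_worst = entries_in_budget(BUDGET_BYTES, UOP_BITS_MAX)
    print(f"x86 instructions cached: {insns}")
    print(f"µops cached at 72 bits:  {uops_best}")
    print(f"µops cached at 118 bits: {uops_worst}")
    print(f"cores per decoder at 20% utilization: {max_cores_per_decoder(20)}")
```

Under these assumptions the same SRAM holds roughly 2x to 4x as many x86 instructions as µops, which is the capacity you give up in exchange for skipping the decoders on a hit.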