Article: Power Delivery in a Modern Processor
By: Travis Downs (travis.downs.delete@this.gmail.com), May 12, 2020 1:25 pm
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on May 12, 2020 10:15 am wrote:
> Even for in-order processors, the execution width could be adjusted to moderate power draw; this has
> been used for thermal throttling, but it might apply to power supply issues. For out-of-order processors,
> there may be more opportunities for scheduling flexibility to temporarily reduce power use.
I don't know if you already saw this, but I find that modern Intel chips do a very coarse-grained version of this.
That is, they seem to throttle dispatch for all instructions to 1/4 of the normal rate as long as *any* wide (vector) instruction is in the scheduler. Agner and others previously reported this as a warmup period during which vector instructions were executed on narrower half-width vector EUs at a reduced rate (though no one knew why it was 1/4 rather than, say, 1/2). I believe that is incorrect: the full width is available, but instructions dispatch slowly to limit the worst-case droop.
The effect is very specific: it basically rounds the latency of every instruction up to the next multiple of 4, so a latency-3 multiply now takes 4 cycles, a 5-cycle load takes 8 cycles, etc. It is probably implemented by dispatching only every 4th cycle or similar, along with counting SIMD instructions going in and out of the scheduler so the penalty is avoided when no SIMD instructions are imminent.
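To make the arithmetic concrete, here is a toy model in Python. It is only a sketch of my guess above (dispatch allowed on every 4th cycle), not a description of the actual hardware; the function names and the `period` parameter are made up for illustration. It shows that restricting dispatch to every 4th cycle makes a dependent chain behave as if each latency were rounded up to a multiple of 4.

```python
def round_up_latency(latency, period=4):
    """Round a latency up to the next multiple of `period` (ceiling division)."""
    return -(-latency // period) * period


def dispatch_times(latencies, period=1):
    """Dispatch cycle of each instruction in a dependent chain.

    Toy assumption: an instruction may only dispatch on a cycle that is a
    multiple of `period`, and only once the previous result is ready.
    period=1 models normal operation, period=4 the throttled mode.
    """
    times = []
    ready = 0  # cycle at which the previous instruction's result is available
    for lat in latencies:
        # Wait for the next allowed dispatch slot at or after `ready`.
        slot = -(-ready // period) * period
        times.append(slot)
        ready = slot + lat
    return times


if __name__ == "__main__":
    print(round_up_latency(3), round_up_latency(5))   # multiply, load
    print(dispatch_times([3, 3, 3], period=4))        # chain of multiplies
    print(dispatch_times([5, 5, 5], period=4))        # chain of loads
```

In this model, back-to-back dependent multiplies dispatch 4 cycles apart and dependent loads 8 cycles apart, which matches the rounding described above.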
This continues for a while until the new voltage level is reached, and then full-speed execution can resume.
Previous RWT thread.