Article: Power Delivery in a Modern Processor
By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), May 12, 2020 10:15 am
Room: Moderated Discussions
Concerning the handling of voltage droop: besides lowering frequency, it seems one could also exploit the fact that common-case timing is typically less tight than worst-case timing. This would require some kind of validity checking and replay, but it might provide another knob for handling voltage droop. Changes that reduce the power draw for doing the work at the cost of higher energy or more total work might also be possible. Forms of speculation that trade short-term energy use for longer-term energy savings (or performance) might be throttled, losing overall efficiency/performance but moderating short-term power use.
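The tradeoff in the common-case-timing idea can be put in rough numbers. The sketch below is my own illustrative model (not from any real design): run with a reduced voltage/timing guardband, detect the rare timing violations with a validity check (as in Razor-style designs), and replay them. All the rates and penalties are made-up assumptions.

```python
# Toy model: common-case timing speculation with detect-and-replay.
# Each op takes 1 cycle in the common case; a detected timing violation
# costs an extra replay penalty. Numbers are illustrative assumptions.

def effective_cycles_per_op(violation_rate: float, replay_penalty: int) -> float:
    """Average cycles per op given a violation rate and replay cost."""
    return 1.0 + violation_rate * replay_penalty

# Assume shaving the guardband allows a 10% higher clock, at the cost of
# a 0.1% violation rate with a 10-cycle replay penalty.
speedup = 1.10
cpo = effective_cycles_per_op(violation_rate=0.001, replay_penalty=10)
net_throughput = speedup / cpo
print(f"cycles/op with replay: {cpo:.3f}")
print(f"net throughput vs. worst-case margins: {net_throughput:.3f}x")
```

Under these (invented) numbers the replay overhead is about 1%, so most of the 10% frequency gain survives; the knob only pays off while the violation rate stays low, which is exactly why droop events would push a design back toward conservative timing.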
Even for in-order processors, the execution width could be adjusted to moderate power draw; this has been used for thermal throttling, but it might also apply to power-supply issues. For out-of-order processors, there may be more opportunities for scheduling flexibility to temporarily reduce power use. Some load operations (and other high-energy operations) are not especially latency-sensitive and could be delayed a cycle or two with little performance impact. (While instruction commit does not use that much energy, delaying commit may have little performance impact when buffers are not nearly full.)
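The scheduling flexibility mentioned above can be sketched as a greedy per-cycle power budget: latency-tolerant high-energy ops that do not fit the budget are deferred a cycle, while critical ops simply force the next cycle. This is a hypothetical illustration of the idea; the op costs, the budget, and the `schedule` function are all my own invention.

```python
# Toy power-budget-aware issue scheduler (illustrative assumptions only).
from collections import deque

def schedule(ops, budget_per_cycle):
    """ops: list of (name, power_cost, latency_tolerant).
    Issue greedily each cycle within the power budget; a latency-tolerant
    op that does not fit is deferred, a critical one forces a new cycle."""
    issued = []  # list of (cycle, name)
    cycle = 0
    pending = deque(ops)
    while pending:
        budget = budget_per_cycle
        deferred = deque()
        while pending:
            name, cost, tolerant = pending.popleft()
            if cost <= budget:
                budget -= cost
                issued.append((cycle, name))
            elif tolerant:
                deferred.append((name, cost, tolerant))  # retry next cycle
            else:
                pending.appendleft((name, cost, tolerant))  # must issue next cycle
                break
        pending = deferred + pending
        cycle += 1
    return issued

ops = [("add", 1, False), ("load_hi", 3, True), ("mul", 2, False), ("add2", 1, False)]
print(schedule(ops, budget_per_cycle=3))
# → [(0, 'add'), (0, 'mul'), (1, 'load_hi'), (2, 'add2')]
```

Here the high-energy load slips one cycle so the cheaper critical ops can share a cycle under the budget, smoothing peak draw at a small latency cost to the tolerant op.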
It seems that prediction and awareness of the fullness of capacitors could be used to narrow the response-time gap of voltage regulation. E.g., predicting that a phase of high memory activity is near could motivate an increase in power delivery. When a prediction of increased activity is wrong, it might even be useful to perform less useful or less urgent work (e.g., more speculative prefetching, eager writebacks, ECC scrubbing) to exploit the temporary surplus of power. (Work buffering/scheduling would seem to cooperate with energy buffering/scheduling.)
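The cooperation between energy buffering and work scheduling might be modeled as below. This is entirely my own toy model: the regulator pre-boosts supply ahead of a predicted demand burst, and when the prediction misses, the surplus charge is spent on deferrable work (prefetch, scrubbing) rather than being left stranded in a full capacitor.

```python
# Toy model of an on-die energy buffer coordinated with a demand
# predictor; all units and thresholds are illustrative assumptions.

def step(charge, cap, supply, predicted_demand, actual_demand):
    """One cycle: boost supply ahead of predicted demand, draw the actual
    demand, and spend any near-full surplus on opportunistic work."""
    boosted = supply + (1 if predicted_demand > supply else 0)  # pre-boost
    charge = min(cap, charge + boosted) - actual_demand
    opportunistic = 0
    if charge > 0.9 * cap:  # nearly full: burn surplus on deferrable work
        opportunistic = charge - 0.9 * cap
        charge -= opportunistic
    return charge, opportunistic

charge = 9.0
# Second cycle's prediction is wrong (burst predicted, none arrives).
for pred, actual in [(5, 5), (5, 0), (2, 2)]:
    charge, extra = step(charge, cap=10.0, supply=4, predicted_demand=pred, actual_demand=actual)
    print(f"charge={charge:.1f}, opportunistic work={extra:.1f}")
```

On the mispredicted cycle the buffer would otherwise saturate; the model instead converts the stranded charge into a unit of opportunistic work, which is the "low scarcity" case described above.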
Side questions: Do upper-layer MIM capacitors also provide a little thermal buffering and heat spreading? Could the extra step (expense) of a MIM layer be skipped on a per-wafer basis (e.g., if early testing indicates most parts on a wafer would test into SKUs that do not require the extra power decoupling)? (I could see this being impractical. The design might not be amenable to such a change even if the extra decoupling is not especially useful for some SKUs. The production system might be so well optimized that it is cheaper to do the extra step than to introduce more complex scheduling of stages and more extensive buffering. The early detection of wafer-scale SKU probabilities may be impractical.)