By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), August 28, 2022 5:50 pm
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on August 28, 2022 3:14 pm wrote:
> --- (---.delete@this.redheron.com) on August 28, 2022 2:24 pm wrote:
[snip]
>> It wouldn't be absolutely crazy if you're trying to save energy (I wouldn't roll my eyes if
>> I learned that Apple's small core likewise can Fetch up to 16 instructions a cycle -- might
>> as well get as much useful as you can in one gulp, then sleep Fetch for two or three cycles);
>
> That seems like the opposite of good energy efficiency to me. I doubt the small
> core would do that and also surprised about the big core if that is true of it.
If I understand correctly, modern processes prefer 128-bit-wide SRAM sub-arrays for area and power efficiency. With 32-bit instructions that is four per access, so a single 4-instruction fetch might be more energy efficient for two-wide decode than two 2-instruction fetches, even if the second fetch exploits way memoization. Using partial-tag way prediction without memoization would increase the per-access overhead.
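As a toy illustration of that comparison (every energy number below is invented and merely stands in for process- and design-specific values):

subarray_read = 1.0   # energy to activate one 128-bit sub-array (full row, however much is used)
tag_check     = 0.4   # tag read + compare for a normal access
memoized_way  = 0.1   # residual cost when the way is memoized from the prior access

wide   = subarray_read + tag_check                                     # one 4-instruction fetch
narrow = (subarray_read + tag_check) + (subarray_read + memoized_way)  # two 2-instruction fetches
print(f"one 4-wide fetch: {wide:.1f}   two 2-wide fetches: {narrow:.1f}")

Under these made-up costs the single wide fetch wins (1.4 vs. 2.1) because the second narrow fetch still pays for a full sub-array activation.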
(A tiny core might even have a unified L1 cache, in which case wide fetch would free cycles for data accesses, though banking could provide parallel access. With a unified L1, full sub-array width accesses for instructions and data may be desirable.)
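The cycle-freeing effect is just port accounting; a sketch (the widths are assumed, not known):

fetch_width, decode_width = 4, 2
ifetch_fraction = decode_width / fetch_width   # fraction of cycles the array must serve fetch
print(f"I-fetch occupies the array {ifetch_fraction:.0%} of cycles; "
      f"{1 - ifetch_fraction:.0%} remain free for data accesses")

With a 4-instruction fetch feeding a 2-wide decoder, half the cycles on a unified array would be left over for loads and stores.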
Having fetch run ahead of decode may hide fetch glitches. For example, two instructions may straddle sub-arrays, so limiting fetch to one sub-array access per cycle (for energy efficiency) might provide only one instruction in a given cycle; instruction buffering might even hide some cache miss latency.
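A toy simulation of that decoupling (the widths and the delivery pattern are invented for illustration):

from collections import deque

instr_buffer = deque()
delivered = [4, 4, 1, 4, 4, 1, 4, 4]  # instructions arriving per cycle; the 1s are "glitch" cycles
stalls = 0
for n in delivered:
    instr_buffer.extend([None] * n)   # fetch side fills the buffer
    if len(instr_buffer) >= 2:        # decode drains 2 per cycle
        instr_buffer.popleft(); instr_buffer.popleft()
    else:
        stalls += 1                   # decode starves this cycle
print("decode stall cycles:", stalls, "| buffer slack:", len(instr_buffer))

Because fetch runs ahead (4 per access against 2 consumed per cycle), the buffer accumulates slack and the glitch cycles cause zero decode stalls; with fetch matched exactly to decode width, each glitch would stall decode directly.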
If all of the instructions in a wide fetch were executed, a single wide fetch would (presumably) be more energy efficient than multiple narrower fetches.
I would be skeptical that 64-byte fetch would be the most energy efficient or even have the best energy-delay product; accessing multiple sub-arrays adds energy cost, and the average fraction of fetched instructions actually used would likely be smaller with such a wide fetch. (One might be able to use some of those instructions on branch mispredictions if they were stored in a small buffer, but branch mispredictions are relatively rare, so I suspect that would not be worthwhile for energy efficiency.)
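A rough geometric model shows how utilization falls off with width (the taken-branch rate is an assumption, not a measurement):

def expected_used(width, branch_every):
    p = 1.0 / branch_every   # per-instruction probability of a taken branch
    # probability that slot k is still on the sequential path: (1 - p)**(k - 1)
    return sum((1 - p) ** (k - 1) for k in range(1, width + 1))

for width in (4, 8, 16):
    print(f"width {width:2d}: ~{expected_used(width, branch_every=8):.1f} instructions used")

With a taken branch every 8 instructions, roughly 3.3 of 4, 5.3 of 8, but only about 7 of 16 fetched instructions would be used, so the useful fraction drops from about 83% to about 44% as fetch widens.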
It would be nice if someone with actual knowledge would chime in.