By: Simon Farnsworth (simon.delete@this.farnz.org.uk), June 10, 2022 1:58 am
Room: Moderated Discussions
Paul A. Clayton (paaronclayton.delete@this.gmail.com) on June 9, 2022 12:46 pm wrote:
[snip]
>
> I am not entirely convinced that branch delay slots were a bad design choice. Scanning ahead in the instruction
> stream and cache-fill instruction reordering have been proposed as microarchitectural methods to provide
> similar benefits, but aside from designers not thinking of those (I think CRISP did some runahead branch
> processing) I do not know what the area/complexity tradeoffs would have been. (Since compilers could not
> always usefully or even semi-usefully fill delay slots and the fall through path would be usefully executed
> in many cases, the actual benefit of delay slots was smaller than the ideal benefit, but I received the
> impression that even the actual performance benefit was significant at the time.)
>
> (The benefit of load delay slots seems more difficult to measure. The implementation cost of
> detecting a load-data hazard and stalling the pipeline one cycle (i.e., dynamically inserting
> a nop) may have hurt frequency. Delayed loads had no architecturally persistent effect — legacy
> software would run at a modest relative performance penalty on a microarchitecture without load
> delay — so that wrinkle in early MIPS implementations is not given much attention.)
>
> A more flexible software distribution format would have allowed delayed branches to be used without
> long-term architectural commitment; rescheduling binaries for different pipelines had been proposed.
> Even a very thin translation layer could provide substantial flexibility in encoding; one might
> even guarantee in-place translation at the granularity of functions (or "pages") at least for
> a 'generation' of implementations. (Of course, a flexible software distribution format would also
> facilitate competition; portable software may be viewed as bad for business.)
>
One of the lessons we have learnt over the last 40 years is that exposing microarchitectural details (ARM PC offset, delay slots of any form) in the architecture ends up creating a legacy problem.
While the first implementation benefits from those details being exposed, later implementations have to either be moderately incompatible with the original (e.g. the ARM PC offset, where StrongARM did something different), or handle both their own innate complexity (e.g. needing two instructions to execute after a branch, not just one) and the original implementation's oddity (the single delay slot).
It's worth noting that the GNU assembler for MIPS can already reorder instructions for you, moving an instruction into the delay slot where possible, or adding a NOP if not. An alternate history has the interchange format for MIPS binaries being a "byte code" format, where all the instructions are MIPS instructions but addresses in the binary are symbolic; on installation, the installer rewrites the byte code to exploit the target CPU's delay slots and fixes up all the addressing.
This doesn't allow for competition - your byte code is still very closely tied to the MIPS architecture, unless the competition basically implements your architecture on a different microarchitecture - but it does push some of the complexity into software, enabling you to build simpler hardware that doesn't have to handle the hazards itself. In theory, you'd be able to handle more complex hazards in software, too - e.g. if the pipeline grows, software can try to find two instructions to fill the delay slots, rather than needing hardware to do that.
And, of course, it leaves you with an architecture design that doesn't introduce fresh issues for later chip designs - you can implement a chip that just needs software to fix up the symbolic addressing.
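To make the rescheduling concrete, here's a minimal sketch in Python of the kind of delay-slot-filling pass involved - not real binutils or installer code, just an illustration under simplified assumptions. The instruction representation (text plus register read/write sets), the independence test, and the fill_delay_slots helper are all hypothetical, and a real pass would also have to respect memory and control hazards and then patch up the symbolic addresses afterwards.

NOP = ("nop", set(), set())  # (text, registers read, registers written)

def independent(a, b):
    # True if reordering a past b cannot change results, looking only at
    # register dependences (memory and control hazards are ignored here).
    a_reads, a_writes = a[1], a[2]
    b_reads, b_writes = b[1], b[2]
    return not (a_writes & (b_reads | b_writes) or b_writes & a_reads)

def fill_delay_slots(block, slots=1):
    # `block` is one basic block: straight-line instructions ending in a
    # branch. Return it rescheduled so the branch is followed by `slots`
    # delay-slot instructions, filled from earlier work or padded with NOPs.
    body, branch = list(block[:-1]), block[-1]
    fillers = []
    i = len(body) - 1
    while i >= 0 and len(fillers) < slots:
        cand = body[i]
        # Only hoist cand into a slot if it is independent of everything it
        # would move past: the later body instructions, the branch itself,
        # and any filler already chosen.
        if all(independent(cand, later)
               for later in body[i + 1:] + [branch] + fillers):
            fillers.append(body.pop(i))
        i -= 1
    fillers += [NOP] * (slots - len(fillers))
    return body + [branch] + fillers

# One-slot pipeline: the addiu is independent of the branch, so it fills the
# slot. With slots=2 (a hypothetical deeper pipeline) the second slot falls
# back to a NOP, because the lw feeds the branch and cannot move past it.
block = [
    ("lw    $t1, 0($a0)",      {"a0"},         {"t1"}),
    ("addiu $t0, $t0, 4",      {"t0"},         {"t0"}),
    ("beq   $t1, $zero, done", {"t1", "zero"}, set()),
]
for insn in fill_delay_slots(block, slots=1):
    print(insn[0])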