By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), June 9, 2022 12:46 pm
Room: Moderated Discussions
anon2 (anon.delete@this.anon.com) on June 7, 2022 10:21 pm wrote:
[snip]
> ARM I think had no virtual memory and SPARC had SW loaded TLB too. I think Alpha also did but that
> may have been behind PALcode so arguably a better interface but the uarch implementation still a
> SW loaded TLB. Not sure about the 801/ROMP, maybe it had SW managed segments? So of the pre-90s
> crop of RISCs, AFAIK none had hardware loaded page tables (anyone have a counter-example?).
The Fairchild Clipper may not qualify as RISC by John Mashey's standard ("Finally, to be fair, let me add the two cases that I knew of that were more on the borderline: i960 and Clipper"), given its variable-length instructions built from 16-bit parcels and its 16 application GPRs, but in 1986 it used Cache-MMU chips that walked the classic 32-bit hierarchical page table (4 KiB pages, 32-bit PTEs). It did have separate supervisor and user root pointers, and since the data and instruction caches were separate, one could in theory configure separate instruction and data address spaces.
I suspect the hardware area overhead of a table walker for a hierarchical page table would not have been extraordinarily great in 1986. A hierarchical page table is, however, not well suited to MIPS' support for power-of-four page size choices. (Caching intermediate table nodes also seems to have been a late-appearing feature; linear/flat virtual page tables used TLB-based caching of intermediate nodes fairly early, which works well with software TLB fill, but a similar mechanism does not seem to have been implemented for x86 until the 2000s, and AMD initially implemented a special cache addressed by physical memory address.)
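As a rough illustration of how little such a walker has to do, here is a minimal C sketch of the classic two-level walk mentioned above (4 KiB pages, 32-bit PTEs, a 10+10+12-bit split of the virtual address). The PTE layout (valid bit in bit 0, frame number in the upper 20 bits) and the simulated physical memory are invented for the example; they are not the Clipper's or any other specific machine's formats.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define MEM_WORDS (64 * 1024)            /* 256 KiB of simulated physical memory */
    static uint32_t mem[MEM_WORDS];

    static uint32_t phys_read32(uint32_t paddr) { return mem[paddr / 4]; }

    #define PTE_VALID  0x1u
    #define FRAME_MASK 0xFFFFF000u

    /* Two-level walk: 10-bit root index, 10-bit leaf index, 12-bit page offset. */
    static bool walk(uint32_t root_paddr, uint32_t vaddr, uint32_t *paddr_out)
    {
        uint32_t l1 = phys_read32(root_paddr + 4 * ((vaddr >> 22) & 0x3FF));
        if (!(l1 & PTE_VALID)) return false;                /* would raise a page fault */
        uint32_t l2 = phys_read32((l1 & FRAME_MASK) + 4 * ((vaddr >> 12) & 0x3FF));
        if (!(l2 & PTE_VALID)) return false;
        *paddr_out = (l2 & FRAME_MASK) | (vaddr & 0xFFF);   /* frame number + page offset */
        return true;                                        /* result would be loaded into the TLB */
    }

    int main(void)
    {
        /* Build a minimal table: root at 0x1000, one leaf table at 0x2000,
           mapping virtual addresses from 0x00400000 to physical frame 0x3000. */
        uint32_t root = 0x1000, pa;
        mem[(root + 4 * 1) / 4]   = 0x2000 | PTE_VALID;     /* root index 1 -> leaf table */
        mem[(0x2000 + 4 * 0) / 4] = 0x3000 | PTE_VALID;     /* leaf index 0 -> frame 0x3000 */
        if (walk(root, 0x00400123, &pa))
            printf("0x00400123 -> 0x%08X\n", pa);           /* prints 0x00003123 */
        return 0;
    }

In hardware this amounts to a small state machine issuing two dependent memory reads plus a bit of address formation and a valid-bit check, which is part of why I suspect the area cost in 1986 was modest.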
I think there was more uncertainty in the late 1980s both about page size (Alpha tried to support growing the page size and, if I recall correctly, even tried to push that as a solution to avoid adding another look-up level) and about page table formats (there may have been expectations that 64-bit address spaces with substantial inter-program/thread memory sharing for communication would lead to sparse address space use, for which ordinary hierarchical page tables are less suited, especially without caching of intermediate table nodes).
I am not entirely convinced that branch delay slots were a bad design choice. Scanning ahead in the instruction stream and cache-fill instruction reordering have been proposed as microarchitectural methods to provide similar benefits, but aside from the possibility that designers simply did not think of them (I believe CRISP did some runahead branch processing), I do not know what the area/complexity tradeoffs would have been. (Since compilers could not always usefully or even semi-usefully fill delay slots and the fall-through path would be usefully executed in many cases, the actual benefit of delay slots was smaller than the ideal benefit, but I received the impression that even the actual performance benefit was significant at the time.)
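For those who never programmed with them, the visible semantics of a one-slot delayed branch are easy to model. The toy C interpreter below is not any real ISA, just an illustration: the instruction after the branch executes unconditionally and the redirect takes effect one instruction late.

    #include <stdio.h>

    /* Toy ISA for illustration only: ADD rd = rs + imm; BEQZ branches to an
       absolute instruction index if rs is zero, with one delay slot. */
    enum op { OP_ADD, OP_BEQZ, OP_HALT };
    struct insn { enum op op; int rd, rs, imm; };

    int main(void)
    {
        int r[4] = {0};
        struct insn prog[] = {
            {OP_BEQZ, 0, 0, 3},   /* r0 == 0, so the branch is taken (target index 3) */
            {OP_ADD,  1, 1, 5},   /* delay slot: executes even though the branch is taken */
            {OP_ADD,  2, 2, 9},   /* skipped */
            {OP_HALT, 0, 0, 0},
        };
        int pc = 0, next_pc = 1;
        for (;;) {
            struct insn i = prog[pc];
            int redirect = -1;
            if (i.op == OP_HALT) break;
            if (i.op == OP_ADD)  r[i.rd] = r[i.rs] + i.imm;
            if (i.op == OP_BEQZ && r[i.rs] == 0) redirect = i.imm;
            /* The sequential successor (the delay slot) always runs; a taken
               branch's target only takes effect one instruction later. */
            pc = next_pc;
            next_pc = (redirect >= 0) ? redirect : pc + 1;
        }
        printf("r1=%d r2=%d\n", r[1], r[2]);   /* prints r1=5 r2=0 */
        return 0;
    }

The compiler's job is to put something useful in that slot; when it cannot, a nop goes there instead, which is where the gap between the ideal and the actual benefit comes from.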
(The benefit of load delay slots seems more difficult to measure. The implementation cost of detecting a load-data hazard and stalling the pipeline one cycle (i.e., dynamically inserting a nop) may have hurt frequency. Delayed loads had no architecturally persistent effect — legacy software would run at a modest relative performance penalty on a microarchitecture without load delay — so that wrinkle in early MIPS implementations is not given much attention.)
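To be clear about what that hazard detection involves, here is a C sketch of the comparison an interlocked pipeline makes before issuing the instruction behind a load; the structure and field names are hypothetical, not taken from any particular design. The frequency worry is about fitting this comparison and the resulting stall into the pipeline's timing, not about the handful of gates it takes.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical decoded-instruction fields, not any real pipeline's encoding. */
    struct decoded {
        bool    is_load;
        uint8_t rd;          /* destination register */
        uint8_t rs1, rs2;    /* source registers */
    };

    /* True when the younger (decode-stage) instruction must stall one cycle
       because it reads a register an older load has not yet written. */
    static bool load_use_stall(const struct decoded *ex, const struct decoded *id)
    {
        return ex->is_load &&
               ex->rd != 0 &&                   /* register 0 is hardwired to zero */
               (ex->rd == id->rs1 || ex->rd == id->rs2);
    }

    int main(void)
    {
        struct decoded load = { .is_load = true, .rd = 8 };   /* load into r8 */
        struct decoded user = { .rs1 = 8, .rs2 = 9 };         /* add using r8 and r9 */
        printf("stall: %s\n", load_use_stall(&load, &user) ? "yes" : "no");
        return 0;
    }

With an architectural load delay slot none of this is needed, since the ISA promises that the instruction after the load does not consume its result.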
A more flexible software distribution format would have allowed delayed branches to be used without long-term architectural commitment; rescheduling binaries for different pipelines had been proposed. Even a very thin translation layer could provide substantial flexibility in encoding; one might even guarantee in-place translation at the granularity of functions (or "pages") at least for a 'generation' of implementations. (Of course, a flexible software distribution format would also facilitate competition; portable software may be viewed as bad for business.)
> Which as you say is fine and likely the right choice for the time. Architecting a hardware TLB reload facility
> takes little more than specifying page table formats and memory access rules, and small number (possibly one)
> of additional registers and mode bits to specify page table base and control some MMU operation modes. So
> it was completely reasonable to leave that out while implementations were doing SW reloading anyway.
Specifying a page table format after software has been using one or more formats may displease those whose format was not chosen (and in any case software TLB-fill implementations had different cost tradeoffs, so a good hardware format might reasonably differ from any of them).
The TLB design will also influence page table and PTE formats. If the early TLBs support multiple page sizes (even if the size is fixed per address range, as in Itanium), this will influence the formats. Even a single-page-size TLB would influence page table format choices, encouraging hierarchical page table levels sized to that page size.
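To make that last point concrete: if each page table node occupies one page and holds fixed-size PTEs, the page size and PTE size together determine how many virtual-address bits each level translates. The configurations in the small C calculation below are illustrative examples, not claims about particular architectures.

    #include <stdio.h>

    int main(void)
    {
        /* Page-sized table nodes: entries per node = page size / PTE size,
           and each level translates log2(entries) virtual-address bits. */
        struct { unsigned page_bytes, pte_bytes; } cfg[] = {
            {4096, 4},   /* 4 KiB pages, 32-bit PTEs */
            {8192, 8},   /* 8 KiB pages, 64-bit PTEs */
            {4096, 8},   /* 4 KiB pages, 64-bit PTEs */
        };
        for (unsigned i = 0; i < sizeof cfg / sizeof cfg[0]; i++) {
            unsigned entries = cfg[i].page_bytes / cfg[i].pte_bytes;
            unsigned bits = 0;
            while ((1u << bits) < entries) bits++;
            printf("%5u-byte page, %u-byte PTE: %4u entries, %2u bits per level\n",
                   cfg[i].page_bytes, cfg[i].pte_bytes, entries, bits);
        }
        return 0;
    }

With 4 KiB nodes and 4-byte PTEs each level maps 10 bits, which is how one arrives at the classic two-level table for a 32-bit address space; change the page size or the PTE size and the natural depth and shape of the table changes with it.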