By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), March 24, 2021 10:37 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on March 23, 2021 12:47 pm wrote:
> Andrew Clough (someone.delete@this.somewhere.com) on March 22, 2021 4:27 pm wrote:
>> https://millcomputing.com/docs/
>
> They've been banging on about The Mill seemingly forever, without
> ever building any hardware. It's fairly nutty stuff with obvious
> problems handling any branchy code, which is pretty much the definition
> of what a "general-purpose" computer needs to be able to do well.
While I am skeptical that any Mill hardware will ever be available for sale, I would not call it vaporware (the term has a connotation of intentional non-delivery, not just failure from unwise overambition). Any advantages of the design seem unlikely to counter the disadvantages of a new architecture. Since physical implementation is intended to be highly automated, performance will not match more highly tuned designs even if it is better than purely synthesized designs. In addition, I suspect perfect source-code compatibility will not be guaranteed; implementing a fully Linux-compatible subsystem may not be possible with single-address-space hardware (ASIDs attached to reduced virtual addresses would support homonyms and copy-on-write, but general synonym handling seems problematic).
I would also not call any aspects of the design "fairly nutty"; the Mill has the feel of a comprehensive design rather than a design by committee. E.g., the (random-read) queue (the Belt) fits the concept of a forwarding network (the physical design is intended to be similar) with the advantage of limiting operand lifetimes (and the disadvantage of having to explicitly preserve long-lived values), but it also provides "register rotation" facilitating software pipelining of loops. The split fetch is a straightforward application of diverge-from-the-middle storage allocation (slightly reminiscent of Heidi Pan's converge-toward-the-middle Heads-and-Tails mechanism for code density) and theoretically allows doubling Icache capacity at a given latency.
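As a rough illustration of the Belt's temporal addressing (a toy C model of my own, not anything from Mill Computing): results are dropped at the front and addressed by age, so a value needed beyond the belt length must be explicitly re-dropped or saved to the scratchpad.

#include <stdio.h>

#define BELT_LEN 8   /* illustrative size, not a real Mill member's */

typedef struct { int vals[BELT_LEN]; } Belt;

/* Dropping a new result makes everything else one position older;
   the oldest value falls off the end and is lost. */
static void belt_drop(Belt *b, int v) {
    for (int i = BELT_LEN - 1; i > 0; i--)
        b->vals[i] = b->vals[i - 1];
    b->vals[0] = v;
}

/* Operands are named by temporal position: 0 = newest. */
static int belt_get(const Belt *b, int pos) { return b->vals[pos]; }

int main(void) {
    Belt b = {{0}};
    belt_drop(&b, 10);                /* result of op A, now at position 0 */
    belt_drop(&b, 20);                /* result of op B; A slides to position 1 */
    /* "add b1, b0" consumes A and B by position and drops the sum. */
    belt_drop(&b, belt_get(&b, 1) + belt_get(&b, 0));
    printf("newest = %d\n", belt_get(&b, 0));   /* 30 */
    /* A value needed much later must be re-dropped or saved to the
       scratchpad before BELT_LEN newer results push it off the end. */
    return 0;
}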
With respect to branchy code, extreme width is intended to allow predication (via select operations) to manage short branches. (This design choice seems reasonable under the assumption that execution units are cheap — although they are, operand routing is not as cheap — and avoids removing even some generally predictable branches.) The use of trace scheduling and extended basic blocks (EBBs) — further extended by counting function calls as simple operations rather than as block exits — would allow significant management of branches; these methods are old VLIW developments. Unlike some VLIW design philosophies, dynamic branch prediction (exit prediction, i.e., predicting where the EBB will be left and what the next EBB will be) is assumed. The strictly static scheduling does force stalling on loads whose scheduled delay is smaller than the actual delay due to cache misses or inadequately early scheduling.
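For a concrete picture of what select-based predication does for a short branch (plain C if-conversion, not Mill code; on the Mill the select would pick between belt values): both arms are computed and the condition only picks a result, so there is no branch left to predict.

/* Branchy form: a conditional branch the predictor must handle. */
int abs_branchy(int x) {
    if (x < 0)
        x = -x;
    return x;
}

/* If-converted form: both arms computed, a select picks the result;
   no control flow, but extra work is done on every execution. */
int abs_selected(int x) {
    int neg = -x;                 /* speculative arm */
    return (x < 0) ? neg : x;     /* select; no branch to mispredict */
}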
I believe the Mill weights various tradeoffs differently than more conventional systems do; adequate effectiveness could be achieved in general, and some workloads would have substantial advantages.
Treating interrupts as function calls with arguments pushed from the interrupt source seems like an attractive feature for reducing I/O overhead. Deferring translation to farther out in the memory hierarchy seems attractive from an energy efficiency and near-core area perspective. Sharing translation tables may also provide some efficiencies. Using permission segments for certain memory regions that are considered part of a thread's context (so preloaded) seems helpful. (I suspect that the extra overhead of larger tags from using virtual addresses would still be less than the overhead, particularly in energy, from TLB accesses for cache hits. If it was not, compression might be used.)
Hardware COW of zero pages (requiring a hardware-readable free list) is an advantage that could be somewhat trivially applied to more conventional architectures, and even the automatic zeroing of stack frames might not be very difficult to add to ARM, x86, RISC-V, or Power. The use of fine-grained valid bits (avoiding read-for-ownership) is entirely microarchitectural. (I am not convinced that byte-granular validity is universally worthwhile, particularly for L2 and beyond when implemented without invalid-byte compression. Even with compression, I suspect that limited availability of byte-granular caching might be better, since byte granularity is mostly useful for contiguous string writes.)
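A minimal sketch of how byte-granular valid bits let a store allocate a line without read-for-ownership (my own toy model, with invented structure sizes):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE 64

typedef struct {
    uint8_t  data[LINE];
    uint64_t valid;               /* one valid bit per byte */
} Line;

/* A store marks just the written bytes valid; nothing else in the
   line needs to be fetched (no read-for-ownership). */
static void store_bytes(Line *l, int off, const uint8_t *src, int n) {
    memcpy(&l->data[off], src, n);
    for (int i = 0; i < n; i++)
        l->valid |= 1ull << (off + i);
}

/* Only a load of a not-yet-valid byte forces a fill from the next
   level (not modeled here). */
static int load_byte(const Line *l, int off, uint8_t *out) {
    if (!((l->valid >> off) & 1))
        return 0;
    *out = l->data[off];
    return 1;
}

int main(void) {
    Line l = {{0}, 0};
    const uint8_t msg[4] = {1, 2, 3, 4};
    store_bytes(&l, 8, msg, 4);   /* write-allocate without RFO */
    uint8_t c;
    printf("byte 9: %s\n", load_byte(&l, 9, &c) ? "valid" : "needs fill");
    printf("byte 0: %s\n", load_byte(&l, 0, &c) ? "valid" : "needs fill");
    return 0;
}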
The extensive use of skewed pipelining is integrated into the architectural orientation; wide decode is facilitated by allowing later-stage-execution operations to be decoded later (so earliest-stage operations are parsed out first and the parsing and decoding of later-executed operations can be delayed). Skewed pipelines can be beneficial in more conventional architectures (e.g., to reduce load-to-compute-use latency), and dynamic scheduling can use second-chance execution (e.g., branch conditions might be evaluated in two different pipeline stages, which is particularly attractive for early-load/late-compute skewed pipelines — the counterflow pipeline is an extreme example of second-chance pipeline design). The Mill's skewed pipeline also allows data dependencies within an instruction bundle in some cases. (Cascaded ALUs are a special skewing case that constrains communication and would be more beneficial in a very "wide" design.)
The hardware-managed save and restore of temporary state at function calls facilitates dynamic scheduling of these operations with buffering. The more architecturally explicit stack and "scratchpad" storage may also facilitate distinct storage management since the reuse distance is more easily predicted by hardware. (There are significant similarities to Itanium's Register Stack and NaT metadata storage, but Itanium never implemented dynamic scheduling of spill and fill — the last implementation did provide more GPRs so that overflow/underflow would be less common, but I do not think "speculative" fill/spill was supported — which seems to be assumed for the Mill.)
While no mention has been made in the presentations, in theory specifying load delay (criticality) could be useful for bank conflict management or selective tag-data phased access (when the scheduled load latency is higher than the minimum L1 latency). (Itanium had hints for cache locality, but such hints probably had limited potential utility and saw even less actual use.)
The use of an intermediate software distribution format solves one problem faced by statically scheduled processors: variation in operation latency, execution width, etc. While width-portable software can be generated, using a stop indicator to mark the boundary between potential issue groups, performance will generally not be very portable (e.g., software compiled for a 7-wide implementation may have many 7-wide issue groups that must be split into two cycles on a 6-wide implementation). An intermediate format also facilitates materialization at scheduled completion (used by early VLIW) rather than materialization at minimum latency (used by Itanium), reducing "register" pressure. This also facilitates bug fixes. This design does discourage flexible use of core heterogeneity since multiple versions of machine code would be needed, though hardware translation to a wider core (with lower used width) might be practical in some cases.
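To make the width-portability cost concrete (my own arithmetic, not Mill numbers): a schedule packed into 7-wide issue groups loses nearly half its issue efficiency on a 6-wide member, since most groups must be split into two cycles.

#include <stdio.h>

/* Each issue group of size s takes ceil(s / width) cycles on a
   machine of the given width. */
static int cycles_needed(const int *group_sizes, int n, int width) {
    int cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += (group_sizes[i] + width - 1) / width;
    return cycles;
}

int main(void) {
    /* Issue-group sizes from a hypothetical 7-wide schedule. */
    int groups[] = {7, 7, 7, 3, 7};
    int n = sizeof groups / sizeof groups[0];
    printf("7-wide: %d cycles\n", cycles_needed(groups, n, 7));  /* 5 */
    printf("6-wide: %d cycles\n", cycles_needed(groups, n, 6));  /* 9 */
    return 0;
}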
Specifying pointer loads presents some opportunities for prefetching, and with virtually tagged outer caches and shared (single-address-space) translation this could be pushed farther out in the memory hierarchy. (E.g., a pointer load that misses in L1 might send a prefetch indirection request with the demand request to L2, which might pass such to L3 and even to memory.) Virtually addressed caches also facilitate skewed associativity, which can provide lower execution-time jitter or facilitate lower associativity for a given conflict rate (cuckoo caching can increase this benefit).
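A toy sketch of skewed-associative indexing (the hash functions are placeholders I made up, not anything from the Mill documentation): each way indexes with a different hash of the line number, so two lines that conflict in one way usually land in different sets of the other.

#include <stdint.h>
#include <stdio.h>

#define SETS 256   /* sets per way, illustrative */

static unsigned index_way0(uint64_t line_num) {
    return (unsigned)(line_num % SETS);
}
static unsigned index_way1(uint64_t line_num) {
    /* mix in higher bits so way-1 indices differ from way-0 */
    return (unsigned)((line_num ^ (line_num >> 8)) % SETS);
}

int main(void) {
    /* Two line numbers exactly SETS apart conflict in way 0 ... */
    uint64_t a = 0x1000, b = 0x1000 + SETS;
    printf("way0: %u vs %u\n", index_way0(a), index_way0(b));  /* equal */
    /* ... but map to different sets in way 1, so both can stay resident. */
    printf("way1: %u vs %u\n", index_way1(a), index_way1(b));  /* differ */
    return 0;
}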
> Back around 2017 they were talking big about having a simulator and
> working on an fpga implementation. As far as I can tell absolutely
> none of that ever happened, at least in public, and it has all gone
> very quiet for 3 years or so.
>
> If it looks like a turkey, and it gobbles like a turkey ...
Since Mill Computing was previously entirely funded internally by the founders and the founders' time was not an expense, the early slow development (much of which was before public announcements) is somewhat understandable. There was also significant redirection of effort from prioritizing patents (due to the first-to-file policy) and from compiler front- and middle-end changes (initially a commercial compilation system was used, then LLVM was adopted). I get the impression that even with external funding, staffing has not exploded. Sweat equity was used for some work, but I get the impression that such did not double development speed. I also get the impression that a lot of architectural exploration work is still expected; compiler developments will interact with general hardware architecture design, so progress should be expected to be somewhat slow. (Not so long ago, a comp.arch post mentioned that early simulations gave the strange result that an L1 instruction cache was not useful, at least for latency. This implies that significant exploration is still being done.)
I object more to the lack of reference to VLIW precedents. E.g., deferred and pick-up loads are very similar to advanced loads of Itanium and using predication to improve static scheduling under wide issue comes from early VLIW work. Part of this I suspect comes from ignorance (the principals have admitted to having reinvented concepts, wishing that they had known of them beforehand), but part of it probably also comes from marketing: revealing similarities to Itanium would discourage funding.
The claims of a ten-fold power-performance-area benefit based on DSP efficiency applied to general-purpose code also seem problematic. This is not helped by not specifying same-performance, same-area, or same-power; a conventional architecture might achieve ten-times better PPA by halving performance (e.g., if halving performance allowed roughly a quarter of the power and a fifth of the area, performance per power per area would come out ten times better).
Architecturally, I am disappointed by the lack of speculation and lack of interest in multithreaded cores (static scheduling constrains but does not prohibit such). I also think that insufficient attention is being paid to the communication problem within the core and between threads.
scheduled delay: Technically, the delay can be somewhat dynamic by using an operation to mark when the load must materialize (called a "pick-up load") and by exploiting the fact that a function call saves and restores pending loads, so that time spent in such function calls is not counted in the scheduled delay. Even an ordinary ("deferred", i.e., explicit-delay) load can allow some limited extra latency hiding/memory-level parallelism, since stall cycles due to a late result from a load initiated before (or in parallel with) a second load are not counted in the scheduled delay of the second load.
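In toy form (illustrative latencies, not Mill specifications), the stall cost of a deferred load is simply the amount by which the actual latency exceeds the scheduled delay:

#include <stdio.h>

/* The compiler schedules the load D cycles before the consuming op
   ("scheduled delay"); the core stalls only for the uncovered part. */
static int stall_cycles(int scheduled_delay, int actual_latency) {
    int extra = actual_latency - scheduled_delay;
    return extra > 0 ? extra : 0;
}

int main(void) {
    printf("L1 hit  (3 cy), delay  3: stall %d\n", stall_cycles(3, 3));   /* 0 */
    printf("L2 hit (12 cy), delay  3: stall %d\n", stall_cycles(3, 12));  /* 9 */
    printf("L2 hit (12 cy), delay 14: stall %d\n", stall_cycles(14, 12)); /* 0 */
    return 0;
}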
effectiveness: I am making a distinction from performance since the lower physical design effort implies a significant performance penalty. Compared to a design with similar customization for a more conventional architecture, the greater width of a Mill implementation would provide greater performance on workloads with very high ILP. If "general" workloads have good enough effectiveness (80%? 90%? relative to aggressive out-of-order), the advantage in high-ILP workloads may be attractive when highly customized physical design is not practical — e.g., when specific acceleration functions are included in the core. Even ARM does not provide mostly hardened/optimized designs with communication and decode hooks to an accelerator block, probably in part because many uses of acceleration can be managed as memory-mapped I/O (latency and bandwidth of communication with the core are often not extremely critical).