never-ware != vaporware (at least in connotation)

By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), March 24, 2021 10:37 am
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on March 23, 2021 12:47 pm wrote:
> Andrew Clough (someone.delete@this.somewhere.com) on March 22, 2021 4:27 pm wrote:
>> https://millcomputing.com/docs/
>
> They've been banging on about The Mill seemingly forever, without
> ever building any hardware. It's fairly nutty stuff with obvious
> problems handling any branchy code, which is pretty much the definition
> of what a "general-purpose" computer needs to be able to do well.

While I am skeptical that any Mill hardware will ever be available for sale, I would not call such vaporware (the term has a connotation of intentional non-delivery and not just failure from unwise overambition). Any advantages of the design seem unlikely to counter the disadvantages of a new architecture. Since physical implementation is intended to be highly automated, performance will not match more highly tuned designs even if better than purely synthesized designs. In addition, I suspect perfect source-code compatibility will not be guaranteed; implementing a fully Linux-compatible subsystem may not be possible with single address space hardware (ASIDs attached to reduced virtual addresses would support homonyms and copy-on-write, but general synonym handling seems problematic).

I would also not call any aspects of the design "fairly nutty"; the Mill has a feel of a comprehensive design rather than a design by committee. E.g., the (random-read) queue (Belt) fits the concept of a forwarding network (the physical design is intended to be similar) with the advantage of limiting operand lifetimes (with the disadvantage of having to explicitly preserve long-lived values), but this also provides "register rotation" facilitating software pipelining of loops. The split fetch is a straightforward application of diverge from the middle storage allocation (slightly reminiscent of Heidi Pan's converge toward the middle Heads-and-Tails mechanism for code density) and theoretically allows doubling Icache capacity at a given latency.
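A minimal sketch of the Belt as a random-read queue may make the operand-lifetime tradeoff concrete. The class name, the belt length of 4, and the values are assumptions chosen for illustration; nothing here is taken from Mill Computing's specifications.

```python
from collections import deque


class Belt:
    """Toy model of a fixed-length belt: each result is pushed at the front,
    operands are read by position, and the oldest value falls off the end."""

    def __init__(self, length=4):  # length is a hypothetical parameter
        self.slots = deque(maxlen=length)

    def push(self, value):
        # On a full deque, appendleft discards the rightmost (oldest) item.
        self.slots.appendleft(value)

    def read(self, position):
        # Position 0 is the most recently produced result.
        return self.slots[position]


belt = Belt(length=4)
for v in (10, 20, 30):
    belt.push(v)
belt.push(belt.read(0) + belt.read(2))  # 30 + 10
belt.push(belt.read(0) * 2)             # the value 10 now falls off the belt
```

After the last push the value 10 is gone, so a program that still needed it would have had to save it explicitly beforehand; that is the cost of the limited operand lifetime.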

With respect to branchy code, extreme width is intended to allow predication (via select operations) to manage short branches. (This design choice seems reasonable under the assumption that execution units are cheap, which they are, though operand routing is less so, and avoids removing even some generally predictable branches.) The use of trace scheduling and extended basic blocks (EBBs), further extended by counting function calls as simple operations rather than as block exits, would allow significant management of branches; these methods are old VLIW developments. Unlike some VLIW design philosophies, dynamic branch prediction (exit prediction, i.e., predicting where the EBB will be left and what the next EBB will be) is assumed. The strictly static scheduling does force stalling on loads whose scheduled delay is smaller than the actual delay due to cache misses or inadequately early scheduling.
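Managing short branches with select operations amounts to if-conversion: both candidate values are computed unconditionally and a select picks one. The sketch below uses a hypothetical select() helper and a clamp function as the example; it illustrates the general technique, not Mill code.

```python
def select(cond, if_true, if_false):
    """Stand-in for a select operation: both inputs are already computed."""
    return if_true if cond else if_false


def clamp_branchy(x, lo, hi):
    # Two short, data-dependent branches.
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_if_converted(x, lo, hi):
    # Both comparisons are evaluated unconditionally (on a wide machine they
    # could share an issue cycle); selects then pick the result without a branch.
    below = x < lo
    above = x > hi
    t = select(above, hi, x)
    return select(below, lo, t)
```

The converted form always does the work of both paths, which is only attractive when spare execution width is cheap.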

I believe the Mill weights various tradeoffs differently from more conventional systems; adequate effectiveness could be achieved in general, and some workloads would see substantial advantages.

Treating interrupts as function calls with arguments pushed from the interrupt source seems like an attractive feature for reducing I/O overhead. Deferring translation to farther out in the memory hierarchy seems attractive from an energy efficiency and near-core area perspective. Sharing translation tables may also provide some efficiencies. Using permission segments for certain memory regions that are considered part of a thread's context (so preloaded) seems helpful. (I suspect that the extra overhead of larger tags from using virtual addresses would still be less than the overhead, particularly in energy, from TLB accesses for cache hits. If it was not, compression might be used.)

Hardware COW of zero pages (requiring a hardware-readable free list) is an advantage that could be somewhat trivially applied to more conventional architectures, and even the automatic zeroing of stack frames might not be very difficult to add to ARM, x86, RISC-V, or Power. The use of fine-grained valid bits (avoiding read-for-ownership) is entirely microarchitectural. (I am not convinced that byte-granular validity is universally worthwhile, particularly for L2 and beyond when implemented without invalid-byte compression. Even with compression, I suspect that limited availability of byte-granular caching might be better, given that byte granularity is usually used for contiguous string writes.)
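How byte-granular valid bits avoid read-for-ownership can be sketched as follows. The 8-byte line size, the Line class, and its method names are assumptions made to keep the example small; real hardware would differ in many details.

```python
LINE_BYTES = 8  # hypothetical, deliberately tiny line size


class Line:
    def __init__(self):
        self.data = bytearray(LINE_BYTES)
        self.valid = [False] * LINE_BYTES

    def write(self, offset, payload):
        # A store allocates the line without fetching it first; it only
        # marks the bytes it actually wrote as valid.
        for i, b in enumerate(payload):
            self.data[offset + i] = b
            self.valid[offset + i] = True

    def merge_from_memory(self, memory_line):
        # Later (on a read of an invalid byte or on eviction), only the
        # still-invalid bytes are filled from the memory copy.
        for i in range(LINE_BYTES):
            if not self.valid[i]:
                self.data[i] = memory_line[i]
                self.valid[i] = True


line = Line()
line.write(2, b"hi")                  # no fetch needed for the store
line.merge_from_memory(b"ABCDEFGH")   # written bytes are preserved
```

The per-byte valid array is the storage overhead questioned above: one bit per byte at every cache level that supports it.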

The extensive use of skewed pipelining is integrated into the architectural orientation; wide decode is facilitated by allowing later-stage-execution operations to be decoded later (so earliest-stage operations are parsed out first and the parsing (and decoding) of later-executed operations can be delayed). Skewed pipelines can be beneficial in more conventional architectures (e.g., to reduce load-to-compute-use latency) and dynamic scheduling can use second-chance execution (e.g., branch conditions might be evaluated in two different pipeline stages, which is particularly attractive for early-load/late-compute skewed pipelines; a counterflow pipeline is an extreme example of second-chance pipeline design). The Mill's skewed pipeline also allows data dependencies within an instruction bundle in some cases. (Cascaded ALUs are a special skewing case that constrains communication and would be more beneficial in a very "wide" design.)

The hardware-managed save and restore of temporary state at function calls facilitates dynamic scheduling of these operations with buffering. The more architecturally explicit stack and "scratchpad" storage may also facilitate distinct storage management since the reuse distance is more easily predicted by hardware. (There are significant similarities to Itanium's Register Stack and NaT metadata storage, but Itanium never implemented dynamic scheduling of spill and fill, which seems to be assumed for the Mill; the last implementation did provide more GPRs so that overflow/underflow would be less common, but I do not think "speculative" fill/spill was supported.)

While no mention has been made in the presentations, in theory specifying load delay (criticality) could be useful for bank conflict management or selective tag-data phased access (when scheduled load latency is higher than minimum L1 latency). (Itanium had hints for cache locality, but these probably had limited potential utility and saw even less actual use.)

The use of an intermediate software distribution format solves one problem faced with statically scheduled processors, variation in latency of operations, execution width, etc. While width-portable software can be generated, using a stop indicator to mark the boundary between potential issue groups, performance will generally not be very portable (e.g., software compiled for a 7-wide implementation may have many 7-wide issue groups that must be split into two cycles in a 6-wide implementation). An intermediate format also facilitates materialize at scheduled completion (used by early VLIW) rather than materialize at minimum latency (used by Itanium), reducing "register" pressure. This also facilitates bug fixes. This design does discourage flexible use of core heterogeneity since multiple versions of machine code would be needed, though hardware translation to a wider core (with lower used width) might be practical in some cases.
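The width-portability penalty for stop-bit-delimited issue groups is simple arithmetic. The sketch below assumes that a group wider than the machine is split into ceil(size / width) cycles and that nothing else limits issue; the list of group sizes is made up for illustration.

```python
from math import ceil


def cycles_needed(group_sizes, width):
    """Cycles to issue each group when a group larger than the machine
    width must be split across consecutive cycles."""
    return sum(ceil(size / width) for size in group_sizes)


groups = [7, 7, 3, 7, 5]               # hypothetical issue-group sizes
on_7_wide = cycles_needed(groups, 7)   # every group fits in one cycle
on_6_wide = cycles_needed(groups, 6)   # each 7-wide group takes two cycles
```

With these numbers, losing one issue slot out of seven raises the cycle count from 5 to 8, because each second cycle of a split group issues only a single operation.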

The specifying of pointer loads presents some opportunities for prefetching, and with virtually tagged outer caches and shared (single address space) translation this could be pushed farther out in the memory hierarchy. (E.g., a pointer load that misses in L1 might send a prefetch indirection request with the demand request to L2 (which might pass such to L3 and even to memory).) Virtually addressed caches also facilitate skewed associativity, which can provide lower execution time jitter or facilitate lower associativity for a given conflict rate (cuckoo caching can increase this benefit).
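Skewed associativity indexes each way with a different hash of the address, so lines that collide in one way usually do not collide in another. In the sketch below the set count and the XOR fold are arbitrary choices for illustration; published skewing functions are more careful, and nothing here comes from Mill material.

```python
SETS = 16  # hypothetical number of sets per way


def index_way0(line_addr):
    # Conventional indexing: low-order bits of the line address.
    return line_addr % SETS


def index_way1(line_addr):
    # Skewed indexing: fold higher address bits into the index with XOR.
    return (line_addr ^ (line_addr // SETS)) % SETS


conflicting = [0, 16, 32]  # line addresses that share a set in way 0
way0_sets = [index_way0(a) for a in conflicting]
way1_sets = [index_way1(a) for a in conflicting]
```

All three lines map to set 0 in way 0 but to three different sets in way 1, so a two-way skewed cache can hold all of them where a conventional two-way set could hold only two. The hash uses address bits above the conventional index, which is easier when those bits are available without waiting for translation.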

> Back around 2017 they were talking big about having a simulator and
> working on an fpga implementation. As far as I can tell absolutely
> none of that ever happened, at least in public, and it has all gone
> very quiet for 3 years or so.
>
> If it looks like a turkey, and it gobbles like a turkey ...

Since Mill Computing was previously entirely funded internally by the founders and the founders' time was not an expense, the early slow development (much of which was before public announcements) is somewhat understandable. There was also significant redirection of effort from prioritizing patents (due to first-to-file policy) and from compiler front-and-middle-end changes (initially a commercial compilation system was used, then LLVM was adopted). I get the impression that even with external funding, staffing has not exploded. Sweat equity was used for some work, but I got the impression that such did not double development speed. I also get the impression that there is a lot of architectural exploration work expected still; compiler developments will interact with general hardware architecture design, so progress should be expected to be somewhat slow. (Not so long ago, a comp.arch post mentioned that early simulations gave the strange result that an L1 instruction cache was not useful, at least for latency. This implies that significant exploration is still being done.)

I object more to the lack of reference to VLIW precedents. E.g., deferred and pick-up loads are very similar to advanced loads of Itanium and using predication to improve static scheduling under wide issue comes from early VLIW work. Part of this I suspect comes from ignorance (the principals have admitted to having reinvented concepts, wishing that they had known of them beforehand), but part of it probably also comes from marketing: revealing similarities to Itanium would discourage funding.

The claims of ten-fold power-performance-area benefit based on DSP efficiency applied to general purpose code also seem problematic. This is not helped by not specifying same-performance, same-area, or same-power; a conventional architecture might achieve ten-times better PPA by halving performance.

Architecturally, I am disappointed by the lack of speculation and lack of interest in multithreaded cores (static scheduling constrains but does not prohibit such). I also think that insufficient attention is being paid to the communication problem within the core and between threads.



scheduled delay: Technically, the delay can be somewhat dynamic by using an operation to mark when the load must materialize (called a "pick-up load") and by exploiting the fact that a function call saves and restores pending loads so that time spent in such function calls is not counted in the scheduled delay. Even an ordinary ("deferred", i.e., explicit delay) load can allow some limited extra latency hiding/memory-level parallelism since stall cycles due to a late load result return from a load initiated before (or in parallel with) a second load are not counted in the scheduled delay of the second load.
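The overlap described in this note can be checked with a toy model. It assumes that a late load freezes the whole schedule until its data arrives, that memory keeps working during the freeze, and that loads are given as (issue cycle, scheduled delay, actual latency) tuples; the function name and the numbers are hypothetical.

```python
def total_stalls(loads):
    """Total stall cycles for statically scheduled loads.

    loads: iterable of (issue_cycle, scheduled_delay, actual_latency),
    with issue_cycle counted in schedule cycles (stalls excluded).
    """
    stalls = []  # (schedule cycle at which the stall occurred, cycles)
    for issue, delay, latency in sorted(loads, key=lambda l: l[0] + l[1]):
        retire = issue + delay
        # Wall-clock time advances during stalls; schedule time does not.
        stalled_before_issue = sum(c for at, c in stalls if at < issue)
        stalled_before_retire = sum(c for _, c in stalls)
        ready_wall = issue + stalled_before_issue + latency
        retire_wall = retire + stalled_before_retire
        wait = max(0, ready_wall - retire_wall)
        if wait:
            stalls.append((retire, wait))
    return sum(c for _, c in stalls)


both = total_stalls([(0, 3, 10), (1, 4, 8)])
separately = total_stalls([(0, 3, 10)]) + total_stalls([(1, 4, 8)])
```

In this example the first load stalls the machine for 7 cycles, and the second load's data arrives during that stall, so it adds none of its own: 7 stall cycles in total rather than the 11 that the two loads would cost in isolation.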

effectiveness: I am making a distinction from performance since the lower physical design effort implies a significant performance penalty. Comparing to a design with similar customization for a more conventional architecture, the greater width of a Mill implementation would provide greater performance on workloads with very high ILP. If "general" workloads have good enough effectiveness (80%? 90%? relative to aggressive out-of-order) the advantage in high ILP workloads may be attractive when highly customized physical design is not practical — e.g., specific acceleration functions are included in the core. Even ARM does not provide mostly hardened/optimized designs with communication and decode hooks to an accelerator block; probably in part because many uses of acceleration can be managed as memory-mapped I/O (latency and bandwidth of communication with the core is often not extremely critical).