My take on the Mill (hopefully using more conventional terminology)

By: Maynard Handley (name99.delete@this.name99.org), April 19, 2015 11:13 am
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on April 18, 2015 1:06 am wrote:
> After going through most of the Mill's available documentation and mulling over it for a while (no
> pun intended) I think I've figured out how it's supposed to work. The following is mostly speculation
> on my side but it aligns with the data points we have fairly well so I hope it's correct.
>
> The Mill appears to be an unconventional dynamic dataflow machine made of an execution core, scratchpad
> memories as well as dedicated machinery used to deal with load/store operations and instruction fetch.
>
> The execution core is made of multiple execution units both single-cycle (ALUs, AGUs) and multi-cycle
> (FPUs, etc...). The execution units are tied together by a programmable forwarding network that can
> drive the outputs of a unit to the inputs of another one although with certain model-specific limitations.
> This network also supports fan-out from a single unit and is fed externally from a scratchpad memory
> which it can also write its outputs to (i.e. the spiller). The connection between the scratchpad memory
> and the execution core can be quite wide in larger models as all operands are read and written in parallel.
> Outputs of the execution core can be fed to the inputs for the next instruction, or can go to external
> components (load/store machinery, instruction fetch, more on this later).
>
> An instruction is essentially a configuration for the execution core forwarding network plus L/S operations.
> From a compiler POV an instruction can be seen as a basic block with data dependencies and one or more
> exit points (depending on the control flow). Once loaded it can be executed multiple times by keeping
> the operands flowing through the EUs (which should be rather effective for executing small loops). The
> execution core cannot execute branches per se (though being very wide it can be used for predication);
> rather after all conditional operations in an instruction have executed it outputs the address of the
> next instruction. If an instruction contains only one exit point this will be the address of the next
> instruction, if it had multiple ones then the address is predicted so that fetching the following instruction
> can start earlier. The same mechanism is probably reused to implement indirect branches.
>
> Different operations within the same instruction can execute on the same EU in different
> cycles. This is what the Mill's documentation refers to as phasing and which most people
> considered to be caused by skewed pipelines. It's rather caused by the fact that EUs
> can be linked in a chain thus creating a pipeline within the execution core.
>
> Load/store operations are dealt with outside of the execution core. As mentioned in the documentation
> the core only contains AGUs whose outputs are sent to a piece of external machinery that is responsible
> for executing loads and stores and dealing with their variable-latency nature. Loads can be executed
> early and their results held in a temporary buffer to be fed to the execution core when an instruction
> requires them. This arrangement explains in my eyes why the Mill doesn't have conventional virtual memory
> addressing. Having it would require a TLB-like structure which is also variable-latency by nature and
> couldn't be integrated easily in a data-flow execution core. If the result of a load is not available
> when a dependent instruction is expected to run then the entire execution core is stalled waiting for
> it. The load/store machinery (unit?) is thus decoupled from the execution core (and from the AGUs, something
> that was mentioned multiple times) and operates asynchronously from it.
>
> Now, if the Mill works similarly to how I described it above it should be extremely good at running DSP/streaming
> codes and possibly do so at a significant advantage in perf/W over a comparable VLIW processor. This is made
> possible by the fact that the core is fundamentally just a bunch of EUs, there's no register file to read
> from or write to, no scoreboarding, etc... The hardware at its core should be truly very simple.
>
> For general purpose codes it's going to be a different story. It will live and die by the ability of
> the compiler to remove as much control flow as possible (hello hours long, memory churning, PGO-based,
> LTO builds with aggressive inlining and devirtualization). Its performance will also be heavily dependent
> on how effective the load/store machinery is at handling multiple operations in parallel as well as
> the ability of the compiler to generate static MLP and feed it with enough operations before the execution
> core is forced to stall. Anyway I wouldn't want to iterate over linked lists with it.
>
> This leads to the point that was raised multiple times in previous discussion: it's truly strange
> that such an unconventional architecture doesn't already have a compiler. That's the very first
> thing that should have been written and it's no wonder that adapting existing compilers is proving
> to be hard. The late phases of their back-ends are useless for the Mill (register allocation, spill/fill
> generation, scheduling) and are even counterproductive in that they throw away data-flow information
> which is the key piece of information one needs to produce Mill instructions.

How about an even simpler analysis?
The most interesting and important part of Mill is the (I think correct) intuition that most register data is temporary: it's read off the bypass network, goes through an EU, and the result is again read off the bypass network. As such, the machinery that exists for giving it a name is pointless overhead, and the ideal situation would be an ISA that did not require giving these sorts of temporaries a name, with all that implies.
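To make the "most register data is temporary" intuition concrete, here is a rough sketch of how one might measure it on an instruction trace. Everything in it is an assumption for illustration: the `Op` record, the SSA-style value ids, and the `bypass_window` parameter, which is only a crude stand-in for "still readable off the bypass network". It is not how the Mill or any real core decides anything.

```python
from collections import namedtuple

# Hypothetical trace record: the value id an op produces and the ids it reads.
# Value ids are assumed to be SSA-like (each id is produced exactly once).
Op = namedtuple("Op", ["dest", "srcs"])

def classify_values(trace, bypass_window=2):
    """Split produced values into (temporary, long_lived) sets.

    A value counts as temporary if its last read happens at most
    `bypass_window` ops after the op that produced it. Values that are
    never read have distance 0 and so also count as temporary.
    """
    produced_at = {}
    last_use = {}
    for i, op in enumerate(trace):
        for s in op.srcs:
            if s in produced_at:
                last_use[s] = i
        produced_at[op.dest] = i

    temporary, long_lived = set(), set()
    for value, produced in produced_at.items():
        distance = last_use.get(value, produced) - produced
        (temporary if distance <= bypass_window else long_lived).add(value)
    return temporary, long_lived

# A short dependency chain in which only "a" is read long after it was made.
trace = [
    Op("a", []),
    Op("b", ["a"]),
    Op("c", ["b"]),
    Op("d", ["c"]),
    Op("e", ["d", "a"]),
]
temporary, long_lived = classify_values(trace)
```

In this toy trace only `a` would need a name; the rest live and die within a couple of ops of being produced.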

The questions, then, are:
(a) how can you do this?
(b) how do you deal with the fraction of register data that is NOT of this form?
There is also a third (implementation, rather than architectural) issue of what you do when an interrupt occurs. It seems like you could just drain in-flight instructions, but you have the problem of all those successor instructions that were expecting to pull their data off the bypass network.
An apparently similar (but perhaps quite different) problem is that of maintaining backup state across speculation points.
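One conceivable shape of an answer to (a) and (b), sketched as a toy model: results go into a short fixed-length queue and are referenced by position (how recently they were produced) rather than by name, while the minority of long-lived values are explicitly copied out to named slots. The class and method names here (`PositionalOperands`, `spill`, `fill`) are hypothetical and chosen for illustration; this is a sketch of the idea under those assumptions, not a description of the Mill's actual hardware, and it says nothing about the interrupt or speculation problem.

```python
from collections import deque

class PositionalOperands:
    """Toy model: un-named temporaries addressed by position, plus named slots."""

    def __init__(self, length=8):
        # Position 0 is the newest result; the oldest falls off the far end.
        self.queue = deque(maxlen=length)
        # Named backing storage for the values that must outlive the queue.
        self.scratch = {}

    def push(self, value):
        self.queue.appendleft(value)

    def read(self, pos):
        return self.queue[pos]

    def add(self, pos_a, pos_b):
        # No destination name: the result simply becomes position 0.
        self.push(self.read(pos_a) + self.read(pos_b))

    def spill(self, pos, slot):
        self.scratch[slot] = self.read(pos)

    def fill(self, slot):
        self.push(self.scratch[slot])

m = PositionalOperands(length=4)
m.push(3)
m.push(4)
m.spill(1, "x")   # the value 3 is needed much later, so give it a name
m.add(0, 1)       # 4 + 3
m.add(0, 0)       # 7 + 7
m.add(0, 1)       # 14 + 7; the original 3 has now fallen off the queue
m.fill("x")       # bring the named copy back as the newest value
m.add(0, 1)       # 3 + 21
result = m.read(0)
```

The point of the toy is the asymmetry: the common case (`add`) names nothing, and only the uncommon long-lived value pays for a name via `spill` and `fill`.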

Registers (i.e. NAMED registers) are useful enough that one suspects the ISA should provide them, and at least 8 of them. But I think that treating un-named data as something accessed more or less implicitly is a conceptually important insight. The question is: can you use this insight to improve the core while still
- providing some named registers
- dealing with interrupts and speculation?
Put differently: given that you want to provide some named registers, and that you want some sort of background storage to handle interrupts and speculation, does the win (some ISA bits, maybe smaller remap structures) from not visibly naming most of your in-flight data buy you enough to be worth boiling the ocean (new compiler ideas, new micro-architectural ideas)?
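The "some ISA bits" part of that trade-off can be put in rough numbers. The figures below are assumptions picked purely for illustration (32 named registers with an explicit destination, versus an 8-entry window of implicit positions with no destination field); they are not taken from any Mill specification.

```python
def operand_bits(num_names, sources, has_dest):
    """Bits spent on operand specifiers in one instruction."""
    bits_per_field = (num_names - 1).bit_length()  # ceil(log2(num_names))
    fields = sources + (1 if has_dest else 0)
    return bits_per_field * fields

# Conventional three-operand form: two sources and a destination out of 32 names.
named = operand_bits(num_names=32, sources=2, has_dest=True)

# Position-addressed form: two sources out of 8 recent results, implicit destination.
positional = operand_bits(num_names=8, sources=2, has_dest=False)

saved_per_op = named - positional
```

Under those assumed parameters the saving is 9 bits per operation, which is real but has to be weighed against everything else on the list.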