My take on the Mill (hopefully using more conventional terminology)

By: Gabriele Svelto (gabriele.svelto.delete@this.gmail.com), April 18, 2015 12:06 am
Room: Moderated Discussions
After going through most of the Mill's available documentation and mulling it over for a while (no pun intended) I think I've figured out how it's supposed to work. The following is mostly speculation on my part, but it aligns fairly well with the data points we have, so I hope it's correct.

The Mill appears to be an unconventional dynamic dataflow machine made up of an execution core, scratchpad memories, and dedicated machinery used to deal with load/store operations and instruction fetch.

The execution core is made up of multiple execution units, both single-cycle (ALUs, AGUs) and multi-cycle (FPUs, etc.). The execution units are tied together by a programmable forwarding network that can drive the outputs of one unit to the inputs of another, albeit with certain model-specific limitations. This network also supports fan-out from a single unit, and is fed externally from a scratchpad memory to which it can also write its outputs (i.e. the spiller). The connection between the scratchpad memory and the execution core can be quite wide in larger models, as all operands are read and written in parallel. Outputs of the execution core can be fed to the inputs of the next instruction, or can go to external components (load/store machinery, instruction fetch; more on this later).
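
To pin down what a "programmable forwarding network configuration" could look like, here is a toy data structure. It's entirely my own invention (the names, widths and encoding are made up), just meant to show the kind of per-cycle routing information such a network would need:

    /* Toy model of one cycle's routing: for each EU input, say where its
     * operand comes from. Purely illustrative; not the Mill's encoding. */
    enum src_kind { FROM_EU, FROM_SCRATCHPAD, FROM_LOAD_BUFFER };

    struct operand_route {
        enum src_kind kind;
        unsigned      index;      /* which EU output / slot to read */
    };

    struct core_config {
        struct operand_route input[8][2];   /* 8 EUs, 2 operands each */
        unsigned             spill_out[4];  /* EU outputs written to the
                                               scratchpad this cycle   */
    };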

An instruction is essentially a configuration for the execution core's forwarding network, plus L/S operations. From a compiler POV an instruction can be seen as a basic block with data dependencies and one or more exit points (depending on the control flow). Once loaded it can be executed multiple times by keeping the operands flowing through the EUs (which should be rather effective for executing small loops). The execution core cannot execute branches per se (though, being very wide, it can be used for predication); rather, after all conditional operations in an instruction have executed, it outputs the address of the next instruction. If an instruction contains only one exit point this is simply the address of the next instruction; if it has multiple ones, the address is predicted so that fetching the following instruction can start earlier. The same mechanism is probably reused to implement indirect branches.
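
As a concrete (and heavily hedged) example of the "instruction as a basic block" view, take a trivial C reduction loop. Under the model above its body could become a single wide instruction: load, add, increment and compare all mapped onto EUs wired together by the forwarding network, with the compare result selecting the next-instruction address (loop top or fall-through) instead of driving a conventional branch unit:

    #include <stddef.h>

    /* The loop body below is the sort of thing that could be kept
     * resident and re-executed while operands keep flowing through
     * the EUs, as described above. */
    long sum(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];    /* AGU -> load machinery -> ALU each round */
        return s;
    }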

Different operations within the same instruction can execute on the same EU in different cycles. This is what the Mill's documentation refers to as phasing, and which most people assumed to be caused by skewed pipelines. It is instead caused by the fact that EUs can be linked in a chain, creating a pipeline within the execution core.
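
A hypothetical illustration of phasing read this way: both operations below could be encoded in the same wide instruction, with a single-cycle ALU producing the add in one cycle and a second unit consuming the forwarded result in the next, no register write in between:

    /* Two chained operations, one instruction, two "phases" (my reading,
     * not official terminology beyond the word "phasing" itself). */
    long add_then_mul(long a, long b, long c)
    {
        long t = a + b;   /* phase 1: single-cycle ALU                  */
        return t * c;     /* phase 2: consumes t via the forwarding net */
    }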

Load/store operations are dealt with outside of the execution core. As mentioned in the documentation the core only contains AGUs, whose outputs are sent to a piece of external machinery that is responsible for executing loads and stores and dealing with their variable-latency nature. Loads can be executed early and their results held in a temporary buffer, to be fed to the execution core when an instruction requires them. This arrangement explains, in my eyes, why the Mill doesn't have conventional virtual memory addressing: having it would require a TLB-like structure, which is also variable-latency by nature and couldn't be integrated easily into a data-flow execution core. If the result of a load is not available when a dependent instruction is expected to run, then the entire execution core stalls waiting for it. The load/store machinery (unit?) is thus decoupled from the execution core (and from the AGUs, something that was mentioned multiple times) and operates asynchronously from it.
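
Here is a source-level sketch of how a compiler might exploit that decoupled load machinery; it is purely illustrative (ordinary software pipelining written by hand), but it shows the shape of code that keeps the temporary load buffer filled ahead of the consumers:

    #include <stddef.h>

    /* Issue the load for iteration i+1 while iteration i computes, so the
     * result is already sitting in the load buffer when the add needs it
     * and the core doesn't have to stall. */
    long sum_pipelined(const long *a, size_t n)
    {
        if (n == 0)
            return 0;
        long s = 0;
        long next = a[0];             /* load hoisted ahead of its use */
        for (size_t i = 0; i + 1 < n; i++) {
            long cur = next;
            next = a[i + 1];          /* early load for the next round */
            s += cur;                 /* consumes a buffered value     */
        }
        return s + next;
    }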

Now, if the Mill works similarly to how I described it above, it should be extremely good at running DSP/streaming codes, and possibly do so at a significant advantage in perf/W over a comparable VLIW processor. This is made possible by the fact that the core is fundamentally just a bunch of EUs: there's no register file to read from or write to, no scoreboarding, etc. The hardware at its core should be truly very simple.

For general-purpose codes it's going to be a different story. The Mill will live and die by the ability of the compiler to remove as much control flow as possible (hello hours-long, memory-churning, PGO-based LTO builds with aggressive inlining and devirtualization). Its performance will also be heavily dependent on how effective the load/store machinery is at handling multiple operations in parallel, as well as on the ability of the compiler to expose static MLP and feed that machinery with enough operations before the execution core is forced to stall. Anyway, I wouldn't want to iterate over linked lists with it.
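
The linked-list case shows why: every load's address comes out of the previous load, so there is no static MLP for the decoupled load machinery to run ahead on, and under this model the core would stall once per node:

    /* Pointer chasing as the worst case for this design (illustrative). */
    struct node { long v; struct node *next; };

    long sum_list(const struct node *p)
    {
        long s = 0;
        while (p) {
            s += p->v;      /* value load                                */
            p = p->next;    /* next address known only after this load;
                               nothing independent to overlap it with    */
        }
        return s;
    }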

This leads to the point that was raised multiple times in previous discussions: it's truly strange that such an unconventional architecture doesn't already have a compiler. That's the very first thing that should have been written, and it's no wonder that adapting existing compilers is proving to be hard. The late phases of their back-ends (register allocation, spill/fill generation, scheduling) are useless for the Mill, and are even counterproductive in that they throw away data-flow information, which is the key piece of information one needs to produce Mill instructions.