My take on the Mill (hopefully using more conventional terminology)

By: Ivan Godard (ivan.delete@this.millcomputing.com), April 20, 2015 5:08 pm
Room: Moderated Discussions
This is an interesting take on the architecture, and generally a valid way to view it from a hardware perspective. We usually describe the architecture differently, as seen by software; rather few programmers know what a bypass network is :-) Detailed comments are interlinear below.

Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on April 18, 2015 1:06 am wrote:
> After going through most of the Mill's available documentation and mulling over it for a while (no
> pun intended) I think I've figured out how it's supposed to work. The following is mostly speculation
> on my side but it aligns with the data points we have fairly well so I hope it's correct.
>
> The Mill appears to be an unconventional dynamic dataflow machine made of an execution core, scratchpad
> memories as well as dedicated machinery used to deal with load/store operations and instruction fetch.
>

At that level the Mill is very SOC-like - lots of quasi-independent and quasi-asynchronous pieces, all hooked together so as to keep flows going without buffering.

> The execution core is made of multiple execution units both single-cycle (ALUs, AGUs) and multi-cycle
> (FPUs, etc...). The execution units are tied together by a programmable forwarding network that can
> drive the outputs of a unit to the inputs of another one although with certain model-specific limitations.
> This network also supports fan-out from a single unit and is fed externally from a scratchpad memory
> which it can also write its outputs to (i.e. the spiller). The connection between the scratchpad memory
> and the execution core can be quite wide in larger models as all operands are read and written in parallel.
> Outputs of the execution core can be fed to the inputs for the next instruction, or can go to external
> components (load/store machinery, instruction fetch, more on this later).

The spiller and the scratchpad are separate units. The spiller only handles save/restore of state across call-like operations, whereas the scratchpad is a general read-write SRAM akin to a register file, used for long-lived data.

> An instruction is essentially a configuration for the execution core forwarding network plus L/S operations.

Plus control-flow operations.

> From a compiler POV an instruction can be seen as a basic block with data dependencies and one or more
> exit points (depending on the control flow). Once loaded it can be executed multiple times by keeping
> the operands flowing through the EUs (which should be rather effective for executing small loops). The
> execution core cannot execute branches per se (though being very wide it can be used for predication);
> rather after all conditional operations in an instruction have executed it outputs the address of the
> next instruction. If an instruction contains only one exit point this will be the address of the next
> instruction, if it had multiple ones then the address is predicted so that fetching the following instruction
> can start earlier. The same mechanism is probably reused to implement indirect branches.

Here you have veered from the actual architecture. The core does execute branches, potentially several of them in an instruction; there's a rule that determines the winner if more than one evaluates to taken, but otherwise branches are rather conventional. Branches are not predicted taken/untaken; instead there is in effect a BTB entry for each Extended Basic Block (EBB), which gives the predicted exit target. It does not matter how many branches are in the block's instructions: we predict where (and when) control will leave the block, without caring how that happens.
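Roughly, in software terms (the table layout, field names, and direct-mapped indexing below are purely illustrative, not the actual hardware):

#include <stddef.h>
#include <stdint.h>

#define EXIT_TABLE_SIZE 4096

/* One predictor entry per Extended Basic Block, not per branch. */
typedef struct {
    uint64_t ebb_entry;    /* entry address of the EBB being predicted  */
    uint64_t exit_target;  /* where control is predicted to go next     */
    uint32_t exit_cycle;   /* when (which instruction) the exit happens */
} exit_entry;

static exit_entry exit_table[EXIT_TABLE_SIZE];

/* Look up the predicted exit for an EBB; returns NULL if no prediction. */
static const exit_entry *predict_exit(uint64_t ebb_addr)
{
    const exit_entry *e = &exit_table[(ebb_addr >> 4) % EXIT_TABLE_SIZE];
    return (e->ebb_entry == ebb_addr) ? e : NULL;
}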

> Different operations within the same instruction can execute on the same EU in different
> cycles. This is what the Mill's documentation refers to as phasing and which most people
> considered to be caused by skewed pipelines. It's rather caused by the fact that EUs
> can be linked in a chain thus creating a pipeline within the execution core.

Yes: instructions (can) contain short dataflows rather than individual actions.
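A trivial source-level example of such a dataflow (the real encoding is per-slot Mill code; this is just the shape of it):

/* One short dataflow: the add executes in an earlier phase and the
   multiply consumes its result in a later phase of the same instruction. */
int chained(int a, int b, int c)
{
    int t = a + b;
    return t * c;
}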

> Load/store operations are dealt with outside of the execution core. As mentioned in the documentation
> the core only contains AGUs whose outputs are sent to a piece of external machinery that is responsible
> for executing loads and stores and dealing with their variable-latency nature. Loads can be executed
> early and their results held in a temporary buffer to be fed to the execution core when an instruction
> requires them. This arrangement explains in my eyes why the Mill doesn't have conventional virtual memory
> addressing. Having it would require a TLB-like structure which is also variable-latency by nature and
> couldn't be integrated easily in a data-flow execution core. If the result of a load is not available
> when a dependent instruction is expected to run then the entire execution core is stalled waiting for
> it. The load/store machinery (unit?) is thus decoupled from the execution core (and from the AGUs, something
> that was mentioned multiple times) and operates asynchronously from it.

There is a TLB-like structure that handles protection, the PLB. However, the check can be done in parallel with the load, rather than in front of it as in a conventional TLB, because Mill caches are accessed by virtual address in a single shared address space; translation happens below the cache hierarchy, so the cache access does not wait on it.

> Now, if the Mill works similarly to how I described it above it should be extremely good at running DSP/streaming
> codes and possibly do so at a significant advantage in perf/W over a comparable VLIW processor. This is made
> possible by the fact that the core is fundamentally just a bunch of EUs, there's no register file to read
> from or write to, no scoreboarding, etc... The hardware at its core should be truly very simple.


It is :-)


> For general purpose codes it's going to be a different story. It will live and die by the ability of
> the compiler to remove as much control flow as possible (hello hours long, memory churning, PGO-based,
> LTO builds with aggressive inlining and devirtualization).

Not really. The Mill uses control-flow prediction like any modern machine, and can expect the same prediction quality (for any given predictor algorithm) as other CPUs, without special compilation. The major difference is the cost when the predictor gets it wrong: five cycles (when the cache already contains the correct target), rather than the 15-20 cycles typical elsewhere.
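To put rough numbers on that (the 5% miss rate is just an assumption for illustration; the penalties are the figures above):

/* Expected cycles lost per branch to misprediction; illustrative numbers only. */
#include <stdio.h>

int main(void)
{
    double miss_rate       = 0.05;  /* assumed 5% mispredict rate          */
    double mill_penalty    = 5.0;   /* cycles, target already in cache     */
    double typical_penalty = 17.0;  /* cycles, middle of the 15-20 range   */

    printf("Mill:    %.2f cycles lost per branch\n", miss_rate * mill_penalty);
    printf("Typical: %.2f cycles lost per branch\n", miss_rate * typical_penalty);
    return 0;
}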

The machine is very wide, which lends itself to speculative execution down multiple paths while replacing control flow with if-conversion. The same methods are routine today on any architecture that supports predication, a select op, or a conditional move, all of which are common; the Mill differs only in the degree to which you can trade performance against the power cost of speculation. LTO and similar methods will have the same advantages - and drawbacks - as on other CPUs, and are certainly not required on a Mill.
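In source terms the transformation looks like this (ordinary C; nothing Mill-specific about it):

/* Branchy form: control flow decides which expression executes. */
int branchy(int a, int b, int flag)
{
    if (flag)
        return a + b;
    else
        return a - b;
}

/* If-converted form: both paths execute speculatively and a single
   select/conditional-move picks the result - no branch to predict. */
int if_converted(int a, int b, int flag)
{
    int sum  = a + b;
    int diff = a - b;
    return flag ? sum : diff;
}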

> Its performance will also be heavily dependent
> on how effective is the load/store machinery at handling multiple operations in parallel as well as
> the ability of the compiler to generate static MLP and feed it with enough operations before the execution
> core is forced to stall. Anyway I wouldn't want to iterate over linked lists with it.

MLP is an issue for any architecture, especially when there are chained memory dependencies as in a linked list. The Mill will be slow on a list, but no slower than anybody else - we all need to know the next address before we can fetch from it, so there's no MLP for anyone. In structures where MLP is possible (say, iterating over an array), the Mill can keep the memory pipes full, although only part of the mechanism for doing that has been publicly described; the rest awaits the filing of patents.
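The two cases, in plain C (nothing Mill-specific here either):

#include <stddef.h>

struct node { long value; struct node *next; };

/* No MLP possible: each load address depends on the previous load,
   so the loads serialize on every machine. */
long sum_list(const struct node *p)
{
    long s = 0;
    while (p) {
        s += p->value;
        p = p->next;    /* must complete before the next load can start */
    }
    return s;
}

/* MLP available: every load address is known up front, so the memory
   pipes can be kept full by issuing loads well ahead of their use. */
long sum_array(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];      /* independent loads can overlap */
    return s;
}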


> This leads to the point that was raised multiple times in previous discussion: it's truly strange
> that such an unconventional architecture doesn't already have a compiler. That's the very first
> thing that should have been written and it's no wonder that adapting existing compilers is proving
> to be hard. The late phases of their back-ends are useless for the Mill (register allocation, spill/fill
> generation, scheduling) and are even counterproductive in that they throw away data-flow information
> which is the key piece of information one needs to produce Mill instructions.

LLVM didn't exist when we started the Mill, and gcc was pretty much unusable, so we rolled our own using the EDG front end (also used by ICC and many other compilers, and a terrific product we highly recommend) with our own middle and back end. We got it to the point of handling simple things, despite having to track a *rapidly* moving target, but realized that we would be better off in an established ecosystem once the machine design stabilized and a mostly-suitable one became available. We selected LLVM and are hard at work on the port; we can again handle simple things, but the effort is non-trivial.