By: Ronald Maas (rmaas.delete@this.wiwo.nl), April 19, 2015 9:53 am
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on April 18, 2015 1:06 am wrote:
> After going through most of the Mill's available documentation and mulling over it for a while (no
> pun intended) I think I've figured out how it's supposed to work. The following is mostly speculation
> on my side but it aligns with the data points we have fairly well so I hope it's correct.
>
> The Mill appears to be an unconventional dynamic dataflow machine made of an execution core, scratchpad
> memories as well as dedicated machinery used to deal with load/store operations and instruction fetch.
>
> The execution core is made of multiple execution units both single-cycle (ALUs, AGUs) and multi-cycle
> (FPUs, etc...). The execution units are tied together by a programmable forwarding network that can
> drive the outputs of a unit to the inputs of another one although with certain model-specific limitations.
> This network also supports fan-out from a single unit and is fed externally from a scratchpad memory
> which it can also write its outputs to (i.e. the spiller). The connection between the scratchpad memory
> and the execution core can be quite wide in larger models as all operands are read and written in parallel.
> Outputs of the execution core can be fed to the inputs for the next instruction, or can go to external
> components (load/store machinery, instruction fetch, more on this later).
>
> An instruction is essentially a configuration for the execution core forwarding network plus L/S operations.
> From a compiler POV an instruction can be seen as a basic block with data dependencies and one or more
> exit points (depending on the control flow). Once loaded it can be executed multiple times by keeping
> the operands flowing through the EUs (which should be rather effective for executing small loops). The
> execution core cannot execute branches per se (though being very wide it can be used for predication);
> rather after all conditional operations in an instruction have executed it outputs the address of the
> next instruction. If an instruction contains only one exit point this will be the address of the next
> instruction; if it has multiple ones then the address is predicted so that fetching the following instruction
> can start earlier. The same mechanism is probably reused to implement indirect branches.
>
> Different operations within the same instruction can execute on the same EU in different
> cycles. This is what the Mill's documentation refers to as phasing and which most people
> considered to be caused by skewed pipelines. It's rather caused by the fact that EUs
> can be linked in a chain thus creating a pipeline within the execution core.
>
> Load/store operations are dealt with outside of the execution core. As mentioned in the documentation
> the core only contains AGUs whose outputs are sent to a piece of external machinery that is responsible
> for executing loads and stores and dealing with their variable-latency nature. Loads can be executed
> early and their results held in a temporary buffer to be fed to the execution core when an instruction
> requires them. This arrangement explains in my eyes why the Mill doesn't have conventional virtual memory
> addressing. Having it would require a TLB-like structure which is also variable-latency by nature and
> couldn't be integrated easily in a data-flow execution core. If the result of a load is not available
> when a dependent instruction is expected to run then the entire execution core is stalled waiting for
> it. The load/store machinery (unit?) is thus decoupled from the execution core (and from the AGUs, something
> that was mentioned multiple times) and operates asynchronously from it.
>
> Now, if the Mill works similarly to how I described it above it should be extremely good at running DSP/streaming
> codes and possibly do so at a significant advantage in perf/W over a comparable VLIW processor. This is made
> possible by the fact that the core is fundamentally just a bunch of EUs, there's no register file to read
> from or write to, no scoreboarding, etc... The hardware at its core should be truly very simple.
>
> For general purpose codes it's going to be a different story. It will live and die by the ability of
> the compiler to remove as much control flow as possible (hello hours long, memory churning, PGO-based,
> LTO builds with aggressive inlining and devirtualization). Its performance will also be heavily dependent
> on how effective the load/store machinery is at handling multiple operations in parallel as well as
> the ability of the compiler to generate static MLP and feed it with enough operations before the execution
> core is forced to stall. Anyway I wouldn't want to iterate over linked lists with it.
>
> This leads to the point that was raised multiple times in previous discussions: it's truly strange
> that such an unconventional architecture doesn't already have a compiler. That's the very first
> thing that should have been written and it's no wonder that adapting existing compilers is proving
> to be hard. The late phases of their back-ends are useless for the Mill (register allocation, spill/fill
> generation, scheduling) and are even counterproductive in that they throw away data-flow information
> which is the key piece of information one needs to produce Mill instructions.
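If that reading is right, here is a toy model of the instruction-as-basic-block idea above. It is entirely my own invention, not anything from the Mill documentation: an instruction is a basic block in dataflow form (operations listed in dependency order) plus a set of exit points, one of which yields the next-instruction address once all the conditions have been computed.

def run_instruction(ops, exits, env):
    # 'env' stands in for the forwarding network carrying operands
    # between execution units; each op reads its producers by name.
    for dest, fn, srcs in ops:
        env[dest] = fn(*(env[s] for s in srcs))
    # After all conditional operations have executed, the first exit
    # whose condition holds supplies the address of the next instruction.
    for cond, target in exits:
        if env[cond]:
            return target

# Example: t0 = a + b; t1 = t0 < 100; go to 0x400 if t1, else 0x500.
next_addr = run_instruction(
    ops=[("t0", lambda a, b: a + b, ("a", "b")),
         ("t1", lambda x: x < 100, ("t0",))],
    exits=[("t1", 0x400), ("always", 0x500)],
    env={"a": 3, "b": 4, "always": True},
)
print(hex(next_addr))  # 0x400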
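And to put rough numbers on the linked-list worry, with a completely made-up miss latency: in a list walk each load address is the previous load's result, so no compiler (or hardware) can overlap the misses, while with an array the addresses are known up front and independent loads can be hoisted.

MISS_LATENCY = 100  # cycles, invented purely for illustration

def list_walk_cycles(n):
    # Each next pointer is only known when the previous load returns,
    # so the misses are fully serialized.
    return n * MISS_LATENCY

def array_walk_cycles(n, loads_in_flight=8):
    # Compiler-generated static MLP: addresses are base + i * stride,
    # so up to loads_in_flight misses can overlap per stall.
    batches = (n + loads_in_flight - 1) // loads_in_flight
    return batches * MISS_LATENCY

print(list_walk_cycles(1000))   # 100000
print(array_walk_cycles(1000))  # 12500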
I agree with Ivan Godard's observation that even the most advanced traditional processor cores spend only a fraction of their transistors and energy on actually useful work: calculations, data moves, etc.
But I think with a different approach he would have a far better chance of success:
1) As you mentioned in your post, there is no compiler for the Mill. A while ago I asked him about it in this forum, and he answered that his team lacked the bandwidth to spend much effort on building a compiler (or adapting GCC/LLVM). If he built a software model of the Mill and, at the same time, the compiler and profiling tools needed, he would be able to test effectively which ideas work and which don't (see the first sketch after this list). There is literally a ton of existing open source code available that can be used as input to improve the design where needed.
2) I believe having live profiling data is essential to extracting more parallelism from existing code. So maybe a two-pronged approach is needed: a first-level compiler that translates the high-level language to some intermediate machine representation, and then (like NVIDIA's Denver) a second-level compiler running on the processor itself that dynamically translates the intermediate representation into the native code actually executed by the hardware (see the second sketch after this list). There may be many other ideas worth pursuing, but without an iterative approach, everything Mill-related will always be a shot in the dark.
3) Ivan Godard would be much better off putting all his work and ideas in the public domain, trying to establish a community that can help him achieve his goals without the obligation to pay any salaries. For example, RISC-V started about 3 years ago and they already have most of the basic building blocks in place. We are not living in the 1970s anymore, when a small team could successfully launch a processor like the 6502. You really need thousands of people and deep pockets to successfully launch a new ISA and make some money from it.
4) Ivan Godard mentioned he wants to target a broad range of performance levels, all the way from low-end embedded to the high-end. There is no way anyone can compete successfully on the low end with companies like Allwinner and MediaTek, which make a healthy profit selling quad-core 64-bit SoCs for 5 dollars apiece. Better to concentrate on the high end, where achieving high IPC is going to be appreciated.
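On point 1, here is a very rough illustration of what that model-plus-compiler feedback loop could look like. The "simulator" below is a stand-in I made up that just charges cycles against an issue-width limit; it is not any real Mill tool, and the workload numbers are invented:

def simulate(ops_per_instruction, n_alus):
    # Toy cost model: each instruction issues at most n_alus ops per cycle.
    cycles = 0
    for op_count in ops_per_instruction:
        cycles += (op_count + n_alus - 1) // n_alus
    return cycles

workload = [3, 8, 1, 12, 6, 2]  # ops per instruction, made up
for n_alus in (2, 4, 8):
    print(n_alus, "ALUs ->", simulate(workload, n_alus), "cycles")

Run something like this over a corpus of compiled open source code and the data, not intuition, decides which design point pays for its area.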
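On point 2, a toy of the two-level scheme, loosely in the spirit of Denver's dynamic translation but in no way NVIDIA's actual design: interpret a tiny invented IR, count block executions, and translate hot blocks once into a cached closure that stands in for native code.

from collections import Counter

HOT = 3              # invented hotness threshold
counts = Counter()
native_cache = {}

def interpret(block, env):
    for dst, a, b in block:   # toy IR: every op is dst = a + b
        env[dst] = env[a] + env[b]

def translate(block):
    # "Second-level compiler": a real one would schedule and optimize
    # using the profile; here it just specializes the block into a closure.
    def native(env):
        for dst, a, b in block:
            env[dst] = env[a] + env[b]
    return native

def run(block_id, block, env):
    counts[block_id] += 1
    if block_id not in native_cache and counts[block_id] >= HOT:
        native_cache[block_id] = translate(block)
    if block_id in native_cache:
        native_cache[block_id](env)
    else:
        interpret(block, env)

env = {"x": 1, "y": 2, "z": 0}
for _ in range(5):
    run("b0", [("z", "x", "y")], env)  # translated after the 3rd run
print(env["z"])  # 3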
Just some thoughts
Ronald