By: Gabriele Svelto (gabriele.svelto.delete@this.gmail.com), April 22, 2015 12:47 am
Room: Moderated Discussions
Ivan Godard (ivan.delete@this.millcomputing.com) on April 20, 2015 5:08 pm wrote:
> At that level the Mill is very SOC-like - lots of quasi-independent and quasi-asynchronous
> pieces, all hooked together so as to keep flows going without buffering.
That's interesting to know, though I wonder what consequences this has on your timing budget; more on this later.
> The spiller and the scratchpad are separate units. The spiller only handles save/restore
> of state across call-like operations, whereas the scratchpad is a general read-write
> SRAM akin to a register file, used for long-lived data.
Knowing that the scratchpad is generic is an interesting data point, and it will probably open up some useful possibilities on the compiler side (I assume it will never be used explicitly; scratchpads and other explicitly-managed caches are not easy to use).
> Here you have veered from the actual architecture. The core does execute branches, potentially
> several of them in an instruction; there's a rule which determines the winner if more than one
> evaluates to taken, but otherwise branches are rather conventional.
Well, it is my understanding that the execution core itself is pure dataflow, so there's no difference between a conditional and a branch, provided that instruction fetch is handled outside of it. But maybe I'm not understanding this part very clearly.
> Branches are not predicted
> taken/untaken; instead there is in effect a BTB entry for each Extended Basic Block, which gives
> the predicted exit target. It does not matter how many branches are in the block's instructions,
> we predict where (and when) control will leave the block without caring how that happens.
Yeah, my assumption was that the execution of an instruction would yield a single target address, which is the exit point. An instruction containing divergent control flow might have more than one, and "executing the branches" inside of it effectively chooses the right address and sends it to the instruction-fetch machinery.
> Yes: instructions (can) contain short dataflows rather than individual actions.
It is a dataflow machine after all :)
> There is a TLB-like structure that handles protection, the PLB. However checking can be
> done in parallel with the load, rather than in front of it as in a conventional TLB.
Yeah, what I meant was that conventional paged virtual memory would have required virtual-to-physical translation in the AGUs, and that in turn would have required a TLB. If I understand it correctly, the PLB offers access control but doesn't do paged virtual-to-physical mapping.
> It is :-)
I was almost tempted to describe the execution core as "ALUs, muxes and latches" because it really sounds that way.
> Not really. The Mill uses control-flow prediction like any modern machine, and can expect
> to get the same prediction quality (for any given predictor algorithm) as other CPUs, without
> special compilation. The major difference is the cost when the predictor gets it wrong: five
> cycles (when the cache contains the correct target), rather than the 15-20 typical.
But the Mill is very wide, so you're wasting far more execution resources during those cycles than a conventional machine would: if the figures I've seen are right, a big Mill is said to issue up to ~33 operations per cycle, so 5 wasted cycles are on the order of 165 issue slots, versus 60-80 for a 4-wide conventional core stalling for 15-20 cycles.
> The machine is very wide, which lends itself to speculative execution down multiple execution paths while replacing
> control flow with if-conversion. The same methods are routine today when the architecture supports predication,
> a selection op, or a conditional move, all of which are common; the Mill differs only in the degree that you
> can trade performance against the power consumption of speculation.
One interesting aspect of the Mill is that predication is probably less costly than on other architectures. On a conventional machine, a predicated instruction whose result is not used still wastes power to fetch, decode, dispatch, execute and retire (plus everything that goes along with tracking it: register fetches, renaming, etc.). In your case you should pay almost only the cost of fetching and executing it.
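To make the if-conversion concrete, here's a minimal C sketch of my own (the function names are invented for illustration): the second version replaces the branch with a select that most compilers turn into a conditional move, trading always-executed work for the absence of mispredictions.

    /* Branchy form: a mispredict costs a pipeline flush, which is what
       the 5-versus-15-20-cycle comparison above is about. */
    int sum_pos_branchy(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] > 0)
                sum += a[i];
        }
        return sum;
    }

    /* If-converted form: both outcomes are computed and a select picks
       the result. No control flow, so nothing to mispredict, at the
       price of always doing the work - the speculation power cost
       mentioned above. */
    int sum_pos_selected(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int v = a[i];
            sum += (v > 0) ? v : 0;  /* typically compiles to cmov/select */
        }
        return sum;
    }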
> LTO and similar methods will have the same
> advantages - and drawbacks - as other CPUs, and are certainly not required on a Mill.
I don't think so: the Mill is far too wide for client code, which is very irregular and offers little ILP. Besides, the Mill doesn't have a way to execute anything in the shadow of a cache miss (if the relevant advanced load couldn't be scheduled early enough, which will very often be the case; more on this later too).
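Just to illustrate what "scheduling the advanced load early enough" means in practice - this is my own toy example, not anything from the Mill material - the idea is that the compiler hoists the load well above its use so that independent work can execute in the shadow of a potential miss:

    int shadowed(int *p, int a, int b) {
        int v = *p;         /* load issued as early as possible */
        int t = a * b + a;  /* independent work overlaps the miss latency */
        return v + t;       /* use of the loaded value comes last */
    }

When the load can't be hoisted that far (an address computed late, an intervening store that might alias, a function-call boundary), there's nothing left to overlap with, which is the case I'm worried about.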
> MLP is an issue for any architecture, especially when there are chained memory dependencies as in a linked
> list. The Mill will be slow on a list, but no slower than anybody else - we all need to know the next
> address before we can fetch from it, so there's no MLP for anyone.
That's not exactly true. An OoOE machine with the amount of reordering of load/store operations that modern desktop and server cores have can walk more than one irregular structure at the same time. Let's say you're retrieving elements from multiple short lists or trees (the kind of thing that happens practically continuously in a browser or in a compiler). There's no compiler-exploitable parallelism in such code, because of the pointer-to-pointer dependencies and possibly the control flow too; yet an OoOE machine that predicts the branches correctly can start walking those structures in parallel, even taking multiple misses before stopping, provided it can disambiguate the memory operations in the presence of stores.
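A hypothetical example of the pattern I have in mind (invented names, plain C):

    struct node { struct node *next; int key; };

    /* Within each list every load depends on the previous one, so there
       is no MLP to extract from a single chain. But the two chains are
       independent: an OoO core that predicts the loop branch can have a
       miss from each chain outstanding at the same time, while a machine
       that must resolve each load before issuing the next walks them
       back to back. */
    int count_both(const struct node *a, const struct node *b) {
        int n = 0;
        while (a && b) {
            a = a->next;
            b = b->next;
            n += 2;
        }
        while (a) { a = a->next; n++; }
        while (b) { b = b->next; n++; }
        return n;
    }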
> In structures where there is the possibility
> of MLP (say iterating over an array) then the Mill can keep the memory pipes full, although only part
> of the mechanism to do that has been publicly described; the rest awaits filing patents.
I suppose all prefetching techniques - both traditional and advanced - can be reused on the Mill.
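For instance, plain software prefetching should carry over unchanged wherever future addresses are computable in advance. A sketch, assuming the GCC/Clang __builtin_prefetch builtin; the lookahead distance of 8 is an illustrative guess, not a tuned value:

    long sum_indirect(int *const *ptrs, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            /* Prefetch the element needed 8 iterations from now:
               rw = 0 (read), locality = 1 (low temporal locality). */
            if (i + 8 < n)
                __builtin_prefetch(ptrs[i + 8], 0, 1);
            sum += *ptrs[i];
        }
        return sum;
    }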
> LLVM didn't exist when we started the Mill, and gcc was pretty much unusable, so we rolled our own using
> the EDG front end (also used by ICC and many other compilers, and a terrific product we highly recommend)
> and our own middle and back end. We got it to the point of handling simple things, despite having to track
> a *rapidly* moving target, but realized that we would be better off in an ecosystem once the machine design
> stabilized enough and mostly-suitable ecosystems became available. We selected LLVM, and are hard at work
> on the port, and are again able to handle simple things, but the effort is non-trivial.
I'm not fully aware of the timeline of the Mill's design, but again, iterating rapidly on a hardware design without a compiler to test it is still quite surprising to me. I'd imagine that most of the iterations in the design of a completely new architecture would happen in a simulator, in response to executing real code traces. In your case it sounds like your team was more hardware-heavy and most of the design happened there. Historically, that often didn't prove to be a good thing.