By: Jacob Marley (jmarley123.delete@this.hotmail.com), April 20, 2017 10:14 pm
Room: Moderated Discussions
RichardC (tich.delete@this.pobox.com) on April 20, 2017 1:10 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on April 20, 2017 12:15 pm wrote:
>
> > I'm not sure what your argument is here.
> > First you argue that OoO is not a large percentage of power consumption which is wrong.
>
> OoO is not a large percentage of power consumption on vectorizable (AVX2) code, because
> that does a heck of a lot of OPS per register-rename and per instruction. And non-
> vectorizable code has either data dependencies or unpredictable control flow which will
> be (ahem) challenging for the Mill. On the stuff the Mill *might* do fast, it won't be
> saving much power relative to an AVX2-capable cpu; on the stuff where it might save power,
> it will be doing so by going slow.
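To make the ops-per-instruction point concrete, here is a sketch (my own illustrative code, not from anyone's post): a loop with independent iterations and no control-flow hazards, which a compiler targeting AVX2 (e.g. -O2 -mavx2) will typically turn into 256-bit operations producing 8 float results per instruction, so the per-instruction rename/dispatch cost is paid once per 8 results.

```c
#include <stddef.h>

/* SAXPY: y[i] = a*x[i] + y[i]. Independent iterations, predictable
 * control flow -- ideal vectorization target. One AVX2 FMA covers
 * 8 floats, amortizing the OoO bookkeeping done per instruction. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On this kind of code the OoO machinery's share of total power is small, which is exactly the argument being made above.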
>
> > Now you agree that it costs power, but because it's the best way to get performance
> > right now it is somehow impossible that any other approach could work?
>
> No, just that the Mill's approach of static scheduling was tried in the 1990s and
> failed badly, and everything that has changed since then has been in a direction which
> favors OoO over static-scheduling. So this approach is a loser.
>
> There could be some other approach, maybe. But it's a problem that lots of smart people
> have worked on for 25+ years, with the prospect of billions of dollars for success. So the
> smart bet is that the refinement of OoO will continue to prevail until the technological
> constraints change the game.
> >
> > > > Regardless of process improvements and clock/power gating bringing down absolute power consumption,
> > > > which can also be applied to any other architecture by the way, OoOE hardware still uses
> > > > quite a bit of power. Megol's point was that scaling didn't suddenly improve; if you want
> > > > to go wider, that power goes up fast. Sure, we can deal with more of that thanks to aforementioned
> > > > improvements, but eliminating it altogether would be a huge improvement.
>
> Yes, but who said you want to go wider for general-purpose computing ? If the program
> doesn't have enough ILP, going wider is just a waste of transistors and power. And many
> programs don't (even after speculative OoO has mined as much parallelism as it can find).
>
> > "would appear to be" is the important part. It's not like they haven't tried to address
> > the shortcomings of static scheduling. I don't think anyone can accurately predict
> > how well it'll work until we've seen an FPGA implementation run actual code.
>
> I can't "accurately predict" its performance. Just that it's going to lose out to AVX2
> on vectorizable code, and it's also going to lose out to x86 OoO on anything which
> traverses data structures (hashtables, trees) or has a lot of unpredictable branching
> (which includes almost anything written in an object-oriented style).
How is an OoO core better at virtual function calls, hash tables, or trees?
Don't all of these come down to waiting on two memory accesses, where you can't start the second until the first has completed?
If the tricks they have described (single-instruction function calls, unrolled/parallelized loops, and tail recursion) work, performance should be close to OoO (IF the cache/memory hierarchies hold up) with significantly lower power.
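The dependent-access case I mean looks like this sketch (illustrative names, not from any real codebase): walking a hash-bucket chain, where the address of each load comes out of the previous load, so no core, OoO or statically scheduled, can issue load N+1 before load N returns.

```c
#include <stddef.h>

struct node {
    int key;
    int value;
    struct node *next;
};

/* Walk a chain: each 'n->next' load is data-dependent on the last.
 * The serialized load-to-load latency, not issue width, sets the
 * speed here for ANY microarchitecture. */
int chain_lookup(const struct node *n, int key, int not_found) {
    while (n) {
        if (n->key == key)
            return n->value;
        n = n->next;  /* must complete before the next compare can start */
    }
    return not_found;
}
```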
From the talks given up to now, I still have a cautiously open mind.