By: RichardC (tich.delete@this.pobox.com), April 19, 2017 1:22 pm
Room: Moderated Discussions
Megol (golem960.delete@this.gmail.com) on April 19, 2017 10:02 am wrote:
> For your claim to be correct there have to be new revolutionary physics involved that can make
> O(n^2) structures scale like O(n). Because the other optimizations that are used can also be
> used in a processor that doesn't use OoO execution - only making them more efficient.
No. My claim isn't about scaling with n - it's about the absolute power consumption in current
chips, and whether the parts of the power consumption that the Mill claims to eliminate or
reduce really add up to a large percentage, on the kind of code that the Mill appears to be
suited to.
> The Mill attempts to remove explicit register files, expose the bypass networks. That alone
> (if it works) will reduce power consumption for data delivery.
Removing *explicit* register files isn't the same as removing the power consumption of physical
register files. The idea that static scheduling will magically make everything appear in the right
place at the right time without ever needing any registers is - well, I spent 12 years working
on Ikos/Mentor's FPGA-based simulator, a refrigerator-sized box full of FPGAs running
a completely statically scheduled model, and you end up needing a heck of a lot of registers all over
the place to make it work - so I'm not buying that justification. Plus the global static-scheduling
problem was decidedly time-consuming.
> There are no instruction schedulers,
> ROB or load-store queues of the type used in OoO processors.
If it doesn't have load-store queues, it's doomed. The memory hierarchy isn't what it used to
be in 1990, and the Mill is going to have to deal with that if it ever gets into hardware.
> There are chances that the cache
> design will reduce power too, however the public details aren't enough to tell ATM. Branch prediction
> is claimed to be simplified as it uses exit prediction of larger blocks.
I'll bet $20 that "simpler" ends up meaning "worse", and bad predictions are going to hurt a lot.
> Now it is reasonable to be skeptical of the claims about the Mill (as history has shown that even enthusiastic
> and skilled designers with a lot of cash available can fail to make statically scheduled high-ILP designs work
> well in practice) but the reality is that standard processors aren't efficient. They are more efficient than
> designs not geared towards power consumption, sure, but the fundamental limits are still there.
I think that's only true if you ignore the long-known reality that most of computing is about
moving data around and storing it, not processing it. The supposed "inefficiency" of an OoO
cpu is about taking care to get the right data to the right place as early as possible subject
to dependencies and resource constraints. And the alternative is to *not* get the right data to
the right place as early as possible, which might well use less power, but also translates
directly to running slower.
The gearbox in my minivan is "inefficient", in that it "wastes" some power. But if you take it
out and connect the engine direct to the wheels, you don't get a faster car. It's necessary,
and there isn't a better alternative (yet).
>
> > And the evolution of core clockspeeds and DRAM latency has made MLP
> > more important than in 2003 - and it seems fairly clear that OoO can
> > give more MLP across a wider range of code than the static-scheduled Mill.
>
> Why? Standard OoO processors are actually not especially MLP friendly, they aren't designed to be.
Yes they are. They will issue multiple memory loads as early as possible, and even speculatively,
giving a considerable amount of MLP even in code which hasn't been carefully written to expose MLP.
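A toy C sketch (not from the Mill or any particular core) of why OoO buys MLP on ordinary code: in the first loop each load's address depends on the previous load's data, so the misses serialize no matter how clever the hardware is; in the second, the addresses are independent, so an OoO core can have many misses in flight at once.

```c
#include <stddef.h>

/* Dependent pointer chase: the address of each load comes from the
   data of the previous load, so even an aggressive OoO core can only
   have one cache miss in flight at a time -- no MLP available. */
typedef struct node { struct node *next; long val; } node;

long sum_chase(const node *p) {
    long s = 0;
    while (p) {            /* each iteration waits on the prior load */
        s += p->val;
        p = p->next;
    }
    return s;
}

/* Independent indexed loads: a[i] doesn't depend on any earlier load's
   data, so an OoO core can issue many of these early and speculatively,
   overlapping the misses -- MLP without the code being written for it. */
long sum_index(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both loops compute a sum; the difference is purely in how many of the loads the hardware is allowed to overlap.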
>
> > > Even modern power-optimized processors expend a lot of power for doing calculations using very
> > > little power. Whether the Mill can change that is unclear but at least they are trying.
> >
> > I'm very skeptical about whether that's true. If you're using SIMD fused-multiply-add
> > on AVX2, you're getting 16 single-precision FLOPS for each instruction's worth of
> > OoO gizmology, and I'd expect the power usage relating to renaming and other OoO
> > stuff is insignificant. If it's AVX-512, then that's 32 SP FLOPS.
>
> Of course it is true. The majority of power is wasted on overheads that come from all directions;
> doing a floating point operation is cheap power-wise - feeding the FP unit isn't.
But it isn't "overhead" to be storing data in fast memory close to the FPU (whether you call it
"registers" or "scratchpad") and/or loading operands from memory and storing results somewhere,
all of which are essential to doing an FP calculation, and will be reproduced in some very
similar form in the Mill.
> And you think that the overheads suddenly evaporate when running AVX2 code? Only
> the instruction fetch and scheduling overheads are reduced while data feed overheads
> are increased in addition to the consumption of the extra FP units.
Yes, but the data movement is an unavoidable part of doing the computation, and the Mill
doesn't make it any easier or more efficient. In cranking-hard AVX2 code, the power usage
related to the OoO bookkeeping for register renaming and the ROB is small compared to the power
used in (unavoidably) moving data to/from the SIMD FPUs, doing the operations, and moving
data back. Because you've got a heck of a lot of data, and a heck of a lot of FLOPS, for each
instruction and for each register.
Hypothetically, if the OoO gizmology were costing you 80% of the power on non-SIMD 64-bit code,
then on AVX-512 code the ratio would go from 80:20 to 80:160, and the OoO stuff would only be
33% of the power. [But I bet it wasn't as high as 80% to start with.]
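The back-of-envelope arithmetic can be written down as a one-liner; the 80/20 split is the hypothetical from the argument above, not a measurement from any chip.

```c
/* Fraction of total power going to OoO bookkeeping, given a fixed
   per-instruction bookkeeping cost and some amount of useful work per
   instruction. The 80/20 numbers are this post's hypothetical, not
   measured data. */
double ooo_fraction(double ooo_cost, double useful_work) {
    return ooo_cost / (ooo_cost + useful_work);
}
```

With the hypothetical numbers: scalar 64-bit code gives ooo_fraction(80, 20) = 0.80, and AVX-512 doing 8x the useful work per instruction gives ooo_fraction(80, 160) = 80/240, i.e. one third, matching the 80:20 vs 80:160 ratios above.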