By: RichardC (tich.delete@this.pobox.com), April 17, 2017 8:29 am
Room: Moderated Discussions
The other way of looking at this is that any CPU has chunks of hardware
which function in a data-independent, cycle-predictable way, and within
a domain of completely cycle-predictable hardware you can do static
scheduling of everything.
In an early-1980s RISC, everything was cycle-predictable except for
the memory hierarchy (cache hit vs miss and/or DRAM locality) and
conditional branches, so static-scheduled in-order worked ok, and all
hardware in the cycle-predictable domain (registers, ALU, control logic)
was stalled when necessary.
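To make "static scheduling against known latencies" concrete, here's a rough sketch
in C, with the schedule described in comments; the 1-cycle load-use delay is an
assumption for illustration, not a claim about any particular machine:

/* Illustrative sketch: the kind of schedule an early-1980s RISC compiler
   could rely on, given a fixed 1-cycle load-use delay. */
long sum_pairs(const long *a, const long *b, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        long x = a[i];   /* load issued here...                        */
        long y = b[i];   /* ...an independent load fills the delay     */
        s += x + y;      /* slot, so x is ready when the add needs it. */
    }
    return s;
}
/* As long as every load is a predictable hit, this static schedule never
   stalls; the moment a load can miss, the fixed schedule is wrong and the
   in-order machine has to stall everything instead. */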
In a modern machine with OoO, there are many different caches with
not-cycle-predictable behavior (TLBs, L1, L2, maybe shared L3, multiple
DRAM channels) due to cache hits/misses and contention with other cores
and other system traffic. So the cycle-predictable chunks are small
(individual ALUs and FPUs) and there's a lot of dynamic-scheduling glue
to tie them together and avoid stalling critical resources whenever
possible.
The Mill (as far as I can tell) chooses to have a very large cycle-predictable
domain encompassing the belt, multiple ALUs/EUs, scratchpad SRAM, L1
(and in a multi-core system, possibly multiple copies of all that stuff
synchronized as a single cycle-predictable domain).
The problem is that when anything at the edge of that large cycle-predictable
domain hits a speed bump, the whole cycle-predictable domain has to be stalled.
And "speed bump" would seem to include L1 misses, and any kind of overflow or
resource-exhaustion condition.
Now as you put more and more stuff into that single cycle-predictable domain,
each cycle of stall becomes more and more expensive. And as the Mill
is trying to widen all resources, it's also trying to do more and more
potentially-stall-causing operations on each cycle.
This leads to an inevitable trainwreck on any problem where the miss-rate
on potentially-stall-causing operations is more than minuscule.
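To put rough numbers on that, here's the back-of-the-envelope stall model I have
in mind; the widths, miss rates, memory-op fraction and miss penalty below are all
made-up illustrative assumptions, not anything measured on a Mill:

#include <stdio.h>

/* Toy model: a W-wide machine issues W ops per cycle, a fraction FRAC of
   them can miss, each miss stalls the whole machine for PENALTY cycles,
   and misses occur at rate MISS. Then the expected cost of one issue
   cycle is 1 + W*FRAC*MISS*PENALTY cycles. */
static double effective_ipc(double width, double frac, double miss, double penalty)
{
    return width / (1.0 + width * frac * miss * penalty);
}

int main(void)
{
    const double frac = 0.3, penalty = 20.0;        /* assumed values */
    const double widths[] = { 4.0, 16.0, 32.0 };
    const double misses[] = { 0.001, 0.01, 0.05 };
    for (int w = 0; w < 3; w++)
        for (int m = 0; m < 3; m++) {
            double ipc = effective_ipc(widths[w], frac, misses[m], penalty);
            printf("width %2.0f  miss rate %.3f -> IPC %5.2f (%3.0f%% of peak)\n",
                   widths[w], misses[m], ipc, 100.0 * ipc / widths[w]);
        }
    return 0;
}

In this toy model the wide machine throws away a much larger fraction of its peak
than the narrow one at the same miss rate, which is the whole point.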
Now I daresay they'll try to keep that under control to some extent by allowing
multiple misses to be handled in parallel. But since different operations
can miss by a different number of cycles, at that point you end up having to
build much of a dynamic-scheduled-OoO-but-keeps-in-order-semantics mechanism
(probably including everything they were trying to avoid, such as renaming,
but for "load buffers" rather than "registers"). Which then poses the question
of whether a massively complicated reinvention of the static-scheduled compiler
actually bought you anything at all, since you ended up with a complicated compiler
*and* hardware OoO, and on each side it's still half-assed compared to conventional
OoO (except in the - not at all general - case where miss rates are tiny).
But it gets worse. Because anything with reconvergent control flow (if-then-else)
has not-cycle-predictable behavior - unless you pessimize it to always take the
time of the longest-possible branch. And the reconvergent control flow may be
hidden inside a procedure call. Or arbitrarily deep inside a recursive procedure
call, or mutually recursive procedure calls. How does the Mill feel about recursive
calls to manipulate arbitrarily-deep lists and trees? Not good, I'd bet. Or callbacks
from precompiled libraries into user-supplied observer methods/functions?
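To make the reconvergence problem concrete, here's roughly what I mean; the helper
function and its cost are invented purely for the example:

/* Illustrative only. */
static double expensive_fixup(double x)   /* stands in for a long arm: many ops */
{
    return (x * 1.000001 - 3.0) / (x + 7.0);
}

double update(double x, int rare_case)
{
    if (rare_case)
        return expensive_fixup(x);   /* long arm                */
    return x + 1.0;                  /* short arm: a single add */
}
/* To keep the schedule cycle-exact you either if-convert/speculate both arms
   and always pay for the long one, or you take a real branch and the
   reconvergence point stops being at a statically known cycle. */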
And then the programs which *do* exhibit the kind of highly-predictable low-miss-rate
behavior the Mill wants are largely in scientific computing - but for those a) since the
original wave of VLIWs, x86 has acquired fast/wide SIMD floating-point, so that it can
go about as fast as the DRAM bandwidth and FPU power allow; and b) the Mill claims a
big power efficiency win, but I doubt that can materialize on programs where the
FPUs are taking a big fraction of the power.
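For concreteness, this is the sort of kernel I mean: a plain daxpy-shaped C loop.
With the usual optimization flags (e.g. -O3 -march=native) mainstream compilers
will typically auto-vectorize it to AVX2/AVX-512 and run it at whatever DRAM
bandwidth and FPU throughput allow:

/* Illustrative kernel only. */
void daxpy(long n, double a, const double *restrict x, double *restrict y)
{
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}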
So integer programs are too branchy and pointer-chasy to fit the Mill; and floating-point
programs work fine (bounded by FPU throughput and efficiency) on x86/AVX2/AVX-512 or
GPGPUs. What is the Mill *for* (even if it works, which I greatly doubt)?