By: Heikki Kultala (heikki.kultala.delete@this.tuni.fi), November 7, 2019 8:39 am
Room: Moderated Discussions
G. Boniface (boniface.delete@this.example.edu) on November 7, 2019 5:59 am wrote:
> anon.1 (abc.delete@this.def.com) on November 6, 2019 11:19 am wrote:
> > Seems like the RISC-V
> > folks recommend op-fusion, which is another thing I find ridiculous. The whole point of RISC was to make
> > decode simple. Now they want to add complexity in decode because, well, the ISA is oversimplified. Take
> > that idea further and a uopcache is the next logical step because you can't sustain dispatch bandwidth or
> > add extra pipe stages for fusion (it's not magic pixie dust, transistor gates have to be spent). Madness.
>
> Actually, it's not madness. RISC-V is very explicitly architected to allow a wide range of
> implementations, from "classic RISC" single-issue in-order pipelines to aggressive OoO.
>
> The claim made (which I cannot verify first-hand, but seems plausible) is that once
> you've spent the transistors on the complex dependency tracking required to implement
> superscalar execution, then op-fusion comes at minimal incremental cost.
This sounds bogus, at least for OoOE processors.
The dependency tracking for superscalar execution happens in the BACKEND, in the OoOE execution engine.
The dependency tracking for op fusion happens in the FRONTEND.
They cannot reasonably share any of the same hardware.
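To make the frontend's job concrete, here is an illustration of my own (not taken from the RISC-V fusion proposals, though slli+add is one of the pairs they commonly mention): a plain array access and the rough RV64 instruction pair a fusing decoder would have to spot while the instructions are still sitting side by side in the decode group.

    /* Plain C array access. Compilers typically lower a[i] on RV64 to a
     * shift of the index, an add, and a load; the comments sketch the
     * rough instruction sequence (exact register allocation varies). */
    long element(const long *a, long i)
    {
        return a[i];   /* slli a1, a1, 3    # i * sizeof(long)  */
                       /* add  a1, a0, a1   # &a[i]             */
                       /* ld   a0, 0(a1)    # load a[i]         */
    }

A fusing frontend has to recognize the slli+add (or add+ld) pair inside the decode packet and emit a single macro-op. The backend's rename and wakeup logic never sees the unfused pattern, which is exactly why that tracking hardware cannot be reused for fusion.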
> The apparently contradictory claims can both be true because they apply to different
> points on the cost-performance spectrum. If you want simple decoding and a tight transistor
> budget, RISC-V (without op fusion) has it. If you want high performance and have a
> correspondingly lavish transistor budget, RISC-V (with op fusion) has it.
No user ever really wants "simple decoding". "Simple decoding" is only a means to some other goal the user may actually want, like small area or good power efficiency.
But in order to achieve this "simple decoding", RISC-V gives up other things that would help reach those same goals. Aiming for "simple decoding" as an end in itself is simply a bad and stupid trade-off.
And no, RISC-V with op fusion still won't reach the performance of 64-bit ARM (AArch64).
> While op fusion requires additional pipeline stages, the claim is that they are the same
> stages as are required for superscalar execution anyway, so the incremental cost is low.
Totally bogus claim. Op fusion needs an additional stage in the FRONTEND, while superscalar execution adds its stage(s) in the backend.
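To show why that is frontend work, here is a minimal sketch of my own (simplified pseudocode, not any real core's logic) of the pair-matching a fusing decoder has to add. This scan across neighbouring decode slots is where the extra stage or extra logic depth comes from:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical simplified decode record -- not any real core's RTL. */
    typedef struct {
        uint8_t opcode;        /* e.g. OP_SLLI, OP_ADD (simplified)      */
        uint8_t rd, rs1, rs2;  /* architectural register numbers         */
        bool    fused;         /* true when absorbed into the prior op   */
    } decoded_op;

    enum { OP_SLLI = 1, OP_ADD = 2 };

    /* A slli+add pair is fusible when the add consumes and overwrites the
     * slli result, so the pair acts like one shift-and-add macro-op. */
    static bool fusible_pair(const decoded_op *a, const decoded_op *b)
    {
        return a->opcode == OP_SLLI && b->opcode == OP_ADD &&
               (b->rs1 == a->rd || b->rs2 == a->rd) && b->rd == a->rd;
    }

    /* Scan a decode group for adjacent fusible pairs.  This happens
     * before rename; the backend never sees the unfused pair. */
    void fuse_group(decoded_op *ops, int n)
    {
        for (int i = 0; i + 1 < n; i++) {
            if (!ops[i].fused && fusible_pair(&ops[i], &ops[i + 1]))
                ops[i + 1].fused = true;
        }
    }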
> (On a tangent, I'm reminded of Linus's criticism of weak memory models on the grounds that once you've
> paid the cost for a high-performance memory subsystem, then stronger memory models come at minimal
> additional cost. So weaker models burden the programmer with no genuine compensating performance
> benefit because they only help performance on inherently low-performance implementations.)
Op fusion is a totally different thing from memory ordering.
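The difference is easy to see in code: the memory model is part of the contract the programmer sees, while op fusion never is. A minimal sketch of the classic message-passing pattern with C11 atomics (my own textbook-style example, nothing specific to RISC-V or to Linus's argument):

    #include <stdatomic.h>

    long payload;
    atomic_int ready;

    /* Producer: the release store may not be reordered ahead of the
     * payload write.  On a weakly ordered machine this costs a fence or
     * a store-release instruction; under TSO a plain store already has
     * the needed ordering. */
    void produce(long v)
    {
        payload = v;
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    /* Consumer: the acquire load pairs with the release store, so the
     * payload is guaranteed visible once ready reads as 1. */
    long consume(void)
    {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;   /* spin until the flag is published */
        return payload;
    }

Whether a fusing decoder glues two instructions together is invisible here; whether the memory model is weak or strong decides what ordering instructions the compiler has to emit. The two issues simply do not trade off against each other.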