By: Maynard Handley (name99.delete@this.name99.org), August 17, 2014 11:24 am
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2014 7:17 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 6:35 pm wrote:
> > Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2014 5:43 pm wrote:
> >
> > > There's no support for 286 mode. 286 protected mode, fortunately, did not carry on to the 80386.
> > >
> >
> > So are you saying that I could not just run OS/2 on a modern Intel CPU? When did
> > that become true? Your answer suggests that it was true even with the 386, but that
> > surely can't be right. Didn't IBM have OS/2 running on the 386 based PS/2's?
>
> OS/2 ran on a 386, but not by treating it as a 286. It had different code paths for the 2 cases.
>
> DOS/Windows, of course, had 3: 8086, 80286 and 80386.
>
> > > In its use of destructive operations, requiring extra mov reg,reg operations
> > > (sometimes lots of them), which take energy and execution resources on every x86 CPU.
> > > Again, that penalty can be greatly reduced by eliminating them at
> > > rename (Ivy Bridge onwards, next AMD high end cores too IIRC).
> >
> > I'm surprised it took till IB to do this. I'd have thought it would have come in at Nehalem, if not sooner.
> > FWIW Cyclone also does this (and since there's a register that's kinda/sorta dedicated to being
> > zero, it can also set to zero at rename. Obviously x86 has its preferred idiom for zeroing
> > which is recognized by the decoder, but I don't know if it's also handled at rename.)
>
> You're mixing up things.
> One is the zeroing idioms like xor eax, eax. Those have been supported for ages and they're common in OoO CPUs.
>
> Another thing is the conversion of sequences such as
> "mov ebx, eax; add edx, eax; add ecx, ebx" into "add edx'', edx', eax'; add ecx'', ecx',
> eax'" at the rename stage (the ' and '' indicate physical renamed registers).
> No RISC does this, AFAIK, because pure and simply there's no need, unless the compiler is brain dead.
>
> >
> > Intel have, to some extent, worked around the register problem with op fusion. I've mentioned that IBM have
>
> Fused or not, load-op reduces the issue of having few architectural registers.
>
> Fusion helps with reducing the number of µops which need to be tracked in the OoO machinery.
> And having load-op in the ISA does make it (much) easier to perform such fusion.
I ran a number of related items together; that's not the same thing as mixing them up. I thought my points were obvious, but I guess I need to be more explicit.
My points were:
(a) x86 as an ISA suffers less from its lack of LOGICAL registers than it would if it were a load-store architecture, because mem+op instructions utilize an implicit register.
(b) x86, when implemented using fused operations, suffers less from its lack of PHYSICAL registers than it would if it were a load-store architecture, because the implicit LOGICAL (explicit PHYSICAL) register used by mem+op instructions has the POTENTIAL to be released immediately after use, rather than being locked up via the ROB until the fused instruction completes. (In the ideal case no register would even need to be allocated: the operation could simply be delayed by the appropriate number of cycles and the value grabbed off the bypass network. However, that would require replaying the instruction after a cache miss, which could be messy.)
This sort of thing --- the ability to release registers early --- is the primary win in the idea of generic mini-graph processing. As I said, the point of those ideas is to give generic RISC architectures the wins that come from op fusion in a way that is backward and forward compatible, much more general than the few patterns that Intel uses, and (in the most general case) able to be optimized by the compiler for each piece of code, rather than restricted to the patterns the core implementer thought were most useful.
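To put rough numbers on point (b), here is a toy Python sketch comparing how long the temporary register stays allocated under three release policies. Everything in it is an assumption for illustration: the cycle numbers are made up, and `cycles_held` is a hypothetical helper, not a model of any real core.

```python
def cycles_held(alloc_cycle, release_cycle):
    """Number of cycles a physical register stays allocated."""
    return release_cycle - alloc_cycle


# Hypothetical timeline (cycle numbers are invented for illustration only).
RENAME = 0                # temporary register allocated at rename
OP_CONSUMES_TEMP = 5      # the op half of mem+op reads the loaded value
FUSED_OP_RETIRES = 12     # the fused mem+op leaves the ROB
NEXT_WRITER_RETIRES = 30  # load-store: a later write to the same logical
                          # register retires, freeing the old mapping

early_release = cycles_held(RENAME, OP_CONSUMES_TEMP)
held_until_retire = cycles_held(RENAME, FUSED_OP_RETIRES)
load_store = cycles_held(RENAME, NEXT_WRITER_RETIRES)

print(early_release, held_until_retire, load_store)
```

The only thing the sketch is meant to show is the ordering: releasing right after use holds the register for the shortest time, holding it until the fused op retires is worse, and a named logical register in a load-store sequence is worst because it stays live until something overwrites it.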
(c) Cyclone DOES have zero cycle movs and zeroing.
From the LLVM code we have:
/// Cyclone has register move instructions which are "free".
def FeatureZCRegMove : SubtargetFeature;
/// Cyclone has instructions which zero registers for "free".
def FeatureZCZeroing : SubtargetFeature;
I don't know what you thought I meant by this, but my interpretation of what Cyclone is doing is that movs (and, as I said, zeroing as a variant of mov) are handled completely at the rename stage. After the rename, the op is done.
Now there are details as to EXACTLY how this might be done, and we don't know those. Obviously a dispatch slot and an execute stage are saved, which helps with power.
Possibly the instruction doesn't go into the issue queue, which means the issue queue is effectively slightly larger.
Depending on exactly how rollback is performed in the case of a branch misprediction, the instruction may also not need a slot in the ROB, meaning it can be completely discarded at rename.
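To make the rename-stage handling concrete, here is a toy Python sketch of a rename table with move elimination and zeroing. It's an illustration under stated assumptions, not a description of what Cyclone or Ivy Bridge actually do: the `Renamer` class, the reference counts, and the hard-wired zero physical register are all hypothetical, and the old mapping is released at rename time here, whereas real hardware would have to wait until retirement.

```python
from collections import deque


class Renamer:
    """Toy rename stage with zero-cycle mov and zeroing (hypothetical model)."""

    ZERO = 0  # assumption: physical register 0 always reads as zero

    def __init__(self, arch_regs, num_phys):
        self.refcount = [0] * num_phys
        self.free = deque(range(1, num_phys))  # FIFO free list
        self.issued = []  # ops that still need an issue-queue entry
        self.map = {r: self._alloc() for r in arch_regs}

    def _alloc(self):
        p = self.free.popleft()
        self.refcount[p] = 1
        return p

    def _release(self, p):
        if p == self.ZERO:
            return
        self.refcount[p] -= 1
        if self.refcount[p] == 0:
            self.free.append(p)

    def _point(self, dst, p):
        """Remap dst to an existing physical register; nothing is issued."""
        old = self.map[dst]
        if p != self.ZERO:
            self.refcount[p] += 1
        self.map[dst] = p
        self._release(old)  # simplification: real hardware frees at retire

    def rename(self, op, dst, *srcs):
        if op == "mov":
            self._point(dst, self.map[srcs[0]])
        elif op == "zero":
            self._point(dst, self.ZERO)
        else:
            phys_srcs = tuple(self.map[s] for s in srcs)
            old, new = self.map[dst], self._alloc()
            self.map[dst] = new
            self._release(old)
            self.issued.append((op, new) + phys_srcs)


# Ricardo's sequence: mov ebx, eax; add edx, eax; add ecx, ebx
r = Renamer(["eax", "ebx", "ecx", "edx"], num_phys=8)
r.rename("mov", "ebx", "eax")
r.rename("add", "edx", "edx", "eax")
r.rename("add", "ecx", "ecx", "ebx")
r.rename("zero", "ebx")
print(r.issued)
```

Four instructions go in, but only the two adds end up in `issued`, and both read the same physical source register. The mov and the zeroing are finished the moment the map is updated, which is the sense in which they are "free".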