By: Maynard Handley (name99.delete@this.name99.org), August 16, 2014 2:24 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on August 16, 2014 12:55 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 9:35 am wrote:
> >
> > As a technical matter, the changes to POWER ISA have all, as far as I know, been on
> > the OS side; mostly to give the hypervisor better control over paging. I think there's
> > been one change to the collection of user-level sync instructions. I agree with your
> > stance that ISA matters, but I don't think this aspect of POWER proves the point.
> >
>
> After Power4 IBM added plenty of stuff on FP side.
> First Altivec (Power5)
> Then 2-wide DPFP SIMD.
> Then (or was it in the same core as DPFP SIMD? I don't remember) 64 SIMD registers.
> At some point they also added DFP, but that's very specialized and, IMHO, done
> more for a marketeering checkbox than due to real need of real customers.
> I don't remember what's going on on the encryption side. Everybody seems to have AES; doesn't POWER have it too?
>
You're right. I forgot about the post-AltiVec stuff you list (though it's all, IMHO, pretty trivial). I guess you could also add the various modified NOPs they've added to allow setting the priority of first 4 and then 8 threads.
However (other people may have different opinions), the direction this has taken is basically meaningless. No one is excited or impressed by the fact that adding an AES instruction to your CPU makes it run AES code faster; likewise, the addition of wider vectors is not especially interesting. So what IS interesting?
IMHO:
- on the instruction side, adding vectors for the FIRST time is interesting, because that does require not completely trivial changes to the whole core. Depending on exactly how the core operates, the changes may be mostly obvious, but they can be more substantial --- for example, a core that used to carry register VALUES in the instruction queue and ROB may be forced to switch to carrying register identifiers, with knock-on effects on the rest of the core.
- instructions that substantially affect the load/store/memory/synchronization architecture are interesting. The double-loads of ARM don't really qualify for this (they're cute, but don't change how you'd design the whole core). Prefetching instructions could, maybe, qualify, if ANYONE had a version of them that didn't completely suck. The pattern here on both POWER and x86 has been that initial versions sucked, incompatible new versions were added which also sucked, and eventually everyone involved just gave up and went to invisible HW prefetching (which is probably the right way to do it). ARM, as far as I know, bypassed the "let's do it crappily in SW" stage and went straight to HW.
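To make the "do it crappily in SW" stage concrete, here is roughly what software prefetching looks like from C. This is a sketch using GCC/Clang's __builtin_prefetch intrinsic (which lowers to the target's prefetch hint instruction); the loop and the prefetch distance are illustrative, and tuning that distance per microarchitecture is exactly the kind of fragility that made the SW approach lose out to HW prefetchers.

```c
#include <stddef.h>

/* Sum an array while issuing software prefetch hints a fixed
   distance ahead of the current element.  PREFETCH_DISTANCE is a
   per-machine tuning guess (illustrative value here); a distance
   that is right on one core is wrong on the next, which is why
   invisible HW prefetching won in the end. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, 0 = read, 3 = high temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        total += a[i];
    }
    return total;
}
```

A prefetch hint is architecturally a no-op, so the compiler is free to drop it and the result is the same either way; only the timing (ideally) changes.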
The handling of non-aligned loads would fall into this category except that, as far as I know, there are no great divergences here. POWER is probably too constrained with the vector loads (unless they have loosened up relative to the PPC days, which they may have). Intel is probably too free --- loads of hassle with not much benefit. ARMv8 seems to have the right balance of making the cases that matter for programmers (who aren't sociopaths) efficient, while allowing weird edge cases (the kinds of things where one load crosses two pages) to be slow or even fail (e.g. the restrictions on the double-register loads).
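For reference, the portable way a programmer expresses an unaligned load in C is a small memcpy, which the compiler lowers to a single load on ISAs where unaligned access is cheap and to a byte-assembly sequence elsewhere. This is a generic sketch, not tied to any of the ISAs above:

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value (native byte order) from an arbitrarily
   aligned pointer.  The memcpy is the portable idiom: on targets
   with efficient unaligned loads it compiles down to one load
   instruction; on stricter targets the compiler emits the
   byte-by-byte sequence itself, so the source stays UB-free. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Dereferencing a misaligned `uint32_t *` directly is undefined behavior in C precisely because ISAs diverge on what the hardware will tolerate; the memcpy idiom pushes that decision onto the compiler.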
Likewise for synchronization primitives. The consensus, as I read the literature, is that load-linked/store-conditional is substantially easier to implement and get right than LOCK prefixes and the random mix of other things that Intel has. I'm guessing it's then also easier to build HW TM on top of load-linked/store-conditional.
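A sketch of the contrast, using C11 atomics as neutral ground: atomic_compare_exchange_weak is the operation an LL/SC machine (POWER, ARM) expands into a load-linked/store-conditional pair --- the "weak" form exists precisely to admit the spurious failures LL/SC allows --- while an x86 compiler turns the same call into a LOCK CMPXCHG. The function below is an illustrative fetch-and-add built on that loop, not anyone's production primitive:

```c
#include <stdatomic.h>

/* Atomic add built from a compare-and-swap retry loop.  On an
   LL/SC machine the compare_exchange_weak expands to a
   load-linked/store-conditional pair, and the loop retries both
   real races and LL/SC's spurious SC failures.  On x86 it becomes
   a single LOCK CMPXCHG, with the LOCK machinery hidden in HW. */
long atomic_add(_Atomic long *p, long delta)
{
    long old = atomic_load_explicit(p, memory_order_relaxed);
    while (!atomic_compare_exchange_weak(p, &old, old + delta))
        ;  /* on failure, 'old' was refreshed with the current value */
    return old + delta;  /* the value we successfully installed */
}
```

The point of the example: the LL/SC shape is "observe, compute, conditionally commit, retry", which is also the natural skeleton of a transactional-memory implementation, whereas the LOCK-prefix family is a grab-bag of special-cased read-modify-write operations.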
Beyond this, I'm guessing it's substantially harder to design for and verify the Intel memory model than the looser POWER and ARM memory models.
- All in all, however, there's too much obsession in this argument with exact instructions. When I talk about it being hard to design an x86 CPU, I'm not saying this because it's hard to add a Decimal Adjust (DAA) instruction, or even a REP MOVS instruction. There's vastly more to x86 than just the instruction set. It's the fact that you have to support 8086 mode, and 286 mode, and virtual 8086 mode, and SMM, and all the rest of it. It's ALL this stuff that I see as baggage. And yes, once you've done it once, at about the PPro level, you have a basis to work from for the future, but it's always imposing a tax. You can't, for example, COMPLETELY streamline your load/store path, even for x64, because you have to support the FS and GS segment registers, so you need weird side branches to handle that.
I stated in a previous post that David and I maybe ultimately agreed that the x86 tax was worth about two development years, and that continues to seem to me like a good way to view it.
Does this mean two extra years to create an equivalently performing CPU, or that, *with the same sized team and same process*, an ARM or POWER device would lead an x86 device by about two years? I'd say, from the very limited evidence we have, both interpretations are reasonable. IF, for example (we'll see soon enough), an Apple A8 performs generally at the level of a Broadwell-Y (a flexible metric --- there's absolute single-threaded performance, multi-threaded performance, AVX-assisted FLOPS, performance/watt, GPU performance, dynamic range of performance, etc. --- but let's ignore the details for now), one could reasonably argue that the x86 complexity tax is the equivalent of about two years of process improvement.