By: David Kanter (dkanter.delete@this.realworldtech.com), August 16, 2014 4:09 pm
Room: Moderated Discussions
> However (other people may have different opinions) the direction this has taken is basically meaningless.
> No-one is excited or impressed by the fact that adding an AES instruction to your CPU makes it run AES code
> faster; likewise the addition of wider vectors is not especially interesting. So what IS interesting?
>
> IMHO:
> - on the instruction side, adding vectors for the FIRST time is interesting, because that does
> require not completely trivial changes to the whole core. Depending on exactly how the core
> operates, the changes may be mostly obvious, but they can be more substantial --- for example
> a core that used to carry register VALUES in the instruction queue and ROB may be forced to
> switch to carry register identifiers, with knock-on effects to the rest of the core.
Actually, Intel didn't change anything about the internal design to accommodate SSE. It wasn't until the AVX registers that they bothered to move from a data-ful ROB to a data-less ROB (the latter had already been used by the P4).
I'd also point out that AVX's 3-operand forms and FMA are much more disruptive to the core uarch than vectors themselves; supporting a third source operand definitely forces a change in the uop format.
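To illustrate the operand-count point, here's a minimal sketch in C intrinsics (the function names are mine, and of course Intel's internal uop encoding isn't public; this only shows the ISA-level shape):

    #include <immintrin.h>

    /* SSE: destructive 2-operand instructions. Each uop names at most two
       source registers, and the destination overwrites one of them. */
    __m128 madd_sse(__m128 a, __m128 b, __m128 c)
    {
        __m128 t = _mm_mul_ps(a, b);   /* mulps: t = a * b */
        return _mm_add_ps(t, c);       /* addps: t = t + c */
    }

    /* FMA: a single instruction reads THREE source registers at once,
       which is what forces a wider source field in the uop format. */
    __m256 madd_fma(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_fmadd_ps(a, b, c);   /* vfmadd: a * b + c in one uop */
    }

(Build with -mavx -mfma.)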
> - instructions that substantially affect the load/store/memory/synchronization architecture are interesting.
> The double-loads of ARM don't really qualify for this (they're cute, but don't change how you'd design
> the whole core). Prefetching instructions could, maybe, qualify, if ANYONE had a version of them that
> didn't completely suck. The pattern here on both POWER and x86 has been that initial versions sucked,
> incompatible new versions were added which also sucked, and eventually everyone involved just gave
> up and went to invisible HW prefetching (which is probably the right way to do it). ARM, as far as
> I know, bypassed the "let's do it crappily in SW" stage and went straight to HW.
Yeah, SW prefetch is really tough to get right. That said, it actually works for Java and other dynamic languages, where the runtime has a good deal of visibility into the program's behavior.
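As a sketch of why it's so tough (in C; the constant and names are mine): the whole game is picking the prefetch distance, and the right value shifts with every core, clock, and memory subsystem.

    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch */

    /* The prefetch distance is the hard part: too short and the line isn't
       there in time, too long and it's evicted before use. 16 is a guess
       that would need retuning for each microarchitecture. */
    #define PREFETCH_AHEAD 16

    float sum(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) {
            /* Prefetching past the end of the array is harmless;
               prefetch instructions don't fault. */
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }

A JIT can retune that constant for the machine it's actually running on, which is the advantage a Java runtime has over a statically compiled binary.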
> The handling of non-aligned loads would fall into this category except that, as far as I know, there are
> no great divergences here. POWER is probably too constrained with the vector loads (unless they have loosened
> up relative to the PPC days, which they may have). Intel is probably too free --- loads of hassle with
> not much benefit. ARMv8 seems to have the right balance of getting the cases that matter for programmers
> (who aren't sociopaths) efficient, while allowing weird edge cases (the kinds of things where one load
> crosses two pages) to be slow or even fail (eg the restrictions on the double-register loads).
I prefer Intel's approach to unaligned accesses. Sure, dealing with page crossing sucks, but it is even more problematic to make paging visible to programmers.
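Concretely (a sketch, with my own function names):

    #include <immintrin.h>
    #include <stdint.h>

    /* Intel's way: MOVDQU-style loads are legal at ANY address; the
       hardware quietly handles cache-line and even page-crossing accesses
       on a slow path the programmer never sees. */
    __m128i load_any(const uint8_t *p)
    {
        return _mm_loadu_si128((const __m128i *)p);
    }

    /* The aligned form is the contract-heavy alternative: _mm_load_si128
       faults if p isn't 16-byte aligned. */
    __m128i load_16B_aligned(const uint8_t *p)
    {
        return _mm_load_si128((const __m128i *)p);
    }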
> Likewise for synchronization primitives. The consensus as I read the literature is
> that load locked/store conditional is substantially easier to implement and get right
> than LOCK prefixes and the random mix of other things that Intel has. I'm guessing
> it's then also easier to build HW TM on top of load locked/store conditional.
> Beyond this, I'm guessing it's substantially harder to design for and verify
> the Intel memory model than the looser POWER and ARM memory models.
Transactional memory is actually harder to implement on relaxed ordering models (e.g., ARM, PPC), since transactional semantics are technically stronger even than processor ordering (i.e., the x86 model), let alone a relaxed one.
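The implementation split shows up even from portable C11 atomics (a sketch; the function name is mine):

    #include <stdatomic.h>

    /* One source-level operation, two very different hardware mappings:
       x86 compilers emit a single "lock xadd", while AArch64 (before the
       later LSE atomics) emits a load-exclusive / store-exclusive retry
       loop, i.e. the LL/SC pattern discussed above. */
    int bump(atomic_int *counter)
    {
        return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
    }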
> - All in all, however, there's too much obsession in this argument about exact instructions. When I
> talk about it being hard to design an x86 CPU, I'm not saying this because it's hard to add a Decimal
> Adjust ASCII instruction, or even a REP MOV instruction. There's vastly more to x86 than just the instruction
> set. It's the fact that you have to support 8086 mode, and 286 mode, and virtual 8086 mode, and SMM
> and all the rest of it. It's ALL this stuff that I see as baggage. And yes, once you've done it once,
> at about the PPro level, you have a basis to work off for the future, but it's always imposing a tax.
> You can't, for example, COMPLETELY streamline your load/store path, even for x64, because you have
> to support FS and GS registers, so you need weird side branches to handle that.
Generally I'd agree with you. The problem is that quantifying those effects is very difficult. I've talked with Andy Glew, Bob Colwell, Mike Haertel, and probably a dozen other architects. They all say that x86 imposes overhead, but nobody thought it was ever more than 10%, and many suspected 5% was a good guess.
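To make the FS/GS point above concrete (a sketch; the variable name is mine): on Linux/x86-64 the common TLS models address thread-locals relative to the FS base, so even a pure 64-bit load/store path still needs the segment-base add.

    /* Thread-local storage keeps FS load-bearing in 64-bit mode: accesses
       to this variable typically compile to %fs:offset addressing, e.g.
         mov %fs:counter@tpoff, %rax
       so the load pipeline must still fold in a segment base. */
    __thread long counter;   /* GCC/Clang TLS extension */

    long bump_tls(void)
    {
        return ++counter;
    }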
> I stated in a previous post that David and I maybe ultimately agreed that the x86 tax was worth
> about two development years, and that continues to me to seem like a good way to view it.
I'd say it's worth about 5-10% performance on the CPU core, all things being equal. Of course, things are rarely equal in the real world.
> Does this mean two extra years to create an equivalent performing CPU, or that, *with the same sized team
> and same process*, an ARM or POWER device would lead an x86 device by about two years? I'd say, from the
> very limited evidence we have, both interpretations are reasonable. IF, for example (we'll see soon enough)
> an Apple A8 performs generally at the level of a Broadwell-Y (a flexible metric --- there's absolute single-threaded
> performance, multi-threaded performance, AVX-assisted FLOPS, performance/watt, GPU performance, dynamic
> range of performance etc --- but let's ignore the details for now) one could reasonably argue that the
> x86 complexity tax is the equivalent of about two years in process improvement.
I don't think comparing BDW or HSW, which target everything from tablets to servers, to Apple's phone/tablet SoC will necessarily yield a lot of information.
David