By: Ricardo B (ricardo.b.delete@this.xxxxx.xx), August 16, 2014 5:43 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 2:24 pm wrote:
> The handling of non-aligned loads would fall into this category except that, as far as I know, there are
> no great divergences here. POWER is probably too constrained with the vector loads (unless they have loosened
> up relative to the PPC days, which they may have). Intel is probably too free --- loads of hassle with
> not much benefit. ARMv8 seems to have the right balance of getting the cases that matter for programmers
> (who aren't sociopaths) efficient, while allowing weird edge cases (the kinds of things where one load
> crosses two pages) to be slow or even fail (eg the restrictions on the double-register loads).
>
> Likewise for synchronization primitives. The consensus as I read the literature is
> that load locked/store conditional is substantially easier to implement and get right
> than LOCK prefixes and the random mix of other things that Intel has. I'm guessing
> it's then also easier to build HW TM on top of load locked/store conditional.
> Beyond this, I'm guessing it's substantially harder to design for and verify
> the Intel memory model than the looser POWER and ARM memory models.
Whoa, whoa, easy there.
You can't chalk up x86's stricter memory model, its support for unaligned loads, or its atomics as just "x86 complexity".
Sure, it's easier to implement only aligned loads and LL/SC in hardware.
But it's also easier to implement a non-pipelined in-order CPU than a super-scalar pipelined out-of-order CPU; it's easier to implement a CPU without SIMD than one with SIMD.
Yes, these features have been there forever and you can't remove them. But that doesn't mean they're bad for performance.
For example, Intel/AMD don't just support unaligned accesses; they make them work *well*, and increasingly so.
Originally, SSE didn't support unaligned load-ops, and unaligned loads/stores carried a performance penalty. But over time, they added support for unaligned load-ops and made the performance penalty go away.
Why? Because at the end of the day, they concluded the performance improvement was worth the hardware complexity.
> - All in all, however, there's too much obsession in this argument about exact instructions. When I
> talk about it being hard to design an x86 CPU, I'm not saying this because it's hard to add a Decimal
> Adjust ASCII instruction, or even a REP MOV instruction. There's vastly more to x86 than just the instruction
> set. It's the fact that you have to support 8086 mode, and 286 mode, and virtual 8086 mode, and SMM
> and all the rest of it. It's ALL this stuff that I see as baggage. And yes, once you've done it once,
> at about the PPro level, you have a basis to work off for the future, but it's always imposing a tax.
> You can't, for example, COMPLETELY streamline your load/store path, even for x64, because you have
> to support FS and GS registers, so you need weird side branches to handle that.
People keep referring to the x86 tax in terms of transistors, as if those transistors could directly be translated to a performance improvement.
That was true maybe a decade ago.
But now, with designs being limited by power, local hot spots and wire delay, it's no longer so easy to throw in transistors and improve core performance.
From the point of view of power and performance, the complexity of the ISA is not so relevant as long as the baggage doesn't sit in critical paths.
And for high performance Intel cores, it mostly doesn't.
The 8086 mode doesn't add too much complexity.
In hardware, it's mostly the 80386 protected mode with preset values loaded into the segment descriptor registers and the MMU disabled.
There's no support for 286 mode. 286 protected mode, fortunately, did not carry on to the 80386.
AAA is microcoded.
REP is microcoded too. It's mostly useless, but REP MOVS*/STOS* are actually Good Things™ to have: they tend to perform better and more consistently than software memcpy/memmove/memset.
The segment registers are annoying, but there's little evidence the extra adder and checks pose a penalty in the virtual address generation path.
If that's what worries you as x86 tax, you're looking at the wrong place.
One of the major x86 problems is its complex encoding, which clearly imposes area, power and performance penalties (decode throughput restrictions, pipeline length).
But that penalty is greatly reduced by using a µop cache (Sandy Bridge onwards).
Another is its use of destructive operations, which require extra mov reg,reg instructions (sometimes, lots of them) that cost energy and execution resources on every x86 CPU.
Again, that penalty can be greatly reduced by eliminating them at rename (Ivy Bridge onwards, next AMD high-end cores too IIRC).
Its smaller number of GPRs, which forces more traffic through the stack and caches, can also be an issue.