By: rwessel (robertwessel.delete@this.yahoo.com), August 16, 2014 9:43 pm
Room: Moderated Discussions
Ricardo B (ricardo.b.delete@this.xxxxx.xx) on August 16, 2014 5:43 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on August 16, 2014 2:24 pm wrote:
> > The handling of non-aligned loads would fall into this category except that, as far as I know, there are
> > no great divergences here. POWER is probably too constrained
> > with the vector loads (unless they have loosened
> > up relative to the PPC days, which they may have). Intel is probably too free --- loads of hassle with
> > not much benefit. ARMv8 seems to have the right balance of getting the cases that matter for programmers
> > (who aren't sociopaths) efficient, while allowing weird edge cases (the kinds of things where one load
> > crosses two pages) to be slow or even fail (eg the restrictions on the double-register loads).
> >
> > Likewise for synchronization primitives. The consensus as I read the literature is
> > that load locked/store conditional is substantially easier to implement and get right
> > than LOCK prefixes and the random mix of other things that Intel has. I'm guessing
> > it's then also easier to build HW TM on top of load locked/store conditional.
> > Beyond this, I'm guessing it's substantially harder to design for and verify
> > the Intel memory model than the looser POWER and ARM memory models.
>
> Wow, wow, easy there.
>
> You can't chalk x86's stricter memory model, its support for
> unaligned loads or the atomics up to just "x86 complexity".
> Sure, it's easier to implement only aligned loads and LL/SC in hardware.
>
> But it's also easier to implement a non-pipelined in order CPU than a super-scalar pipelined
> out of order CPU; it's easier to implement a CPU without SIMD than one with SIMD.
>
> Yes, they've been there for ever and you can't remove them.
> But that doesn't mean they're bad for performance.
>
> For example, Intel/AMD don't just support unaligned words. They
> make them work *well*. And they do it ever increasingly.
> Originally, SSE didn't support unaligned load-ops and unaligned load/stores had a performance penalty.
> But with time, they added support for unaligned load-ops and made the performance penalty go away.
>
> Why? Because in the end of the day, they conclude the performance
> improvement is worth the hardware complexity.
I think many of the software guys (myself included; if you like, I'm sure Linus will be happy to give you an earful on the subject) will argue that good support for unaligned accesses is a major performance benefit on real systems, even at a non-trivial cost in some "processor level" or microbenchmark performance measure. Even IPF and POWER support unaligned accesses to some extent.
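
To make the software side of that concrete, here is a minimal sketch in C of the two usual ways to pull a 32-bit field out of a byte buffer at an arbitrary offset, the kind of thing that turns up in packet and file-format parsing. The function names are made up for illustration. The memcpy form is well-defined C at any alignment, and on hardware with cheap unaligned accesses a compiler can typically reduce it to a single load; the byte-wise form is the fallback you end up with where unaligned loads are slow or trap. Which code a given compiler actually emits is an assumption here, not something measured.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value in host byte order from any
 * byte address. memcpy avoids the undefined behavior of casting the
 * pointer to uint32_t* when p is not suitably aligned. */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical fallback: assemble a little-endian 32-bit value one byte
 * at a time. Correct on any machine regardless of alignment support or
 * host endianness, at the cost of four loads plus shifts and ORs. */
static uint32_t load_u32_le_bytewise(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

On a little-endian host the two return the same value for the same pointer, so the difference is purely in how much work the hardware lets the first one skip.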