By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), November 5, 2006 3:14 pm
Room: Moderated Discussions
Linus Torvalds (torvalds@osdl.org) on 11/5/06 wrote:
---------------------------
>Michael S (already5chosen@yahoo.com) on 11/5/06 wrote:
>>
>>Wishful thinking rather than good news. W.r.t. alignment
>>SSE is no better than your average RISC.
>
>Oh, but it is, and it's getting better.
>
>Yes, there are special instructions to handle the unaligned
>case, and yes, they used to suck horribly. Intel even added
>a special "load-only" unaligned instruction to make it suck
>less, but if you look at the Core 2 details, you'll notice
>that that instruction actually is now documented to be the
>same as the generic unaligned move ("movdqu") instruction.
I bet this is changing due to autovectorization.
>So yes, doing a full 16-byte byte shifter was considered
>too expensive, and I do agree that there is a big difference
>between a "word load" and a block load like MMX, so it does
>end up being a special case.
With SIMD you pretty much need a big permute block to change arrays-of-structs into structs-of-arrays. That block can be reused to deal with the unaligned case too. Since you typically need to do the permutes after loading and before storing vectors, it makes unaligned support quite cheap and natural in a SIMD pipeline.
>But going by history, it will
>follow the same path that the integer side did: unaligned
>loads will be slower, but not all that much slower, and yes,
>hardware does handle them.
>
>Already, in Core 2, I think you can do a "movdqu" every
>two cycles. And trust me, Intel does it that well because
>it actually does matter.
SIMD needs hardware unaligned support: most vectors are unaligned (they are typically aligned only to their base type). Doing alignment in software is impractical in most cases (high overhead, bad code size). For good autovectorization you want to be able to take a random pointer and just load N elements without worrying about alignment.
>So even for block loads and stores, x86 does those 16-byte
>entities better unaligned than the RISC people did normal
>unaligned words. Two cycles, one single instruction.
Two cycles for an unaligned load is typical, even on RISC. In fact the latest ARM (Cortex-A8) can do it in one cycle as long as the access is within a cacheline - just like x86... There is a cost when you straddle a line boundary of course.
The MIPS approach of using 2 instructions (lwl/lwr) for an unaligned access was flawed. It would have been better to have separate instructions for aligned and unaligned cases and let the microarchitects decide how fast they want to make either.
>In fact, we've already seen x86 widen the SSE datapath
>and ALUs from 64 bits (two cycles for all operations,
>even the simple ones) to 128 bits (giving us single-cycle
>ops for most of it). It took a few generations.
>
>And I'd not be at all surprised if in another one or two
>generations you'll find that they widened the load path,
>and 16-byte unaligned loads end up doing the same thing
>that they currently do for the smaller integer loads.
I would be surprised if they became faster than 2 cycles. The simple reason is that very wide accesses have a high chance of straddling a cache line, and that could mean an expensive replay. So it's slower unless you know that a very high percentage falls within a cacheline.
Another possibility is to use load-multiple instructions. These reduce the overhead of unaligned accesses, since N accesses need only N+1 cycles rather than 2N. For example, Cortex-A8 can read up to 64 bytes in just 5 cycles from an unaligned address; Core 2's movdqu needs 8.
Wilco
---------------------------