By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), November 17, 2006 2:51 pm
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto@gmail.com) on 11/14/06 wrote:
---------------------------
>Wilco (Wilco.Dijkstra@ntlworld.com) on 11/13/06 wrote:
>---------------------------
>>Yes all vector load/stores handle unaligned accesses. Vectors are typically aligned
>>to the natural alignment of the elements (ie. they are unaligned). Vector loads
>>also deal with (de)interleaving, so the 4-way structure load that takes 21 instructions
>>on Altivec takes 1 instruction in NEON. Both features were key requirements as they
>>enable vectorization by compilers and greatly simplify low-level programming. It's
>>a lot more powerful than anything I've seen.
>
>I see, does that mean that Neon instructions can write to more than one register?
>I don't see how would you deinterleave the stream without accessing more than one register.
Correct, the load/store instructions can write (loads) or read (stores) up to four 64-bit registers. Implementations typically transfer 2 registers per cycle. These instructions enable very high bandwidth without needing more than 1 memory access per cycle. In some ways it is like register windows in software.
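To make the multi-register behaviour concrete, here is a rough scalar model in portable C of what a 4-way deinterleaving load and the matching interleaving store do for 8-bit elements. It is only a sketch: `regs4_t`, `load4_deinterleave` and `store4_interleave` are names made up for illustration, and the loops model the data movement only, not how the hardware sequences it. With the NEON intrinsics in `arm_neon.h`, the load side corresponds to `vld4_u8`, which returns a `uint8x8x4_t` (four 64-bit registers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8, WAYS = 4 };  /* 8 x 8-bit lanes = one 64-bit register */

/* Hypothetical stand-in for four 64-bit vector registers. */
typedef struct {
    uint8_t val[WAYS][LANES];
} regs4_t;

/* Scalar model of a 4-way structure load: reads 32 consecutive bytes
 * laid out as a0 b0 c0 d0 a1 b1 c1 d1 ... and writes all the a's to
 * register 0, the b's to register 1, and so on. One instruction,
 * four destination registers. No alignment is assumed for src. */
regs4_t load4_deinterleave(const uint8_t *src)
{
    regs4_t r;
    assert(src != NULL);
    for (size_t lane = 0; lane < LANES; ++lane)
        for (size_t way = 0; way < WAYS; ++way)
            r.val[way][lane] = src[lane * WAYS + way];
    return r;
}

/* The matching store: reads four registers and writes them back
 * to memory in interleaved order. */
void store4_interleave(uint8_t *dst, const regs4_t *r)
{
    assert(dst != NULL && r != NULL);
    for (size_t lane = 0; lane < LANES; ++lane)
        for (size_t way = 0; way < WAYS; ++way)
            dst[lane * WAYS + way] = r->val[way][lane];
}
```

For example, with packed RGBA pixels as the source, `val[0]` ends up holding eight R values, `val[1]` eight G values, and so on, so each channel can then be processed with ordinary vector arithmetic.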
Wilco