By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), November 13, 2006 5:13 pm
Room: Moderated Discussions
Gabriele Svelto (gabriele.svelto@gmail.com) on 11/8/06 wrote:
---------------------------
>Wilco (Wilco.Dijkstra@ntlworld.com) on 11/6/06 wrote:
>---------------------------
>Yeah, the problem with codecs may be nasty as you are streaming data in; however,
>decoding is becoming less and less of a performance problem today, with DVD players
>accepting all kinds of codecs using processors (and related offloading engines) which
>are fairly average on the performance curve.
Yes, hardware acceleration of codecs is great. But it takes quite a while before the latest (very complicated) codecs are stable (in terms of standardization) and available in hardware. Pure software is simply more flexible. For less performance-demanding codecs it is always better to use software. Even the slowest embedded CPUs are powerful enough to do JPEG in software.
>>I don't believe it is for free. You need quite a few extra instructions, and thus
>>more registers, fetch, decode and execution resources. It may have a minor performance
>>impact in your experience, but it is definitely not for free in terms of codesize,
>>power consumption and software complexity. Reading unaligned interleaved RGB structures
>>takes one instruction on ARM. I bet you need over 10 in Altivec...
>
>You're right about the code size: for unaligned accesses you would still need an
>extra permute for each load instruction which accesses the unaligned data, which
>might add up in the end. However, given the way AltiVec works, if you are accessing a stream
>of data which is unaligned you're in exactly the kind of situation which is very
>easy for AltiVec. In the end you just need one load and one permute for every access
>(instead of two loads) by reusing the data of the previous load, possibly in the
>same register.
Yes, so it's 2N+2 instructions to do N unaligned reads and 2 extra registers (one data, one permute vector) for each stream of unaligned data.
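To make the 2N+2 count concrete, here is a rough sketch in portable C rather than real AltiVec code. The names load_aligned, permute_shift and read_unaligned_stream are made up for illustration; they only emulate the behaviour of an aligned vector load, a permute and the one-off permute-vector setup, and the function counts each emulated instruction. It assumes the buffer is padded so the last aligned load stays in bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VEC_BYTES = 16 };

typedef struct { uint8_t b[VEC_BYTES]; } vec_t;

/* Emulates an aligned vector load: the low 4 bits of the offset are
   ignored, so the load always starts on a 16-byte boundary. */
static vec_t load_aligned(const uint8_t *base, size_t offset)
{
    vec_t v;
    memcpy(v.b, base + (offset & ~(size_t)(VEC_BYTES - 1)), VEC_BYTES);
    return v;
}

/* Emulates an alignment permute: selects 16 consecutive bytes, starting
   at 'shift', from the 32-byte concatenation of lo and hi. */
static vec_t permute_shift(vec_t lo, vec_t hi, unsigned shift)
{
    vec_t r;
    for (unsigned i = 0; i < VEC_BYTES; i++) {
        unsigned idx = shift + i;
        r.b[i] = idx < VEC_BYTES ? lo.b[idx] : hi.b[idx - VEC_BYTES];
    }
    return r;
}

/* Reads n unaligned vectors starting at base + offset and returns the
   number of emulated instructions. Two values stay live across the
   loop: 'prev' (data) and 'shift' (the permute vector).
   Assumption: the buffer extends at least one full aligned vector past
   the last byte requested. */
static size_t read_unaligned_stream(const uint8_t *base, size_t offset,
                                    size_t n, vec_t *out)
{
    assert(base != NULL && out != NULL);
    size_t ops = 0;

    unsigned shift = (unsigned)(offset & (VEC_BYTES - 1)); /* permute setup */
    ops++;
    vec_t prev = load_aligned(base, offset);               /* priming load */
    ops++;

    for (size_t i = 0; i < n; i++) {
        vec_t next = load_aligned(base, offset + (i + 1) * VEC_BYTES);
        ops++;                                             /* one load */
        out[i] = permute_shift(prev, next, shift);
        ops++;                                             /* one permute */
        prev = next;                                       /* reuse the data */
    }
    return ops;                                            /* 2n + 2 */
}
```

Each loop iteration reuses the previous load, which is where the "one load and one permute per access" comes from; the two instructions outside the loop are the +2.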
>If you are dealing with interleaved data this is a plus as you can
>combine the shift-align permute vector with the de-interleave permute vector and
>have it done for free (since you would already need a permutation for alignment).
No. N-way (de)interleaving requires N log N permutes, and none of these can be merged into the initial de-alignment permute. So for unaligned 4-way de-interleaving you need 5 loads, 5 alignment permutes and 8 interleaving permutes, i.e. 18 instructions plus 3 outside the loop to set up the permutes. I don't call that for free.
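As an illustration of where the 8 interleaving permutes come from, here is a sketch of a 4-way de-interleave built from two stages of even/odd permutes, again in portable C with an emulated two-input permute. The names permute, SEL_EVEN, SEL_ODD and deinterleave4 are hypothetical, and this is one possible permute network, not necessarily the one an optimized AltiVec routine would use. The alignment step is left out; the inputs are assumed to be already aligned.

```c
#include <assert.h>
#include <stdint.h>

enum { VEC_BYTES = 16 };

typedef struct { uint8_t b[VEC_BYTES]; } vec_t;

/* Counts emulated permute instructions. */
static unsigned permute_count = 0;

/* Emulates one two-input permute: output byte i is byte sel[i] of the
   32-byte concatenation of a and b. */
static vec_t permute(vec_t a, vec_t b, const uint8_t sel[VEC_BYTES])
{
    vec_t r;
    for (int i = 0; i < VEC_BYTES; i++) {
        uint8_t s = sel[i];
        assert(s < 2 * VEC_BYTES);
        r.b[i] = s < VEC_BYTES ? a.b[s] : b.b[s - VEC_BYTES];
    }
    permute_count++;
    return r;
}

/* Select the even-numbered / odd-numbered bytes of the concatenation. */
static const uint8_t SEL_EVEN[VEC_BYTES] =
    { 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 };
static const uint8_t SEL_ODD[VEC_BYTES] =
    { 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31 };

/* De-interleaves 16 four-byte structures (64 bytes in 4 vectors) into
   4 vectors, one per structure member. Two stages of 4 permutes each,
   i.e. 4 * log2(4) = 8 permutes. */
static void deinterleave4(const vec_t in[4], vec_t out[4])
{
    /* Stage 1: split members {0,2} from members {1,3}. */
    vec_t e01 = permute(in[0], in[1], SEL_EVEN);
    vec_t o01 = permute(in[0], in[1], SEL_ODD);
    vec_t e23 = permute(in[2], in[3], SEL_EVEN);
    vec_t o23 = permute(in[2], in[3], SEL_ODD);

    /* Stage 2: split each pair into its two members. */
    out[0] = permute(e01, e23, SEL_EVEN);
    out[2] = permute(e01, e23, SEL_ODD);
    out[1] = permute(o01, o23, SEL_EVEN);
    out[3] = permute(o01, o23, SEL_ODD);
}
```

In this network every permute takes its select vector from a fixed even/odd pattern and two full input vectors, so the alignment shift still has to be applied as its own permute before stage 1.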
>BTW I didn't look into the new vector extensions for ARM, how do they look? Do
>they tolerate unaligned accesses?
ARM NEON is a hybrid 64/128-bit SIMD instruction set designed for easy autovectorization. For a quick overview see slides 7-9 of
http://www.iee-cambridge.org.uk/arc/seminar05/slides/RichardGrisenthwaite.pdf
or http://www.arm.com/pdfs/Tiger%20Whitepaper%20Final.pdf
Yes, all vector loads/stores handle unaligned accesses. Vectors are typically aligned to the natural alignment of the elements (i.e. they are not aligned to the vector size). Vector loads also deal with (de)interleaving, so the 4-way structure load that takes 21 instructions on Altivec takes 1 instruction in NEON. Both features were key requirements as they enable vectorization by compilers and greatly simplify low-level programming. It's a lot more powerful than anything I've seen.
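For reference, this is roughly what the 4-way structure load computes, written as a scalar model in portable C. The type u8x8x4_t and the function load4_deinterleave are made-up names for this sketch; on a real NEON target the equivalent is a single VLD4 instruction, available from C as the vld4_u8 intrinsic in arm_neon.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };

/* Four 64-bit registers of eight byte lanes each, one register per
   structure member (e.g. R, G, B, A). */
typedef struct { uint8_t val[4][LANES]; } u8x8x4_t;

/* Scalar model of a 4-element structure load: lane i of register c
   receives src[4*i + c]. src may have any byte alignment. */
static u8x8x4_t load4_deinterleave(const uint8_t *src)
{
    assert(src != NULL);
    u8x8x4_t r;
    for (size_t i = 0; i < LANES; i++)
        for (size_t c = 0; c < 4; c++)
            r.val[c][i] = src[4 * i + c];
    return r;
}
```

Calling load4_deinterleave(buf + 1) on a byte buffer gives the eight first members in val[0], the eight second members in val[1], and so on, regardless of how buf + 1 is aligned.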
>I heard that they've been implemented with
>full exception handling in the front-end which deals with ordinary instructions
>so that the back-end was made very simple and low-power.
Yes, in Cortex-A8 (ARM's high-end CPU) the NEON unit sits at the back of the pipeline and can't generate any exceptions. This allows for a zero-cycle load latency from L1 or L2 (128-bit wide). On MPEG4 it is expected to deliver performance comparable with Pentium 3/4 while using less than 1 Watt :-)
Wilco