By: Ungo (a.delete@this.b.c.d.e), May 14, 2013 3:50 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 14, 2013 5:45 am wrote:
> Except for the final entropy encoding stage (Huffman, arithmetic or similar) which part
> of video encoders is not SMT friendly and poorly suited to implementation on a [GP]GPU?
>
> And entropy encoding itself is such a tiny part of the total encode job that it hardly justifies
> a major HW feature. At best, it can justify the addition of a couple of instructions.
IMO you couldn't be more wrong about there being no justification.
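H.264 CABAC actually isn't very SMT friendly. It's an adaptive arithmetic code: every bin is coded using the coder state and context probabilities left behind by the bins before it, so the work is inherently serial and ill suited to multiple threads. In order to thread it, you need to live with either a lower compression ratio (work divided into many small independent chunks, each restarting its probability models from scratch) or high latency (larger chunks).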
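To make the trade-off concrete, here is a minimal Python sketch. It is not CABAC: it assumes a simple count-based adaptive model (a Laplace estimator, counts starting at 1 and 1) in place of CABAC's context state machine, and it computes the ideal code length that an arithmetic coder driven by that model would approach. The function name `adaptive_cost_bits` and the 5%-ones test source are hypothetical. Resetting the model at each chunk boundary is what allows chunks to be coded in parallel, and it is also what costs bits.

```python
import math
import random


def adaptive_cost_bits(bins, chunk_size=None):
    """Ideal code length in bits of `bins` under an adaptive binary model.

    Hypothetical illustration, not CABAC: the model is a Laplace estimator
    (counts start at 1 and 1). An arithmetic coder driven by this model
    produces output close to this length.
    """
    total = 0.0
    zeros, ones = 1, 1
    for i, b in enumerate(bins):
        if chunk_size is not None and i % chunk_size == 0:
            # Start of an independent chunk: discard the learned statistics
            # so this chunk can be coded without waiting for the previous one.
            zeros, ones = 1, 1
        # The probability used for bin i depends on every earlier bin in the
        # same chunk: this is the serial dependency that resists threading.
        p = (ones if b else zeros) / (zeros + ones)
        total -= math.log2(p)
        if b:
            ones += 1
        else:
            zeros += 1
    return total


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical skewed source: about 5% ones.
    bins = [1 if rng.random() < 0.05 else 0 for _ in range(4096)]
    for chunk in (None, 256, 64, 16):
        label = "whole" if chunk is None else f"chunks of {chunk}"
        print(f"{label:>15}: {adaptive_cost_bits(bins, chunk):7.1f} bits")
```

On a skewed input like this, the chunked totals should come out higher than the whole-sequence total, with the gap growing as chunks shrink, because every chunk pays again to relearn the statistics.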
Lots of important video encoding applications need both high compression and minimal latency at the same time. Think videoconferencing, remote display (WiDi or Apple's proprietary protocol), and so on.
Power efficiency is also very important in mobile devices, for values of "mobile" including x86 notebooks. It might be possible to design some extra instructions to accelerate arithmetic encoding, but if you're going to have dedicated HW for other parts of the encoding pipeline (and you'll want to), it's almost certain to be easier, more power efficient, and even more area efficient to make entropy coding another fixed-function encoder stage instead.
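As a hedged illustration of what "a couple of instructions" might target, consider renormalization. H.264's CABAC keeps a 9-bit range in [256, 510] between bins, and its reference description doubles the range one bit at a time until it is back above 256. The sketch below (hypothetical function names) shows that the shift count can instead be computed in one step from the position of the highest set bit, which a count-leading-zeros instruction provides. It only covers the shift count; emitting the bits and resolving carries is still serial work that a few instructions would not remove.

```python
def renorm_shift_loop(rng):
    """Doublings needed to bring a range (rng >= 1) back to >= 256, one at a time."""
    shift = 0
    while rng < 256:
        rng <<= 1
        shift += 1
    return shift


def renorm_shift_clz(rng):
    """Same count in one step from the highest set bit (rng >= 1).

    In C on a 32-bit value this corresponds to clz(rng) - 23 for rng < 512;
    Python's int.bit_length() stands in for the count-leading-zeros here.
    """
    return max(0, 9 - rng.bit_length())


if __name__ == "__main__":
    for r in (2, 6, 100, 255, 256, 510):
        print(r, renorm_shift_loop(r), renorm_shift_clz(r))
```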