By: Dummond D. Slow (mental.delete@this.protozoa.us), May 18, 2021 1:29 pm
Room: Moderated Discussions
Dummond D. Slow (mental.delete@this.protozoa.us) on November 18, 2020 3:47 pm wrote:
> different anon (different.delete@this.anon.com) on November 18, 2020 1:27 pm wrote:
> > Dummond D. Slow (mental.delete@this.protozoa.us) on November 18, 2020 11:06 am wrote:
> > > Wilco (wilco.dijkstra.delete@this.ntlworld.com) on November 18, 2020 9:42 am wrote:
> > > > Dummond D. Slow (mental.delete@this.protozoa.us) on November 18, 2020 9:21 am wrote:
> > > > > Maynard Handley (name99.delete@this.name99.org) on November 18, 2020 9:13 am wrote:
> > > > > >
> > > > > > x264 in SPEC is not there to help you decide which PC to buy for ripping DVD content!
> > > > > > It is there as an exemplar of certain styles of code: various generic compression techniques
> > > > > > (so lots of bit by bit manipulation) and various image analysis techniques (so searches
> > > > > > over images and image comparisons at various frequency granularities).
> > > > > >
> > > > >
> > > > > You didn't read it? If x264 is an example of a kind of code, it is an example of code
> > > > > heavily optimised with multimedia (integer) SIMD. It's a great example, or maybe
> > > > > too great: other codebases like ffmpeg or x265 will be a bit less optimized.
> > > > >
> > > > > If you want to explore such code, run it with assembly. It has assembly for ARM too, and not
> > > > > a little of it. Without SIMD, it is the opposite of an example of multimedia compression code.
> > > >
> > > > You do realise that many of the key loops are autovectorized right? Yes it's probably not quite
> > > > as fast as handwritten libraries, but it runs much faster with vectorization enabled.
> > > >
> > > > Wilco
> > >
> > > I already said that elsewhere: autovectorization mostly fails on the kind of integer SIMD routines the
> > > encoders use. It is generally considered not remotely usable for encoders. One reason is that to get
> > > the performance, you usually can't just SIMDify naively, it needs some restructuring and transformation
> > > of the computation to get the kinds of speedups assembly does. It's probably not the only factor.
> > >
> > > Note that this is not from my head, this is based on information I'm getting from the people who do this kind
> > > of code in FFmpeg, Rav1e, Eve, Dav1d, x264, x265. Both x86 and ARM code. So pretty please, if you think this
> > > is not true merely based on your assumptions or knowledge from unrelated fields or general outlook of compilers
> > > and vectorization, stop right now and go ask people who do this code before you continue disputing this.
> >
> > Compiler writers probably look more at SPEC for autovectorization work than basically any other code in the
> > entire world, including the SPEC integer "hard to autovectorize" code that we're talking about here.
> >
>
> And did it change anything? AFAIK no, you still have to do the hand-SIMD. One example illustrating
> this: Do you know why Dav1d AV1 decoding is fast for 10bit profile on ARM but very slow on x86?
>
> Surprise reason: According to the devs of it, ARM has hand-written assembly (sponsored by Netflix
> IIRC). x86 has virtually none for 10bit, because nobody sponsored it. And if you bench it, you will
> see how good autovectorization of multimedia code has become (= it has not, it's bloody slow).
>
> See it for yourself here.
> Check the Graviton2's score relative to x86 for 1) Video Input: Summer Nature 4K (8bit where
> both CPU platforms have assembly optimization) and 2) Chimera 1080p 10-bit, where only ARM
> has assembly optimization. You can see autovectorization proving itself, folks.
>
>
And now we can see proof, in practice, of how essential hand-written SIMD assembly is in multimedia:
https://www.phoronix.com/scan.php?page=news_item&px=AVX2-dav1d-0.9-Benchmarks
See the last graph. The speedup realised by this new AVX2 code is, in other words, the degree to which compiler autovectorization fails to extract the performance that is possible with SIMD.
(Note: the ARM side has already had assembly for this 10bit decoding code for some time.)
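For readers who have not looked at this kind of code, here is a rough sketch of the sort of plain C pixel loop that is left to the compiler when no assembly exists. It is illustrative only and is not taken from dav1d or x264; the function name avg_row_10bit and the choice of operation (a rounded average of two rows of 10bit samples, clamped to the 10bit maximum) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel, not from any real codec: rounded average of two
 * rows of 10-bit samples stored in uint16_t, clamped to 1023. */
static void avg_row_10bit(uint16_t *dst, const uint16_t *a,
                          const uint16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen to 32 bits so the sum cannot overflow. */
        uint32_t v = ((uint32_t)a[i] + b[i] + 1u) >> 1;
        /* Clamp to the 10-bit range. */
        dst[i] = (uint16_t)(v > 1023u ? 1023u : v);
    }
}
```

A loop this simple is about the friendliest case a compiler can get. The routines under dispute are more involved, and the argument above is that they need restructuring by hand before SIMD pays off.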
> Again, I'm just trying to convey what people working in the very field will
> tell you. This is not opinions or gut feelings. Can we already move on?
>
> > >
> > > Point at relevant state of the art video encoder or decoder codebase and I'm fairly
> > > sure it will rely on hand-written assembly. At worst, there will be compiler intrinsics,
> > > but those are already frowned upon and give inferior results.
> >
> >
>
>