By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), July 11, 2013 6:26 am
Room: Moderated Discussions
none (none.delete@this.none.com) on July 11, 2013 5:12 am wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on July 11, 2013 5:03 am wrote:
> [...]
> > Which benchmark is affected by denormals? I thought pretty much any modern
> > CPU nowadays deals with denormals in hardware with minimal penalty...
>
> That's an Intel claim, so I can't say. I have no reason not to believe it.
Alright. I would personally want to see some hard evidence, such as which benchmarks are affected and by how much. The two FP benchmarks where Atom does really badly are the blur and sharpen image filters, but it's hard to see how you could accidentally make a simple filter use denormals.
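For what it's worth, a denormal penalty is easy to measure directly. Below is a minimal microbenchmark sketch (my own, not taken from Geekbench) that times the same multiply-accumulate loop with a normal and a denormal multiplier. On cores that handle denormals via a microcode assist the second run should be dramatically slower; on cores with full hardware support the two should be close:

    /* Minimal sketch, not from Geekbench: compare the cost of a multiply
     * whose result is denormal against a normal one. Build with e.g.
     * gcc -O2 denorm.c (no -ffast-math, which would enable FTZ/DAZ). */
    #include <stdio.h>
    #include <time.h>

    static double time_loop(float scale)
    {
        volatile float x = scale;   /* volatile: reload x every iteration */
        float acc = 1.0f;
        clock_t t0 = clock();
        for (long i = 0; i < 100000000L; i++)
            acc = acc * x + 1.0f;   /* with x denormal, acc*x is denormal */
        clock_t t1 = clock();
        printf("(acc = %g) ", acc); /* use acc so the loop isn't dead code */
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("normal:   %.2fs\n", time_loop(0.5f));   /* normal operand */
        printf("denormal: %.2fs\n", time_loop(1e-39f)); /* below FLT_MIN  */
        return 0;
    }

Note that -ffast-math typically enables flush-to-zero/denormals-are-zero at program startup on x86, which hides the penalty entirely, so results depend on compiler flags as much as on the core.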
> [...]
> > Yes, GCC can still generate some inefficient code at times, especially the array accesses look
> > bad... The Intel version is vectorized, which means the ARM version will be about twice as
> > fast again when built with Neon. So yes, setting compiler options etc right matters...
>
> The x86 loop I showed is scalar. There's a vectored one in Geekbench, for x86 only, but I'm not
> sure it's ever run (and if it is, then Saltwell stinks even more than what Geekbench shows).
OK. One thing I forgot to mention: although the code looks very inefficient, the extra instructions don't matter that much on a wide OoO core. There are 9 instructions in the ARM inner loop, so they take 3 cycles to decode, issue and execute on the 3-wide A15. What determines the actual cycles per iteration is the loop-carried latency of vmla. With unrolling (also not enabled - was this compiled with -Os?!?) and vectorization it could probably run 8 times faster, which is why these loops are typically hand optimized...
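To make the unrolling/vectorization point concrete, here is the kind of multiply-accumulate inner loop a filter like this boils down to, plus a NEON intrinsics version that unrolls into two independent 4-wide accumulators to hide the vmla latency. The function names and the assumption that n is a multiple of 8 are mine, purely for illustration - this is not the Geekbench source:

    /* Sketch of a multiply-accumulate inner loop and a hand-vectorized
     * NEON version. Hypothetical code, for illustration only. */
    #include <arm_neon.h>

    /* Scalar version: one multiply-accumulate per element; cycles per
     * iteration are bounded by the loop-carried vmla latency on acc,
     * not by the instruction count. */
    float dot_scalar(const float *a, const float *b, int n)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i];
        return acc;
    }

    /* Vectorized and unrolled: two independent 4-wide accumulator
     * chains keep the pipeline busy. Assumes n is a multiple of 8. */
    float dot_neon(const float *a, const float *b, int n)
    {
        float32x4_t acc0 = vdupq_n_f32(0.0f);
        float32x4_t acc1 = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 8) {
            acc0 = vmlaq_f32(acc0, vld1q_f32(a + i),     vld1q_f32(b + i));
            acc1 = vmlaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
        }
        float32x4_t acc = vaddq_f32(acc0, acc1);
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        return vget_lane_f32(vpadd_f32(s, s), 0);
    }

Two 4-wide chains is where the rough 8x figure comes from: a factor of 4 from vectorization times a factor of about 2 from running independent accumulators instead of stalling on the loop-carried dependency.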
Wilco