By: none (none.delete@this.none.com), October 3, 2015 3:11 am
Room: Moderated Discussions
Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 3, 2015 4:02 am wrote:
> none (none.delete@this.none.com) on October 3, 2015 2:04 am wrote:
> > Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on October 2, 2015 5:06 pm wrote:
> > [...]
> > > GCC does do a lot of function calls. Not sure whether there are performance counters that can count
> > > load vs LDP, but a static count should give a reasonable idea anyway given GCC is not loop heavy.
> >
> > You don't need performance counters on real hardware for this kind of measurement, you can
> > use a fast simulator.
> >
> > On 403.gcc compiled this way:
> > gcc-linaro-4.9-2015.02-3-x86_64_aarch64-linux-gnu/bin/aarch64-linux-gnu-gcc
> > -DSPEC_CPU_LP64 -DSPEC_CPU -Ofast -mcpu=cortex-a57 -static
> >
> > The 9 inputs total ~947B instructions. Among them 201B are loads and 74B are stores.
> > Among these ld/st, ~36B are LDP and ~38B are STP. Most of them are memset/memcpy and
> > function prologues/epilogues.
>
> Note GCC 4.9 doesn't have general LDP/STP enabled, so GCC 5
> or latest trunk will show even more LDP/STP instructions.
Do you mean FSF trunk? I might give it a try then, though if a precompiled one exists
somewhere, that'd help.
> > > For clearing there is a special clear instruction - current cores clear 64-128
> > > bytes per instruction as fast as L1 cache can write back into L2.
> >
> > There are ~22B dc zva in 403.gcc (assuming the instruction clears 64 bytes at a time).
> > gcc loves clearing memory :-)
>
> IIRC the default for fast simulator is 16 bytes. Yes, GCC spends a lot of its time in memset...
I guess you are talking about ARM FVP. That's not what I'm using, I'm running a faster
simulator, QEMU.
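
To put the numbers quoted above in proportion, here is a rough sketch of the arithmetic in Python. The counts are the approximate dynamic figures given for 403.gcc (in billions); the variable and function names are made up for illustration, and the 64-byte and 16-byte block sizes are just the two assumptions mentioned in the thread, not measured values.

```python
# Approximate dynamic instruction counts from the post, in billions.
total = 947    # all instructions over the 9 inputs
loads = 201    # loads (the LDP count is a subset of these)
stores = 74    # stores (the STP count is a subset of these)
ldp = 36       # load-pair instructions
stp = 38       # store-pair instructions
dc_zva = 22    # dc zva instructions


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


mem_share = pct(loads + stores, total)        # loads+stores among all instructions
ldp_share = pct(ldp, loads)                   # LDP among loads
stp_share = pct(stp, stores)                  # STP among stores
pair_share = pct(ldp + stp, loads + stores)   # LDP+STP among all ld/st


def zeroed_bytes(count_billions, block_bytes):
    """Bytes cleared by dc zva for an assumed block size (hypothetical helper)."""
    return count_billions * 1e9 * block_bytes


if __name__ == "__main__":
    print(f"ld/st share of all instructions: {mem_share:.1f}%")
    print(f"LDP share of loads:              {ldp_share:.1f}%")
    print(f"STP share of stores:             {stp_share:.1f}%")
    print(f"LDP+STP share of ld/st:          {pair_share:.1f}%")
    for block in (64, 16):
        tb = zeroed_bytes(dc_zva, block) / 1e12
        print(f"dc zva at {block} bytes/instr:       {tb:.2f} TB cleared")
```

With these inputs, loads and stores come to roughly 29% of all instructions, about 18% of loads are LDP, and about half of stores are STP. The total amount of memory cleared by dc zva differs by 4x between the two block-size assumptions, so the block size the simulator actually uses matters for that estimate.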