Article: AMD's Mobile Strategy
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), December 22, 2011 5:44 am
Room: Moderated Discussions
gallier2 (gallier2@gmx.de) on 12/22/11 wrote:
---------------------------
>Exophase (exophase@gmail.com) on 12/21/11 wrote:
>---------------------------
>>gallier2 (gallier2@gmx.de) on 12/21/11 wrote:
>>---------------------------
>>>You should, it has nothing to do with hand-generated assembler. Every string literal
>>>in a C program is a global variable and normal C programs are littered with strings.
>>>That was for instance what killed me on SPARC when going from 32-bit to 64-bit binaries.
>>>Every address load went from 2 to 5 instructions and it used 3 instead of 2 registers
>>>which had downwind pessimization effects on register allocation.
>>
>>But with archs with PC-relative addressing like ARM strings are just kept near
>>the functions where they're used and pointers to them are usually generated in one instruction.
>>
>>Taking five instructions to generate them is weird, did they really have to be
>>stored in a section so far from 0? But I guess SPARC should have added PC-relative addressing.
>
>Here are 2 lines of C
>
>char *p = getenv("TIMER_ENABLED");
>timer_enabled = p ? (*p == '2' ? 2 : 1) : 0;
>
>timer_enabled is a global variable of type bool (8 bits wide with C99 semantics).
>
>That's what gcc 3.4.6 for UltraSPARC III with -O3 generates
>
>.section ".rodata"
>.align 8
>.LLC29:
>.asciz "TIMER_ENABLED"
>
>.section ".text"
>.align 4
>.align 32
>.global main
>.type main, #function
>.proc 04
>main:
>!#PROLOGUE# 0
>save %sp, -416, %sp
>!#PROLOGUE# 1
>
>sethi %hh(.LLC29), %o5 !, tmp119
>sethi %lm(.LLC29), %o3 !, tmp120
>or %o5, %hm(.LLC29), %o4 ! tmp119,, tmp121
>sethi %hh(timer_enabled), %l0 !, tmp1373
>sllx %o4, 32, %o2 ! tmp121,, tmp122
>sethi %lm(timer_enabled), %l1 !, tmp1374
>add %o2, %o3, %o0 ! tmp122, tmp120, tmp123
>call getenv, 0 !,
>or %o0, %lo(.LLC29), %o0 ! tmp123,,
>or %l0, %hm(timer_enabled), %o1 ! tmp1373,, tmp127
>mov 0, %g4 !, tmp131
>movrne %o0, 1, %g4 !,,, tmp131
>sllx %o1, 32, %g5 ! tmp127,, tmp128
>add %g5, %l1, %g1 ! tmp128, tmp1374, tmp129
>cmp %g4, 0 ! tmp131,
>bne,pn %icc, .LL244 !
>stb %g4, [%g1+%lo(timer_enabled)] ! tmp131, timer_enabled
>
>
>It's a bit difficult to read because the address generation for the two addresses is interleaved due to scheduling.
>
>Here's what gcc 4.3.3 -O3 generates using the SUNstudio backend for sparc64vii
>
>.L900000678:
>call .+8
>or %g0,%o7,%o7
>sethi %pc22(_GLOBAL_OFFSET_TABLE_-(.L900000678-.)),%l5
>sethi %gdop_hix22(.LLC30),%i5
>add %l5,%pc10(_GLOBAL_OFFSET_TABLE_-(.L900000678-.)),%l5
>xor %i5,%gdop_lox10(.LLC30),%l2
>add %l5,%o7,%i4
>ldx [%i4+%l2],%o7,%gdop(.LLC30)
>add %o7,-112,%o0
>add %o7,6,%l2
>call getenv ! params = %o0 ! Result = %o0
>nop
>sethi %gdop_hix22(timer_enabled),%o5
>xor %o5,%gdop_lox10(timer_enabled),%o4
>or %g0,1,%o2
>ldx [%i4+%o4],%o3,%gdop(timer_enabled)
>brz,a,pt %o0,.L900000677
>or %g0,0,%o2
>
>.L900000677:
>andcc %o2,255,%g0
>be,pn %icc,.L77000734
>stb %o2,[%o3]
>
>Which, I have to admit, I don't really understand.
>
>On the runtime side, and that's where the kicker is, the old gcc 3 code is a bit
>faster in general for real applications, especially when calling shared libraries
>a lot. In every µ-benchmark I have tried so far, where I isolate a function and time it as
>best I can, the gcc 4 code wins hands down. But with real code that does
>the real work (translation memory retrieval and file conversion) the gcc 3 code
>is faster. That's the reason why we haven't switched yet. I use gcc 4 with its better
>frontend (and to a lesser degree SUNstudio) to get better diagnostics of the source,
>but production remains on gcc 3.
It's indeed not easy to read, but it is obvious that both compilers generate ridiculously inefficient code. GCC generates full 64-bit addresses for both accesses, taking a whopping 6 instructions to access a global or take its address! The Sun Studio compiler uses a call to fake PC-relative addressing and spends 5 instructions creating a 32-bit PC-relative base pointer. It then applies another full 32-bit offset from this base (why?) to load a 64-bit address constant holding the global's address, needing 3 instructions to generate the final address or 4 to actually access a global. Could you possibly do it in a less efficient way?
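To make the 6-instruction count concrete, here is a rough C model of what the GCC sequence computes. It assumes the usual SPARC V9 split of an address into %hh (bits 63..42), %hm (41..32), %lm (31..10) and %lo (9..0); the function names are mine, not anything from a real toolchain. The 32-bit variant shows the 2-instruction sethi/or pair for comparison.

```c
#include <stdint.h>

/* Assumed SPARC V9 relocation operators on a 64-bit address. */
static uint64_t hh(uint64_t a) { return a >> 42; }               /* bits 63..42 */
static uint64_t hm(uint64_t a) { return (a >> 32) & 0x3ff; }     /* bits 41..32 */
static uint64_t lm(uint64_t a) { return (a >> 10) & 0x3fffff; }  /* bits 31..10 */
static uint64_t lo(uint64_t a) { return a & 0x3ff; }             /* bits 9..0  */

/* sethi writes a 22-bit immediate into bits 31..10 and zeroes the rest. */
static uint64_t sethi(uint64_t imm22) { return (imm22 & 0x3fffff) << 10; }

/* Hypothetical model of the six-instruction absolute 64-bit sequence. */
uint64_t materialize64(uint64_t addr)
{
    uint64_t o5 = sethi(hh(addr));   /* sethi %hh(sym), %o5     */
    uint64_t o3 = sethi(lm(addr));   /* sethi %lm(sym), %o3     */
    uint64_t o4 = o5 | hm(addr);     /* or    %o5, %hm(sym), %o4 */
    uint64_t o2 = o4 << 32;          /* sllx  %o4, 32, %o2      */
    uint64_t o0 = o2 + o3;           /* add   %o2, %o3, %o0     */
    return o0 | lo(addr);            /* or    %o0, %lo(sym), %o0 */
}

/* Hypothetical model of the two-instruction 32-bit sequence. */
uint64_t materialize32(uint64_t addr)
{
    uint64_t r = sethi(addr >> 10);  /* sethi %hi(sym), %r      */
    return r | lo(addr);             /* or    %r, %lo(sym), %r  */
}
```

The 64-bit version reproduces any address exactly, while the 32-bit version only keeps the low 32 bits, which is all you need if code and data live below 4GB.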
This is the kind of stuff that gives RISC a bad name. It is now no surprise to me that SPARC is so slow; everything else is likely just as inefficient. You do not need more than 4GB of global variables, so at most a 32-bit immediate (PC-relative or absolute) is needed to access any global. Even better, reserve one of the 32 registers as a global pointer so that up to 8KB of globals can be accessed directly with a single load/store. For DLLs/shared libraries you need something more elaborate of course, but that's fine.
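The 8KB figure follows from the load/store immediate: a rough sketch, assuming SPARC's 13-bit signed displacement (simm13) and a hypothetical global pointer register aimed at the middle of the small-data area. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed range of a 13-bit signed load/store displacement. */
enum { SIMM13_MIN = -4096, SIMM13_MAX = 4095 };

/* True if a global at gp + off is reachable with a single load/store. */
bool fits_simm13(int64_t off)
{
    return off >= SIMM13_MIN && off <= SIMM13_MAX;
}

/* Number of bytes addressable around the global pointer. */
int64_t gp_window_bytes(void)
{
    return (int64_t)SIMM13_MAX - SIMM13_MIN + 1;
}
```

Anything outside that window still needs a longer sequence, so the linker would have to place the hottest small globals inside it.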
Wilco