Thanks!

By: Travis Downs (travis.downs.delete@this.gmail.com), January 24, 2020 12:44 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on January 23, 2020 4:33 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on January 23, 2020 12:51 am wrote:
> >
> > Yeah, it is, see this example.
>
> Ouch. Both compilers do some odd stupid things.
>
> clang seems to do much nicer register allocation, and avoids unnecessary move
> instructions. Plus gcc gets so confused about register allocation that it causes
> stack spills, so the end result is just ugly. clang just does much better.

I hadn't even looked at it that carefully, but yeah, gcc does badly here: it is really an indictment of that compiler's (vector?) register allocation.

Despite there being only 21 vector variables in the source, and no obvious need for any temporaries, gcc can't make do with the 32 registers available to it. There are lots of hard register allocation problems, but this doesn't seem to be one: even a very stupid allocator shouldn't run out of registers.
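To make the "very stupid allocator" point concrete, here is a minimal sketch of about the dumbest scheme possible: every variable gets its own register for the whole function, with no liveness analysis at all. The function name and the variable names are made up for illustration, and this is not how gcc or clang actually allocate registers.

```python
def naive_allocate(variables, num_registers=32):
    """Give every variable its own register; return None if we run out.

    Hypothetical helper for illustration only. No liveness analysis,
    no reuse: the worst plausible allocator that still avoids spills.
    """
    if len(variables) > num_registers:
        return None  # a real allocator would have to spill here
    return {name: f"zmm{i}" for i, name in enumerate(variables)}


# 21 vector variables, matching the count in the source under discussion
# (the names v0..v20 are placeholders, not the names in the example).
variables = [f"v{i}" for i in range(21)]
assignment = naive_allocate(variables)
spare = 32 - len(assignment)  # registers left over for temporaries
```

Even this scheme leaves registers to spare with 21 variables and 32 registers, which is why the spills look like an allocator problem rather than a hard instance.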

>
> Maybe the code gcc generates is equally fast (maybe it's not decode limited and the moves
> just turn to renames and the stack spills end up scheduling fine), but it just looks bad.

At least in the L1-hit scenario this code will be load throughput limited, so the extra stack loads gcc is doing will certainly hurt it (at least until ICL and memory renaming). In the case of misses you generally want as few redundant instructions as possible so you can fit as much useful work (especially: triggering other misses) in the OoO window.
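A back-of-the-envelope sketch of the load-throughput argument follows. It assumes a core that sustains 2 loads per cycle and that loads are the only bottleneck. The load counts are invented for illustration and are not taken from the actual gcc or clang output.

```python
import math


def load_bound_cycles(source_loads, spill_reloads, loads_per_cycle=2):
    """Lower bound on cycles per iteration if only load throughput matters.

    loads_per_cycle=2 is an assumption about the core; the counts passed
    in below are hypothetical, not measured from the compiler output.
    """
    return math.ceil((source_loads + spill_reloads) / loads_per_cycle)


clean = load_bound_cycles(source_loads=20, spill_reloads=0)
spilly = load_bound_cycles(source_loads=20, spill_reloads=4)
slowdown = spilly / clean - 1  # about 20% with these made-up counts
```

The point is only that, when loads are the bottleneck, every spill reload adds directly to the bound, so a handful of them is enough for a double-digit slowdown.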

So I am guessing this code will probably be a double-digit percentage slower in practice. Not that you'd write code like this: it was just my lazy way to force the compiler to use more than 16 registers.

>
> But then gcc handles that "nothing to sum" case so much better than clang,
> noticing that it's just zero, while the clang code there is just silly ("let's
> explicitly zero all these registers so that we can add them up").
>
> No idea whether the different init sequences (gcc: "zero one register, then move it to the others",
> clang: "zero all registers with xor") are better or worse, might depend on just uarch details.

It also caught my eye. I believe they will run at the same performance on modern Intel and AMD. In principle there can be some hardware limits to move elimination (e.g., how many in one cycle, how many total references a single register can have), but they don't seem to crop up often, and I don't think anyone has really measured it.

Clang wins on code size here, as 8 of the vpxor are 4 bytes (VEX), and the rest are a mix of 5 bytes (VEX, but xmm8 to 15) and 6 bytes (EVEX), while gcc uses 6 bytes for everything. That is mostly just a consequence of gcc choosing vmovdqa64 (EVEX) for everything instead of vmovdqa (VEX), and also of gcc using more high registers (itself a consequence of its poor register allocation).

Apples to apples, I think vpxor and vmovdqa are the same size. Maybe you can save a byte with the floating-point ps versions.
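The byte counts above can be reproduced with a small length model. This is a simplified sketch, not an encoder: it covers only register-to-register vector ops in the 0F opcode map with W=0, and it ignores tricks such as swapping operands or using the store form of a move to avoid the longer prefix. The function name and parameters are made up for illustration.

```python
def reg_reg_length(dest, src1, src2, vector_bits=128, evex_only=False):
    """Byte length of a reg-reg vector op in the 0F opcode map (W=0).

    Operand placement assumed: dest -> ModRM.reg, src1 -> vvvv,
    src2 -> ModRM.rm. Arguments are register numbers 0..31.
    evex_only marks mnemonics that only exist with EVEX (e.g. vmovdqa64).
    """
    needs_evex = (evex_only or vector_bits == 512
                  or max(dest, src1, src2) >= 16)
    if needs_evex:
        prefix = 4  # EVEX prefix is always 4 bytes
    elif src2 >= 8:
        prefix = 3  # the B bit is only available in the 3-byte VEX form
    else:
        prefix = 2  # 2-byte VEX covers the R bit and vvvv
    return prefix + 1 + 1  # prefix + opcode byte + ModRM byte


vpxor_low = reg_reg_length(0, 0, 0)        # vpxor xmm0, xmm0, xmm0
vpxor_mid = reg_reg_length(8, 8, 8)        # vpxor xmm8, xmm8, xmm8
vpxord_high = reg_reg_length(16, 16, 16)   # vpxord xmm16, xmm16, xmm16
vmovdqa_low = reg_reg_length(1, 0, 0)      # vmovdqa xmm1, xmm0 (vvvv unused)
vmovdqa64_low = reg_reg_length(1, 0, 0, evex_only=True)  # vmovdqa64 xmm1, xmm0
```

Under this model vpxor and vmovdqa on low registers come out the same length, and anything forced into EVEX costs 6 bytes regardless of which registers it uses.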
