No loop unrolling on Zen

By: Gian-Carlo Pascutto, July 25, 2019 3:58 am
Room: Moderated Discussions

GCC won't unroll this at all; ICC will unroll by a factor of 2.

Clang is, eh, more interesting. It defaults to unrolling by 4. With -march=skylake, it'll go up to 8. But with -march=znver1, it won't unroll at all. Why?
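For anyone who wants to poke at this, here is a minimal sketch of how the comparison could be set up. The loop is a stand-in reduction, not the loop discussed in this post, and the file name and function names are made up for illustration. The second function uses clang's loop pragma to request an unroll factor explicitly, so its assembly can be compared against whatever the heuristics pick for the first one. Note that clang may also vectorize a loop like this at -O2, which can muddy the picture.

```c
/* unroll_probe.c (hypothetical file name): stand-in loop, NOT the loop
 * from this post. Compile to assembly and compare the output, e.g.:
 *   clang -O2 -S -march=skylake unroll_probe.c
 *   clang -O2 -S -march=znver1 -Rpass=loop-unroll unroll_probe.c
 * -Rpass=loop-unroll asks clang to emit a remark when it unrolls a loop.
 */
#include <stddef.h>
#include <stdint.h>

/* Left entirely to the compiler's own unrolling heuristics. */
int32_t sum_plain(const int32_t *v, size_t n) {
    int32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same loop, but with an explicit unroll request when built with clang.
 * Other compilers skip the pragma and see the plain loop. */
int32_t sum_unrolled(const int32_t *v, size_t n) {
    int32_t s = 0;
#if defined(__clang__)
#pragma clang loop unroll_count(4)
#endif
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

Both functions compute the same result; only the generated code should differ. I haven't checked what a given clang version actually emits for this stand-in, so treat it as a starting point rather than a reproduction.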

We had a theory that this was because Zen 1 has a special feature for loops of 5 instructions or fewer (including macro-op fusion). But the loop is 6 instructions, and fiddling with the code to force some extra instructions doesn't change the behavior. So why wouldn't you want to unroll on Zen?

Some digging through LLVM finds:

Which is set in detail for Intel CPUs, but seems not to be for AMD (despite what the comment says). I'm not sure this is the cause, but it could explain why clang is willing to unroll more on later Intel models.

I can't find any reference to loop caches in Agner's manual, only this:
The processor has an extra cache for decoded instructions. The size is indicated as 2048 μops, with a line size of 8 μops. This is big enough for holding most critical loops.

Does anyone have any other ideas about clang's behavior here?
