Fermi and Gen9 on GB4 HistEq

By: Chester (lamchester.delete@this.gmail.com), January 1, 2020 7:01 pm
Room: Moderated Discussions
The other thread about Gen graphics got me curious. I'm not going to compare them directly because they're different architectures with different programming models (CUDA vs OpenCL), but I still find it interesting to look at how these microarchitectures perform. So let's talk about GPU architectures :)

I'm using the GF108, with SMs similar to the ones described in David Kanter's article, except fatter. Each of the SM's two schedulers can dual issue to reach a max of 4 instructions issued per clock, instead of just two. Also, there are 48 (3x16) FP32 ALUs in each SM, instead of just 32.

When running the CUDA version of GB4's HistEq, those SMs get:

  • 1.08 IPC issued

  • 1.06 IPC executed (slightly lower than issued because instructions get replayed on cache misses, bank conflicts, and the like)

  • 40.5% L1 hitrate, 30.7% L2 hitrate

  • 47.1% achieved occupancy (active warps / max warps)

  • 58.2% warp execution efficiency (active threads per warp / max threads per warp)

  • "Mid (4)" to "Low (2)" ALU utilization for the hottest kernels
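
For a rough sense of scale, here's a quick back-of-the-envelope sketch in Python that puts the numbers above next to the SM's peak issue rate. The 4 instructions/clock peak and the IPC, occupancy, and efficiency figures come from this post. The 48 resident warps per SM and the 32-thread warp size are not from the profiler output; they're the limits I'm assuming for compute capability 2.x. The function names are just made up for illustration.

```python
# Back-of-the-envelope numbers for one GF108 SM running HistEq.
# Values marked "from the post" are the nvprof figures quoted above.

SCHEDULERS_PER_SM = 2            # from the post
ISSUE_WIDTH_PER_SCHEDULER = 2    # from the post (dual issue)
PEAK_ISSUE_PER_CLOCK = SCHEDULERS_PER_SM * ISSUE_WIDTH_PER_SCHEDULER

MAX_WARPS_PER_SM = 48            # assumption: compute capability 2.x limit
WARP_SIZE = 32                   # assumption: standard CUDA warp size


def issue_slot_utilization(ipc, peak=PEAK_ISSUE_PER_CLOCK):
    """Fraction of the SM's peak issue rate actually used."""
    return ipc / peak


def replay_fraction(issued_ipc, executed_ipc):
    """Share of issued instructions that were replays (issued but not new work)."""
    return (issued_ipc - executed_ipc) / issued_ipc


ipc_issued = 1.08                # from the post
ipc_executed = 1.06              # from the post
achieved_occupancy = 0.471       # from the post: active warps / max warps
warp_exec_efficiency = 0.582     # from the post: active threads / max threads

avg_resident_warps = achieved_occupancy * MAX_WARPS_PER_SM
avg_active_threads = warp_exec_efficiency * WARP_SIZE

print(f"issue slots used:      {issue_slot_utilization(ipc_issued):.1%}")
print(f"replayed instructions: {replay_fraction(ipc_issued, ipc_executed):.1%}")
print(f"avg resident warps:    {avg_resident_warps:.1f} of {MAX_WARPS_PER_SM}")
print(f"avg active threads:    {avg_active_threads:.1f} of {WARP_SIZE} per warp")
```

So roughly a quarter of the issue slots get used, replays account for only about 2% of issued instructions, and on average a bit under 19 of the 32 lanes in a warp are doing useful work.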

NVProf also exposes counters for stalls and instruction issue:
[Figure: nvprof stall and issue counters. Fermi is not dual issuing much, and is stalling on instruction fetch and execution dependencies.]

I wonder what's behind the instruction fetch stalls. The CPU version of GB4's HistEq benchmark gets 99% icache hitrates on every CPU I've tried it on, but GPUs are different of course. Also, only ~6% of instructions were dual issued. Maybe that's where execution dependency stalls come in.

Now, the Gen9 iGPU in Coffee Lake. The OpenCL code compiled to SIMD32 kernels.
[Figure: VTune summary for Intel Gen9 on GB4 HistEq. 82.8% occupancy, 53% EU array active, 46% stalled, 0.5% idle, both FPUs active 22%.]

VTune says 1.45 IPC, but I think it's only counting non-stalled cycles. The 22% "both FPUs active" figure is the share of cycles, averaged across all EUs, in which both FPUs were active.
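
If that guess is right, the VTune figures can be rescaled to all cycles. This is only a sketch under that assumption: I don't know for sure that VTune's IPC is measured per active cycle, and the variable names are mine. All inputs are the Gen9 numbers quoted above.

```python
# Rescale VTune's Gen9 numbers, ASSUMING its IPC only counts active
# (non-stalled, non-idle) EU cycles. That assumption is a guess.

eu_active = 0.53          # EU array active
eu_stalled = 0.46         # EU array stalled
eu_idle = 0.005           # EU array idle
both_fpus_active = 0.22   # share of all cycles with both FPUs busy
vtune_ipc = 1.45          # IPC as reported by VTune

# The three EU states should cover nearly all cycles.
accounted = eu_active + eu_stalled + eu_idle

# IPC over all cycles, if 1.45 is really per active cycle.
ipc_over_all_cycles = vtune_ipc * eu_active

# Of the cycles where an EU is active, how often are both FPUs busy?
both_fpu_share_of_active = both_fpus_active / eu_active

print(f"cycles accounted for:          {accounted:.1%}")
print(f"IPC over all cycles (assumed): {ipc_over_all_cycles:.2f}")
print(f"both FPUs busy, active cycles: {both_fpu_share_of_active:.1%}")
```

Under that assumption the EUs average about 0.77 instructions per clock over all cycles, and both FPUs are busy in roughly 42% of the cycles where the EU is active at all.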

Memory hierarchy (thanks vtune):
[Figure: VTune diagram of the Gen9 memory hierarchy]

VTune doesn't break down stall reasons like nvprof. From the high sampler utilization, the EUs might be waiting on memory.

At least in this example, it seems like Intel's able to make better use of execution units. Gen9 is able to dual dispatch more often. Even when it can't, it's losing less throughput (just one SIMD4 port, not leaving a stack of 16 ALUs inactive).

Finally, I'm not too familiar with either architecture, so feel free to correct me if I got things wrong.