Apple GPU reverse engineering

By: Jeff S., February 16, 2021 11:19 am
Room: Moderated Discussions
Chester on February 14, 2021 2:26 pm wrote:
> David Kanter on February 14, 2021 10:50 am wrote:
> > Someone has started doing some good RE work on Apple's GPUs, I thought I'd share it here:
> >
> > A few comments on things I find interesting:
> > 1. Unusual control flow mechanism compared to other GPUs
> How's it unusual? AMD GCN/RDNA also use execution masks to handle conditional execution
> and divergence. Looks like Apple requires an explicit jmp_exec_none instruction to skip
> a zero exec mask while GCN/RDNA do that in hardware, but the idea's pretty similar.

The oddest thing I caught when I skimmed this was the claim that the exec mask is stored in a vector register, not a scalar/uniform one. That would definitely be an odd wrinkle if correct. I think it's important to remember that whether exec masks are architectural or microarchitectural doesn't matter much in a practical sense, although keeping them microarchitectural has allowed Nvidia to maintain claims about their architectures "not being SIMD". It wasn't until Volta that inter-lane deadlock was actually addressed via further microarchitectural state tracking/handling, but there are still just masked SIMD exec units under the covers for the most part.
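To make the masked-execution point concrete, here is a minimal Python sketch of how an if/else runs over SIMD lanes with an exec mask. It is a toy model, not Apple's, AMD's, or Nvidia's actual mechanism: the function name run_if_else and the per-lane Python lists are purely illustrative. The "skip the block when no lane is active" check plays the role that an explicit jmp_exec_none-style instruction, or equivalent hardware, would play.

```python
def run_if_else(values, cond, then_fn, else_fn):
    """Run an if/else over all lanes using execution masks (toy model)."""
    exec_mask = [True] * len(values)  # all lanes active on entry
    out = list(values)
    executed = []                     # which blocks were actually issued

    # Split the incoming mask into the lanes taking each side of the branch.
    then_mask = [m and bool(cond(v)) for m, v in zip(exec_mask, values)]
    else_mask = [m and not t for m, t in zip(exec_mask, then_mask)]

    for name, mask, fn in (("then", then_mask, then_fn),
                           ("else", else_mask, else_fn)):
        if not any(mask):
            # No active lane: jump over the block instead of issuing it.
            continue
        executed.append(name)
        for lane, active in enumerate(mask):
            if active:                # inactive lanes keep their old value
                out[lane] = fn(values[lane])
    return out, executed
```

With divergent input such as [1, 2, 3, 4] and an "is even" condition, both blocks get issued with complementary masks. With uniform input such as [2, 4], the else block is skipped entirely.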

> > 2. Register cache is interesting, I wonder how much is SW hinted or controlled?
> Sounds like Nvidia's register reuse cache. From your link, Apple's is SW controlled with a "cache" hint to
> tell HW to save a register value for reuse, and a "discard" hint to invalidate a cache register value.
> Nvidia also uses software hints in control flow information on Maxwell and newer GPUs to
> indicate when to cache a register value for reuse. They don't have a discard hint though.

The register caching definitely deserves further scrutiny, especially if the source caching is less "positional" than NV's. The traditional GPU source operand flow is:
independent RF read ports -> crossbar -> per-operand collector -> execution -> forwarding bypass back to crossbar/RF writeback

The point of the crossbar and source operand collectors in general is to reduce RF read port conflict stalls, rather than, say, trying to make the RF a triple-read-ported SRAM. Programmatically controlled operand collectors can further reduce read pressure when re-reads of a register may happen arbitrarily many instructions apart, since the cached value is then held explicitly rather than being subject to LRU eviction or a strict ring queue.
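As a rough illustration of the read port conflict problem, the sketch below counts how many cycles it takes to gather an instruction's source operands from a banked RF where each bank has a single read port. The bank mapping (register number modulo bank count), the default of four banks, and the name read_cycles are assumptions for illustration only, not a description of any real GPU.

```python
from collections import Counter

def read_cycles(src_regs, num_banks=4):
    """Cycles needed to read all sources when each bank has one read port."""
    # A register named twice by one instruction only needs a single read.
    unique_regs = set(src_regs)
    # Assumed mapping: register r lives in bank r % num_banks.
    reads_per_bank = Counter(reg % num_banks for reg in unique_regs)
    # Banks work in parallel, so the busiest bank sets the cycle count.
    return max(reads_per_bank.values(), default=0)
```

Sources r0, r1, r2 land in three different banks and are read in one cycle, while r0, r4, r8 all hit bank 0 and take three. Collectors let the hardware absorb that skew by gathering operands for several instructions at once instead of stalling issue.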

The operand reuse cache (at least for NV) just keeps an extra slot or two in each source operand collector, which means:
  • A cached value has to be re-read by the same operand position in a subsequent instruction to be useful. I.e., fma sum, fact1, fact2, addend can't cache fact1 for use by a subsequent fma sum2, fact3, fact1, addend2, since it flows through distinct datapaths to the execution unit.
  • Switching wavefronts clears the cache, since there are physically only ~2 slots, not many tens, per operand position.
The existence of a reuse flag on the destination register operand in the Apple architecture might suggest a more generalized operand collector system, possibly with a crossbar between a collector and exec units. Position-independent source operand register caching would circumstantially support that hypothesis.
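The difference between positional and position-independent caching can be sketched with a toy model that counts RF reads for the fma pair above. Everything here is hypothetical: the function count_rf_reads, the FIFO eviction, the assumption that every source is flagged for reuse, and giving the shared cache the same total slot count as the positional one are simplifications, not how either vendor's hardware is known to work.

```python
def count_rf_reads(instrs, positional, slots=2):
    """Count RF reads for a stream of source-operand tuples (toy model)."""
    caches = {}    # cache key -> list of cached register names, oldest first
    rf_reads = 0
    for srcs in instrs:
        for pos, reg in enumerate(srcs):
            # Positional: one small cache per operand slot.
            # Position-independent: a single cache shared by all slots.
            key = pos if positional else "shared"
            cache = caches.setdefault(key, [])
            if reg in cache:
                continue              # reuse hit, no RF read needed
            rf_reads += 1
            cache.append(reg)
            capacity = slots if positional else slots * len(srcs)
            if len(cache) > capacity:
                cache.pop(0)          # assumed FIFO eviction
    return rf_reads

# fma sum, fact1, fact2, addend ; fma sum2, fact3, fact1, addend2
FMA_PAIR = [("fact1", "fact2", "addend"), ("fact3", "fact1", "addend2")]
```

In this model count_rf_reads(FMA_PAIR, positional=True) needs all six reads, because fact1 moved from the first to the second operand position, while count_rf_reads(FMA_PAIR, positional=False) needs only five.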

All this said, note that reg reuse caches are just one of many ways of fighting RF read starvation. Scalar register datapaths to operand collectors, and dual 2-read-ported RFs instead of quad single-ported ones, are probably more broadly applicable mechanisms than software-controlled reuse.

Also, I'm skeptical that the cache discard instruction can actually result in data loss/UB, e.g. by destroying a bypass-forwarded operand and somehow letting a stale value in.