By: David Kanter, June 21, 2020 10:32 am
CACM just published a paper by Google's TPUv3 team, which was a really interesting read.

There are some additional microarchitectural details, which I appreciated. It reiterates their emphasis on system-level integration (of the router) and gives some details on the interconnect.

The performance comparisons are also fascinating. Their section on energy efficiency is tricky to interpret because the energy/FLOP for BF16 is so much smaller than for FP64. However, it's intriguing that a single TPU cluster on a real workload exceeds the FLOP/s of even the world's largest supercomputers on Linpack... which is a rubbish workload.

While CNNs like AlphaZero are dense compute, they still have a variety of matrix shapes, which makes them much more challenging than regular Linpack.

Anyway, it's a good read.

