By: --- (---.delete@this.redheron.com), June 5, 2022 6:40 pm
Room: Moderated Discussions
Peter Lewis (peter.delete@this.notyahoo.com) on June 5, 2022 4:20 pm wrote:
> > in general, any accelerator which requires an API is going to be substantially
> > less useful than one which can be accessed through the ISA
>
> A huge advantage of an API for something like neural inference is that when more hardware is added
> in future generations, code written for a previous generation will automatically make use of it.
>
> It will be interesting to see the measured power consumption difference for a dedicated
> hardware block like Apple’s Neural Engine compared to neural inference instructions
> added to the CPU core like Intel’s Advanced Matrix Extensions (AMX). A dedicated hardware
> block has to be more power efficient but I don’t know by how much.
>
> The Gaussian and Neural Accelerator 2.0 (GNA 2.0) that you mentioned on Tiger Lake
> is intended for low-power rather than high performance. It only provides 38 Gop/s.
> For comparison, the Neural Engine in Apple’s M1 Ultra provides 22 Top/s and Nvidia’s
> H100 PCIe provides 3200 Top/s when 50% of weights are zero (2:1 sparsity).
>
> Intel’s GNA 2.0 in Tiger Lake uses 38mW. If this is scaled up to the H100 PCIe performance level,
> it would use 3200W, which is almost 10x the power of the H100 PCIe. If people are ever going to
> have something like Nvidia Maxine or Nvidia Riva running on local hardware, instead of the cloud,
> they will need an enormous amount of power-efficient neural inference performance.
>
> youtube.com/watch?v=3GPNsPMqY8o
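For reference, that scaling arithmetic checks out as a quick Python back-of-envelope; the ~350 W H100 PCIe board power is my assumption, everything else is from your numbers:

```python
# Back-of-envelope: scale GNA 2.0 power linearly to H100 PCIe throughput.
# Throughput and GNA power figures are from the post above; the ~350 W
# H100 PCIe TDP is an assumption on my part.

gna_power_w = 0.038        # GNA 2.0 in Tiger Lake: 38 mW
gna_perf_tops = 0.038      # 38 Gop/s = 0.038 Top/s
h100_perf_tops = 3200      # H100 PCIe, INT8 with 2:1 sparsity (figure quoted above)
h100_tdp_w = 350           # H100 PCIe board power (assumed)

scaled_power_w = gna_power_w * (h100_perf_tops / gna_perf_tops)

print(f"GNA scaled to H100 throughput: {scaled_power_w:.0f} W")        # ~3200 W
print(f"Ratio vs. H100 PCIe TDP: {scaled_power_w / h100_tdp_w:.1f}x")  # ~9x
```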
Do you know what sort of performance is required for tasks like these:
(a) real-time translation
(b) real-time translation from speech (as opposed to text)
(c) real-time translation from speech (as opposed to text) with quality voice synthesis?
Could this be done on an M1? Could it be done on an H100? Is it being done in the demo on a warehouse-sized machine?
And is what is currently considered an inference NPU even the appropriate sort of HW, or does it require specializations like transformer hardware?
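To put (b) and (c) in concrete terms, the usual shape of such a system is a cascaded ASR → MT → TTS pipeline, and the real-time question is whether the whole cascade fits inside a conversational latency budget. Here is a rough sketch of that framing; all of the stage names and latency numbers below are placeholders I made up for illustration, not measurements:

```python
# Hypothetical cascade for real-time speech-to-speech translation:
# ASR (speech -> text), MT (text -> text), TTS (text -> speech).
# The stage latencies and the 200 ms budget are illustrative assumptions only.

from dataclasses import dataclass

@dataclass
class StageBudget:
    name: str
    latency_ms: float   # assumed per-chunk inference latency

pipeline = [
    StageBudget("ASR (speech -> text)", 80.0),
    StageBudget("MT  (text -> text)",   50.0),
    StageBudget("TTS (text -> speech)", 70.0),
]

total_ms = sum(stage.latency_ms for stage in pipeline)
budget_ms = 200.0   # rough conversational real-time target (assumption)

print(f"Total pipeline latency: {total_ms:.0f} ms (budget {budget_ms:.0f} ms)")
print("Fits the real-time budget" if total_ms <= budget_ms else "Over budget")
```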