By: Peter Lewis (peter.delete@this.notyahoo.com), June 5, 2022 9:22 pm
Room: Moderated Discussions
>> Intel’s GNA 2.0 in Tiger Lake uses 38mW. If this is scaled up to the H100 PCIe performance level,
>> it would use 3200W, which is almost 10x the power of the H100 PCIe. If people are ever going to
>> have something like Nvidia Maxine or Nvidia Riva running on local hardware, instead of the cloud,
>> they will need an enormous amount of power-efficient neural inference performance.
>>
>> youtube.com/watch?v=3GPNsPMqY8o
>
> Do you know what sort of performance is required for this sort of task
> (a) real-time translation
> (b) real-time translation from speech (as opposed to text)
> (c) real-time translation from speech (as opposed to text) with quality voice synthesis?
>
> Could this be done on an M1? Could it be done on an H100? Is it being done in the demo on a warehouse-sized machine?
> And is what is currently considered an inference NPU even the appropriate sort of HW, or does it require specializations
> like transformer hardware?
You have asked excellent questions, and I don’t know the answer to any of them. The amount of compute required for speech recognition, language translation, and speech synthesis depends on the quality of the results; it generally takes a lot more compute to get a little more quality. Dragon NaturallySpeaking speech recognition (now owned by Microsoft) runs on a PC with no neural net acceleration and is considered usable.
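To make your (a)/(b)/(c) pipeline concrete, here is a rough Python sketch of speech-to-translated-speech using libraries that exist today: openai-whisper for recognition, a Hugging Face translation pipeline, and pyttsx3 for synthesis. The specific model choices and the file name are my own assumptions, and on a plain CPU this is nowhere near real time, which is rather the point about needing efficient inference hardware.

    # Hypothetical sketch: speech -> text -> translated text -> speech.
    # Assumes: pip install openai-whisper transformers pyttsx3 (plus ffmpeg for whisper)
    import whisper                      # OpenAI speech recognition
    from transformers import pipeline   # Hugging Face translation
    import pyttsx3                      # offline text-to-speech

    # (b) speech recognition: bigger models = better quality = far more compute
    asr = whisper.load_model("base")    # "large" is much better and much slower
    text = asr.transcribe("meeting.wav")["text"]

    # (a) text translation
    translate = pipeline("translation_en_to_fr", model="t5-small")
    french = translate(text)[0]["translation_text"]

    # (c) voice synthesis (nowhere near Maxine-demo quality)
    tts = pyttsx3.init()
    tts.say(french)
    tts.runAndWait()

Moving any single stage up in quality (a larger Whisper model, a bigger translation model, neural TTS instead of pyttsx3) multiplies the compute cost, which is the quality/compute trade-off described above.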
The Transformer Engine in the Nvidia H100 is aimed mainly at training transformers rather than inference, so I don’t think it will be required for running Maxine. You can imagine even more advanced systems that adapt to the speaker as they are speaking, which could require training hardware, but I don’t know of anyone doing that yet.
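For what such on-the-fly adaptation might look like, here is a minimal PyTorch sketch, entirely my own construction rather than anything shipping today: the pretrained base model stays frozen and only a tiny per-speaker adapter receives gradient updates as audio chunks arrive, so the "training" burden stays small.

    # Hypothetical sketch of per-speaker adaptation during inference.
    import torch
    import torch.nn as nn

    # Stand-in for a pretrained acoustic model: 80 input features -> 40 classes
    base = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 40))
    for p in base.parameters():
        p.requires_grad = False         # pretrained weights stay fixed

    adapter = nn.Linear(80, 80)         # small per-speaker correction layer
    opt = torch.optim.SGD(adapter.parameters(), lr=1e-3)

    def adapt_step(features, targets):
        """One adaptation step on a chunk of the current speaker's audio."""
        out = base(adapter(features))   # only the adapter is trainable
        loss = nn.functional.cross_entropy(out, targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return out

Because only the adapter's few thousand parameters get gradients, this needs a sliver of the compute that full training does, which is why it might plausibly run on consumer hardware.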
I have no idea how much compute the Maxine demo requires. I would guess the Maxine demo will not be economically feasible if it requires more than one H100 MIG instance (1/7th of an H100) per user. For privacy and security reasons, it should ideally run locally, on a consumer GPU card or a local neural net accelerator. There is an obvious danger that this kind of technology could be used to make convincing-looking videos of people saying things they never said. As is usual with new technology, something dreadful will have to happen before security is taken seriously. The first WiFi had no encryption whatsoever, and it has taken four more tries (WEP, WPA, WPA2, WPA3) to get to where we are today.
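As a sanity check on the MIG economics guess above, here is some back-of-envelope arithmetic. Every dollar figure is my own assumption, not anything Nvidia has published; only the 7-way MIG split is a hardware fact.

    # Hypothetical cost per user: one H100 PCIe split into 7 MIG instances.
    card_cost = 30_000                 # assumed purchase price, USD
    lifetime_hours = 3 * 365 * 24      # amortize over ~3 years of 24/7 use
    mig_instances = 7                  # an H100 supports up to 7 MIG slices

    cost_per_user_hour = card_cost / lifetime_hours / mig_instances
    print(f"${cost_per_user_hour:.3f} per user-hour")   # ~ $0.16

Even ignoring power, cooling, and hosting, roughly 16 cents per user-hour of raw hardware is a lot for a video-call feature, which is why one MIG slice per user looks like the ceiling to me.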