By: David Kanter (dkanter.delete@this.realworldtech.com), October 8, 2022 10:16 pm
Room: Moderated Discussions
--- (---.delete@this.redheron.com) on October 7, 2022 2:40 pm wrote:
> Is there yet any sort of movement for GP-NPU (analogous to GP-GPU, general purpose compute on an NPU)?
> I can't speak for other designs (and there seem to be no good overviews or even consensus yet as to what
> these things should look like) but the Apple one appears to be essentially/primarily a convolution engine.
> Define an (essentially fixed) set of weights, run a stream of data against those weights, and accumulate
> sums. What we get in HW is a whole lot of (low-precision) MACs linked to some (wider-precision) accumulators,
> some specialized storage (weights plus input stream buffer) and some HW-assisted addressing.
>
> Point is, you get much less than even what GPUs offered when GP-GPU began.
>
> So, can we do anything interesting with this that's not actually NPU related?
> The most obvious possibility that struck me was random number related stuff; you can hook up these things
> to act as LFSRs and (perhaps) generate lots of (few-bit, adequate-quality?) random numbers per cycle,
> then either concatenate them to generate streams of uniformly distributed multi-bit integers, or
> add 6 or 12 or so of them to generate (adequate?) Gaussian values. I don't care about adversarial
> security stuff; I'm more interested in "good enough" for various types of physics work.
>
> Presumably (as was done in the early days of BrookGPU) you would have to fake this by creating
> a neural net in TensorFlow or equivalent that used the available convolution options to perform
> an LFSR (or a more appropriate RNG) on each element of an array of input data, pooled the results together
> (if summing for a Gaussian), and dumped out a similar array of "randoms".
> This may not seem like much, but if you could have it running in parallel with, say, a large Monte Carlo
> integration, as the very first stage of generating a constant stream of uniform or Gaussian randoms before
> we condition them to fit a process, maybe there is some value there: the ability to double the speed or more?
>
>
> Anyway, point is, has anyone heard any sort of mutterings along these lines? Or is everyone, even academics,
> still so excited by what new things can be done on GPUs that no one has even started thinking about NPUs?
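As an aside, the "add 6 or 12 of them" trick above is just the central limit theorem at work: twelve uniforms on [0,1) have mean 6 and total variance 12 * (1/12) = 1, so summing twelve and subtracting 6 approximates a standard normal. Below is a rough CPU-side sketch of that dataflow, using the textbook maximal-length 16-bit Galois LFSR (far too little state for serious Monte Carlo work; purely illustrative). The connection to a convolution engine is that one LFSR step is a fixed bit-matrix multiply over GF(2), i.e. a multiply-accumulate against 0/1 weights with the accumulated sums reduced mod 2:

```c
#include <stdint.h>
#include <stdio.h>

// One step of the textbook 16-bit maximal-length Galois LFSR
// (taps 16,14,13,11 -> mask 0xB400). In matrix form this is a fixed
// 16x16 bit-matrix multiply over GF(2), which is what would let a
// MAC array imitate it.
static uint16_t lfsr_step(uint16_t s) {
    return (uint16_t)((s >> 1) ^ ((s & 1u) ? 0xB400u : 0u));
}

// One uniform-ish sample on [0,1) per step. 16 bits of state is
// nowhere near enough for real work; this just shows the pipeline.
static double lfsr_uniform(uint16_t *s) {
    *s = lfsr_step(*s);
    return *s / 65536.0;
}

int main(void) {
    uint16_t state = 0xACE1u;  /* any nonzero seed */
    for (int n = 0; n < 4; n++) {
        // Central limit theorem: 12 uniforms have mean 6 and
        // variance 1, so the shifted sum is approximately N(0,1).
        double g = -6.0;
        for (int i = 0; i < 12; i++)
            g += lfsr_uniform(&state);
        printf("approx gaussian: % .4f\n", g);
    }
    return 0;
}
```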
There are few things that are *new* that you can do on a GPU (aside from some of the new instructions for Smith-Waterman and some of the shared-memory and barrier stuff). You can just do things at higher throughput than is typically feasible on a CPU. The whole GP idea is about taking the 'core' functions of a GPU, making them more easily accessible, and extending the supported functions to include more CPU-like attributes.
The only thing I could see really applying here is enabling lower-overhead interactions between an NPU and a CPU: e.g., instead of having to write out memory via DMA, having very low latency and tight integration. That could mean having the NPU operate on the CPU's address space with paging (which sounds expensive to implement in an NPU), or having the NPU operate on some low-latency scratchpad memory close to the core (which sounds like SPR's TMUL stuff).
If you could move data from a CPU program to the NPU in ~10 cycles, that would potentially open up some interesting possibilities.
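For a concrete sense of what "close to the core" already looks like, here is a minimal sketch using the AMX/TMUL tile intrinsics on Sapphire Rapids (assuming Linux 5.16+ and gcc with -mamx-tile -mamx-int8; the dimensions and data are arbitrary examples, not a recommended layout). The point is that the matrix unit sits behind ordinary loads and stores: tiles are filled straight from cache, with no DMA or driver round-trip in the loop:

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

// 64-byte tile configuration block (palette 1 layout).
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];   // bytes per row, per tile register
    uint8_t  rows[16];    // rows per tile register
} __attribute__((packed)) tile_config_t;

int main(void) {
    // Linux gates the large AMX register state behind an explicit opt-in.
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        fprintf(stderr, "AMX not available\n");
        return 1;
    }

    // tmm0 = int32 accumulator (16x16), tmm1/tmm2 = int8 inputs (16x64 bytes).
    tile_config_t cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;
    cfg.rows[1] = 16; cfg.colsb[1] = 64;
    cfg.rows[2] = 16; cfg.colsb[2] = 64;
    _tile_loadconfig(&cfg);

    int8_t  a[16][64], b[16][64];
    int32_t c[16][16];
    memset(a, 1, sizeof a);   // arbitrary example data
    memset(b, 2, sizeof b);
    memset(c, 0, sizeof c);

    _tile_loadd(1, a, 64);    // plain loads from cache into tile registers
    _tile_loadd(2, b, 64);
    _tile_loadd(0, c, 64);
    _tile_dpbssd(0, 1, 2);    // tmm0 += tmm1 * tmm2 (int8 dot products)
    _tile_stored(0, c, 64);   // result lands back in ordinary memory
    _tile_release();

    printf("c[0][0] = %d (expect 128)\n", c[0][0]);  // 64 * (1*2) = 128
    return 0;
}
```

An NPU with that kind of interface, plus enough programmability to express something like the LFSR matrix step above, starts to look like a GP-NPU target.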
David