By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), July 30, 2012 6:10 am
Room: Moderated Discussions
aaron spink (aaronspink.delete@this.notearthlink.net) on July 30, 2012 3:03 am wrote:
> none (none.delete@this.none.com) on July 29, 2012 9:05 am wrote:
>> It depends on what you call "simple". daxpy requires
>> 2 LD / 1 ST for 1 FMA. So one C2050 being 515 G FMA/s
>> according to Wikipedia, I'd say it's memory bandwidth limited
>> on daxpy.
>
> That would be the most naive linpack implementation ever.
> With proper data structures you are at much less
> than 1B/flop.
'none' was indicating that there are "simple algorithms" that would be bandwidth-limited even at 50% of peak FLOPS. (Since daxpy is a STREAM component, it should not be surprising that it is bandwidth limited even with low FLOPS efficiency.)
'none' was not suggesting a daxpy-like (i.e., not blocked) implementation of linpack.
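To put rough numbers on the daxpy point, here is a minimal back-of-envelope sketch. The 515 G FMA/s peak is the figure quoted above. The 144 GB/s memory bandwidth is my own illustrative assumption and does not come from this thread, and the function names are made up for the example. It also assumes 8-byte doubles and no cache reuse.

```python
# Back-of-envelope bound for daxpy (y[i] = a*x[i] + y[i]) in double precision.
# This is a sketch under stated assumptions, not a measurement.

BYTES_PER_DOUBLE = 8


def daxpy_bytes_per_fma(loads=2, stores=1):
    """Memory traffic per FMA: load x[i], load y[i], store y[i]."""
    return (loads + stores) * BYTES_PER_DOUBLE


def bandwidth_bound_fma_rate(mem_bandwidth_bytes_per_s):
    """Upper bound on FMA/s when every operand streams from memory."""
    return mem_bandwidth_bytes_per_s / daxpy_bytes_per_fma()


# ASSUMPTION: illustrative memory bandwidth in bytes/s, not taken from the thread.
ASSUMED_BANDWIDTH = 144e9
# Peak rate as quoted in the thread (515 G FMA/s).
QUOTED_PEAK_FMA = 515e9

bound = bandwidth_bound_fma_rate(ASSUMED_BANDWIDTH)
fraction = bound / QUOTED_PEAK_FMA

print(f"{daxpy_bytes_per_fma()} bytes per FMA")
print(f"bandwidth-bound rate: {bound / 1e9:.1f} G FMA/s")
print(f"fraction of quoted peak: {fraction:.1%}")
```

Under those assumptions the bandwidth bound lands far below the quoted peak, so whether the peak figure counts FMAs or FLOPs (a factor of two) does not change the conclusion.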
ISTM that GPGPU would have some useful applications even if it were limited to workstation-scale problems. With the decline of discrete GPU sales, the economics might approach those of GRAPE (which, from the little I have read about it, was a nice architecture for a specific type of problem).
As a side thought, could some of the base station DSPs have non-standard uses? (I am guessing that they support only single precision, but I believe they have fairly high compute density and are perhaps less vulnerable to volume issues.) I would guess that the applicable niches are too small to cover the cost of exploiting them and that the rate of change is too great (and the small volume of alternate uses would keep those uses from being a factor in newer designs).