By: EduardoS (no.delete@this.spam.com), July 29, 2012 10:26 am
Room: Moderated Discussions
none (none.delete@this.none.com) on July 29, 2012 9:05 am wrote:
> It depends on
> what you call "simple". daxpy requires 2 LD / 1 ST for 1 FMA. So one C2050
> being 515 G FMA/s according to Wikipedia, I'd say it's memory bandwidth limited
> on daxpy.
But daxpy is usually (always?) only part of the computation. When you add the other parts, its LD/ST can be merged with the LD/ST from other steps, reducing the bandwidth requirements.
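To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the 515 G FMA/s figure as quoted above. The 144 GB/s memory bandwidth for the C2050 is my assumption. The fusion model is also hypothetical: k updates of the same y share a single load and a single store of y, and each step still loads its own x.

```python
BYTES_PER_DOUBLE = 8

def daxpy_bytes_per_fma(fused_steps=1):
    """Bytes moved per FMA when `fused_steps` updates share one LD and one ST of y.

    Hypothetical model: each fused step still loads its own x element,
    so k steps cost k loads of x + 1 load of y + 1 store of y.
    With fused_steps=1 this is plain daxpy: 2 LD + 1 ST per FMA.
    """
    memory_ops = fused_steps + 2
    return memory_ops * BYTES_PER_DOUBLE / fused_steps

def bandwidth_bound_fma_rate(bandwidth_bytes_per_s, fused_steps=1):
    """Upper bound on FMA/s imposed by memory bandwidth alone."""
    return bandwidth_bytes_per_s / daxpy_bytes_per_fma(fused_steps)

PEAK_FMA_RATE = 515e9  # figure quoted above, taken as-is
MEM_BANDWIDTH = 144e9  # assumed C2050 peak memory bandwidth, bytes/s

for k in (1, 4, 16):
    rate = bandwidth_bound_fma_rate(MEM_BANDWIDTH, k)
    print(f"fused steps={k:2d}: {daxpy_bytes_per_fma(k):5.1f} B/FMA, "
          f"{rate / 1e9:5.1f} G FMA/s, {100 * rate / PEAK_FMA_RATE:4.1f}% of peak")
```

Under these assumptions plain daxpy moves 24 bytes per FMA, which caps it at about 6 G FMA/s, a bit over 1% of the quoted peak. Merging the LD/ST across steps pushes the cost toward 8 bytes per FMA in this model, so it helps but does not by itself make the kernel compute bound.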
Of course, when aaron says GPUs struggle to get 50% of peak in Linpack, he is talking about large clusters multiplying large matrices, where PCIe is the bottleneck, not about a single GPU doing small matrices, where any teenager can pass the 80% of peak mark.



