By: Jörn Engel (joern.delete@this.purestorage.com), May 29, 2022 2:10 pm
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 29, 2022 1:38 pm wrote:
>
> And if your workload is not "dense" then you are probably limited by bandwidth of one or another cache/memory
> level and can't take advantage of amount of FLOPs provided by good old AVX+FMA, much less so by AVX-512.
Not everything is floating point. If I look at my code, almost none of it is. And the rare exceptions tend to exist because I haven't converted them to integer math yet. Anything that is performance-relevant has been converted.
Memory bandwidth doesn't matter much either. If you divide your work into L1-sized chunks, you have enough bandwidth to become compute-limited. And it often helps to vectorize your code even partially.
for (all data) {
    do_one();
    do_other();
}
Can become
for_vectorized (all data) {
    do_one_vectorized();
}
for (all data) {
    do_other();
}
As long as all the data shared between the two loops fits into L1 (or you split things into L1-sized chunks), you don't have to worry about bandwidth. And if you do more interesting work than FMA, you may be rather compute-heavy anyway and care little about bandwidth.
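To make the loop split concrete, here is a minimal sketch in C with integer math. Everything in it is an assumption for illustration: the bodies of do_one() and do_other() are hypothetical stand-ins, CHUNK_ELEMS assumes a 32 KiB L1 data cache, and the first inner loop relies on the compiler auto-vectorizing it (for example at -O3) rather than on explicit intrinsics.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed chunk size: 4096 uint32_t values = 16 KiB, meant to leave room
 * in a 32 KiB L1D for the input as well. Tune for the real target. */
#define CHUNK_ELEMS 4096

/* Hypothetical stand-in for do_one(): independent per element, so a loop
 * over it is a candidate for vectorization. */
static inline uint32_t do_one(uint32_t x)
{
    return x * 2654435761u + 12345u;
}

/* Hypothetical stand-in for do_other(): each step depends on the previous
 * state, so this part stays scalar. */
static inline uint32_t do_other(uint32_t state, uint32_t y)
{
    return ((state << 5) | (state >> 27)) ^ y;
}

/* Original shape: one loop doing both steps per element. */
uint32_t process_fused(const uint32_t *in, size_t n)
{
    uint32_t state = 0;
    for (size_t i = 0; i < n; i++)
        state = do_other(state, do_one(in[i]));
    return state;
}

/* Split shape: per chunk, a vectorizable pass, then a scalar pass. */
uint32_t process_split(const uint32_t *in, size_t n)
{
    uint32_t tmp[CHUNK_ELEMS];  /* data shared between the two loops */
    uint32_t state = 0;

    for (size_t base = 0; base < n; base += CHUNK_ELEMS) {
        size_t len = n - base < CHUNK_ELEMS ? n - base : CHUNK_ELEMS;

        /* Pass 1: no cross-iteration dependency. */
        for (size_t i = 0; i < len; i++)
            tmp[i] = do_one(in[base + i]);

        /* Pass 2: sequential, reads tmp while it is still chunk-local. */
        for (size_t i = 0; i < len; i++)
            state = do_other(state, tmp[i]);
    }
    return state;
}
```

Both functions return the same value for the same input; the split version only changes the order of work so that the intermediate buffer stays small enough to remain in L1 between the two passes. Whether it is actually faster depends on the real do_one() and do_other() and has to be measured.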