By: Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com), April 11, 2013 11:24 am
Room: Moderated Discussions
> The even more key point is that if processing includes a cache miss, then overlapping
> of them could cover nearly 100% of the latency. If you can get *two* memops in flight
> before you are stalled on a miss, then you can still cover nearly 100% of them.
I'm lost here
> actually Sandy Bridge will not have any magical device to make this usage redundant. Nor will Haswell.
For some reason, explicit prefetch provides even less speedup (in my use cases) on Ivy Bridge than on Nehalem (I lack Sandy Bridge data). The best speedup for a single thread is around 5%, down to 0% with 8 threads: a complete waste of time, and now I'm even wasting time discussing it on this forum. Damned *%ç&! prefetch
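For context, the kind of explicit prefetch being benchmarked looks roughly like the sketch below: a streaming loop that issues `_mm_prefetch` a fixed distance ahead of the consuming loads. The function name and the prefetch distance `PF_DIST` are illustrative assumptions, not from the post; on cores with aggressive hardware prefetchers (Sandy Bridge / Ivy Bridge and later), the hardware typically detects this stride on its own, which is consistent with the small speedups reported.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Illustrative prefetch distance (elements ahead), an assumption to tune
   per machine; too small hides no latency, too large evicts useful lines. */
#define PF_DIST 16

/* Sum an array while software-prefetching ahead of the load stream. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```

On a simple sequential stride like this, the hardware stream prefetcher usually covers the misses already, so the explicit hints mostly add instruction overhead, one plausible reason the measured gains shrink from ~5% to 0% as more threads saturate memory bandwidth.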