By: Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com), April 11, 2013 10:30 am
Room: Moderated Discussions
> However if you traverse larger and more complex objects, for example, you may
> have known location of several memory addresses inherently in the object.
>
> object = list.ptr;
> do {
> prefetch(object->sub_object);
> prefetch(&object->far_from_start);
> prefetch(object->list.next);
> /*
> * At this point, you have 3 memops in flight. Even if do_something
> * has to wait for one of them, you still get the MLP which probably
> * can not be found by hardware prefetchers, and quite possibly will
> * not be found so early by the OOOE machine.
> */
> do_otherthing(object->sub_object);
> do_something(&object->far_from_start);
> object = object->list.next;
> } while (object);
>
> And now even better, we'll be able to prefetchw, which helps the above pattern quite a lot.
it reminds me of a lot of tests I did in the past, with frustrating <1x speedups
the fact is that when the per-node processing is simple you typically must fetch several nodes (10 or more) in advance, see the sketch below, and when the per-node processing is complex you're probably not latency bound in the first place. there are probably some workloads in between where it helps; I'll be interested to hear about the *actual* speedups people get
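for concreteness, here is a minimal sketch of what I mean by fetching several nodes in advance: the classic run-ahead-pointer trick, using GCC/Clang's __builtin_prefetch; struct node, process() and the distance of 10 are made-up placeholders, not anyone's actual code:

struct node {
    struct node *next;
    int payload;                 /* placeholder field */
};

extern void process(struct node *n);   /* hypothetical per-node work */

#define PF_DIST 10   /* prefetch distance in nodes; needs per-workload tuning */

void walk(struct node *head)
{
    struct node *ahead = head;

    /* put a second pointer PF_DIST nodes in front of the worker;
       this warm-up chase is serial demand misses */
    for (int i = 0; i < PF_DIST && ahead; i++)
        ahead = ahead->next;

    for (struct node *n = head; n; n = n->next) {
        if (ahead) {
            /* prefetch the node process() will reach in PF_DIST
               iterations; 2nd arg 0 = read, 1 = write intent
               (which a compiler may turn into prefetchw) */
            __builtin_prefetch(ahead, 0, 3);
            ahead = ahead->next;
        }
        process(n);
    }
}

note the catch: ahead = ahead->next is still one serial miss per node, the prefetch only guarantees that process(n) itself hits the cache and overlaps that miss instead of adding to it. that per-node latency floor is exactly where my <1x results came from when process() was trivial.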