By: Brendan (btrotter.delete@this.gmail.com), April 13, 2013 2:20 am
Room: Moderated Discussions
Hi,
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 12, 2013 10:44 pm wrote:
> > It seems easy to me - just have a "next" field pointing to the next structure in the list,
> > plus an additional "next_prefetch" field pointing to the structure several nodes ahead.
>
> the prefetch scheduling distance (nr of iterations ahead) is target dependant so I don't really see
> how you can do that, anyway it will defeat the very purpose of linked lists: easy insertion/deletion
> etc. + increase your node size (8B per pointer with 64-bit code) just for prefetch, no, please...
How hard/easy it is depends on what operations you need to do (e.g. if you only ever append on one end and delete from the other, and iterate). It also depends on how much time you spend at each node (if you spend about 300 cycles processing each node then you'd only need to prefetch the next node).
The "next_prefetch" field doesn't need to be 64-bit (could be 32-bit for a lot of software), and even if it is those extra bytes are likely wasted due to alignment (if you're trying to align to cache line boundaries).
> also don't forget the problem with NULL values, to avoid to prefetch NULLs you'll
> need one extra branch pre prefetch, you will waste BTB space, it will delay
> further the prefetch retirement and you'll endure more branch misses
Why use NULL? Use the address of something that's likely to be in cache anyway (e.g. the address of the current structure).
> > I would've assumed "prefetcht0". Sadly, Intel's instruction set reference
> > only gives details for Pentium III and Pentium 4/Xeon.
>
> IIRC prefetcht0 was prefetching to L2 on Pentium 4 and indeed there is no documentation
> for more recent processors, on Ivy Bride all variants give the same timings
> in my workloads for the cases where soft prefetch help (a bit)
Pentium 4/Netburst was a strange thing and I'd assume that Pentium III behaviour is a little more likely for a more modern CPU's "prefetcht0" behaviour. Of course "completely different to all older/existing CPUs/documentation" is possibly even more likely.
I don't know anything about your "cases where soft prefetch help (a bit)". Usually you'd design some sort of micro-benchmark to isolate the behaviour you want from all the other factors that mess up results.
- Brendan
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 12, 2013 10:44 pm wrote:
> > It seems easy to me - just have a "next" field pointing to the next structure in the list,
> > plus an additional "next_prefetch" field pointing to the structure several nodes ahead.
>
> the prefetch scheduling distance (nr of iterations ahead) is target dependant so I don't really see
> how you can do that, anyway it will defeat the very purpose of linked lists: easy insertion/deletion
> etc. + increase your node size (8B per pointer with 64-bit code) just for prefetch, no, please...
How hard/easy it is depends on what operations you need to do (e.g. if you only ever append on one end and delete from the other, and iterate). It also depends on how much time you spend at each node (if you spend about 300 cycles processing each node then you'd only need to prefetch the next node).
The "next_prefetch" field doesn't need to be 64-bit (could be 32-bit for a lot of software), and even if it is those extra bytes are likely wasted due to alignment (if you're trying to align to cache line boundaries).
> also don't forget the problem with NULL values, to avoid to prefetch NULLs you'll
> need one extra branch pre prefetch, you will waste BTB space, it will delay
> further the prefetch retirement and you'll endure more branch misses
Why use NULL? Use the address of something that's likely to be in cache anyway (e.g. the address of the current structure).
> > I would've assumed "prefetcht0". Sadly, Intel's instruction set reference
> > only gives details for Pentium III and Pentium 4/Xeon.
>
> IIRC prefetcht0 was prefetching to L2 on Pentium 4 and indeed there is no documentation
> for more recent processors, on Ivy Bride all variants give the same timings
> in my workloads for the cases where soft prefetch help (a bit)
Pentium 4/Netburst was a strange thing and I'd assume that Pentium III behaviour is a little more likely for a more modern CPU's "prefetcht0" behaviour. Of course "completely different to all older/existing CPUs/documentation" is possibly even more likely.
I don't know anything about your "cases where soft prefetch help (a bit)". Usually you'd design some sort of micro-benchmark to isolate the behaviour you want from all the other factors that mess up results.
- Brendan