By: Exophase (exophase.delete@this.gmail.com), October 1, 2015 10:07 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on October 1, 2015 4:49 pm wrote:
> For example, what happens if the pair loads target different pages?
> You'd need to do two separate translations through the TLB.
You make this relatively unusual case take 2-3 cycles.
Having a similar penalty for cacheline-crossing loads is still not a huge detriment. Even if you limited single-cycle performance to naturally aligned accesses, you would still get significantly better bang for your buck than not having the instruction at all. And since you need decent 64/128-bit load support for SIMD anyway, the wide load path is kind of a given; the only catch is supporting the two register destinations.
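To make the fast-path/slow-path split concrete, here is a rough sketch of the check I have in mind. The 4 KB page size, 64-byte line size, the 3-cycle slow path, and the names `crosses` and `load_pair_cycles` are all assumptions for illustration, not a description of any real core.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 64    # assumed cache line size in bytes


def crosses(addr: int, size: int, boundary: int) -> bool:
    """Return True if the access [addr, addr + size) spans two boundary-sized blocks."""
    return addr // boundary != (addr + size - 1) // boundary


def load_pair_cycles(addr: int, size: int, slow_cycles: int = 3) -> int:
    """Toy cost model for a load-pair of `size` total bytes at `addr`.

    Hypothetical numbers: 1 cycle on the common path, `slow_cycles` when the
    access straddles a page (two TLB translations) or a cache line.
    """
    if crosses(addr, size, PAGE_SIZE):
        return slow_cycles  # unusual case: second translation needed
    if crosses(addr, size, LINE_SIZE):
        return slow_cycles  # similar penalty for line-crossing accesses
    return 1                # common case: one translation, one line


# A 16-byte pair (two 64-bit registers) at a few addresses.
for a in (0, 48, 56, 4088):
    print(hex(a), load_pair_cycles(a, 16))
```

Every page crossing is also a line crossing when the page size is a multiple of the line size, so the two checks are kept separate only to mirror the two cases discussed above.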