By: Simon Farnsworth (simon.delete@this.farnz.org.uk), July 16, 2015 1:05 pm
Room: Moderated Discussions
Jouni Osmala (josmala.delete@this.cc.hut.fi) on July 16, 2015 3:52 am wrote:
> >
> > You've misunderstood my argument, then. My argument is that a software author is faced with two choices:
> >
> >
> > - Use the slow unaligned operations anywhere where it's possible that another component of the system
> > has mistakenly given you unaligned arguments. Effectively, this means that every argument has to be
> > copied on entry to a function, because it's possible for the final dereference of a pointer to be a
> > long way from its origin; for example, a caller could go "oh, this is always the 4th payload byte in
> > an IP packet, therefore it's 4 byte aligned", and not be aware that it's been called with a pointer
> > to an IP packet embedded in an Ethernet frame, where the start of the Ethernet frame is aligned. You
> > will notice that there is no correctness issue arising from using an unaligned read here.
> > - Use the fast aligned operations, and check that your callers get it right - this means extra code
> > whenever you're passed data from outside to confirm that they've considered alignment properly,
> > and switch to a slow path if they haven't, so that when someone passes (ethernet_frame_start + 14)
> > to the IPv4 packet handler as ip_packet_start, and it passes (ip_packet_start + IHL * 5) to the
> > payload handler, you get an unaligned read, rather than a fast aligned read. Note that the read
> > for IHL will be naturally aligned, even if the packet start itself is not, as IHL is a 4 bit field
> > on a 4 bit boundary, and I've pushed the packet out by 2 bytes from perfect alignment.
> >
> >
>
> > Given that option 1 is slow even when people get it right, you'd expect software engineers
> > who care about performance at all to write option 2. However, option 2 is simply "is the input
> > aligned? If so, use fast instruction for aligned read, else use slow instruction for unaligned
> > read" - and I don't see why that burden can't be passed down to the hardware.
>
> Here's the problem: in the pipeline, the decision of what to execute next happens several cycles
> before the address is calculated, so you cannot really take advantage of random speed-ups
> on low-cycle-count instructions, such as a load that hits in the L1 cache.
> So well-pipelined hardware on modern processes cannot really make that decision efficiently. Of course, it
> could assume every access is aligned and reissue all the dependent instructions when the guess is wrong, but that would
> give a multi-cycle penalty on all unaligned accesses instead of a single cycle, and waste execution slots
> from other instructions. Having a predictor is also a pretty costly solution for minimal potential gains,
> and in the end it's still slower than just putting everything through the slow path, trying to make that as fast as possible,
> and assuming every access is unaligned but doesn't cross cache lines for scheduling purposes.
>
Of course hardware isn't perfect, and usually for good reasons. My argument is that these deficiencies (like no real-world CPU offering sequential consistency) are just that - deficiencies that software has to live with because there are real-world limits on the hardware, not hardware features that software should embrace. In a similar vein, it'd be nice if CPUs could offer zero-cost bounds checking (when accesses are in bounds - only pay a penalty if the bounds check fails); again, I understand why CPUs don't offer this, but it would still be nice to have.
Equally, because these are things that would be nice to have, I can hope that a hardware designer has the sort of epiphany that (for example) enabled designers to replace clustered message-passing systems with NUMA, or that enabled Intel's SNB and later CPUs to take no penalty for unaligned accesses. Maybe one day, a designer will come up with an innovative way to have sequential consistency in a large system, or to get bounds checking at no penalty.