By: Jouni Osmala (josmala.delete@this.cc.hut.fi), July 16, 2015 3:52 am
Room: Moderated Discussions
> You've misunderstood my argument, then. My argument is that a software author is faced with two choices:
>
>
> - Use the slow unaligned operations anywhere where it's possible that another component of the system
> has mistakenly given you unaligned arguments. Effectively, this means that every argument has to be
> copied on entry to a function, because it's possible for the final dereference of a pointer to be a
> long way from its origin; for example, a caller could go "oh, this is always the 4th payload byte in
> an IP packet, therefore it's 4 byte aligned", and not be aware that it's been called with a pointer
> to an IP packet embedded in an Ethernet frame, where the start of the Ethernet frame is aligned. You
> will notice that there is no correctness issue arising from using an unaligned read here.
> - Use the fast aligned operations, and check that your callers get it right - this means extra code
> whenever you're passed data from outside to confirm that they've considered alignment properly,
> and switch to a slow path if they haven't, so that when someone passes (ethernet_frame_start + 14)
> to the IPv4 packet handler as ip_packet_start, and it passes (ip_packet_start + IHL * 5) to the
> payload handler, you get an unaligned read, rather than a fast aligned read. Note that the read
> for IHL will be naturally aligned, even if the packet start itself is not, as IHL is a 4 bit field
> on a 4 bit boundary, and I've pushed the packet out by 2 bytes from perfect alignment.
>
>
> Given that option 1 is slow even when people get it right, you'd expect software engineers who care about performance at all to write option 2. However, option 2 is simply "is the input aligned? If so, use fast instruction for aligned read, else use slow instruction for unaligned read" - and I don't see why that burden can't be passed down to the hardware.
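The dispatch described in option 2 can be sketched in C. This is only an illustration of the idea, not code from either poster; the function name is made up, and the aligned branch uses a plain pointer cast to stand in for the "fast aligned instruction" (which technically relies on implementation-defined behavior that mainstream compilers support):

```c
#include <stdint.h>
#include <string.h>

/* Option 2 in software: branch on alignment before every read.
 * Illustrative sketch only; read_u32 is a hypothetical helper. */
static uint32_t read_u32(const void *p)
{
    if (((uintptr_t)p & 3u) == 0) {
        /* Aligned: a single plain load, the fast path.
         * (Cast stands in for the "aligned load" instruction.) */
        return *(const uint32_t *)p;
    } else {
        /* Unaligned: byte-wise copy, the slow path. */
        uint32_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }
}
```

The point of the argument is that this branch is exactly the check the hardware could perform instead, since it already knows the address at load time.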
Here's the problem: in the pipeline, the decision about what to execute next is made several cycles before the address is calculated, so you cannot really exploit a data-dependent speedup on low-latency instructions, such as a load that hits in the L1 cache.

So well-pipelined hardware on modern processes cannot make that decision efficiently. Of course, it could assume every access is aligned and reissue all the dependent instructions when the guess turns out wrong, but that turns every unaligned access into a multi-cycle penalty instead of a single extra cycle, and wastes execution slots that other instructions could have used. An alignment predictor is also a pretty costly solution for a minimal potential gain. In the end, both options are still slower than just putting everything through the slow path, making that path as fast as possible, and, for scheduling purposes, assuming every access is unaligned but doesn't cross a cache line.
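The "one path for everything" outcome argued for above is, in effect, what portable software already relies on: a `memcpy`-based unaligned read carries no software alignment branch at all, and on targets whose hardware tolerates misalignment, compilers typically lower it to a single load. A minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>
#include <string.h>

/* Portable unaligned 32-bit read with no alignment branch.
 * On hardware that handles misaligned accesses, compilers
 * generally lower this memcpy to one load instruction; the
 * penalty is then paid only on accesses that actually split
 * a cache line, as the post describes. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, pulling a 32-bit field out of a packet buffer at an arbitrary byte offset needs no alignment check at all with this idiom.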