By: Jouni Osmala (email@example.com), July 17, 2015 12:18 am
Room: Moderated Discussions
> > > Given that option 1 is slow even when people get it right, you'd expect software engineers
> > > who care about performance at all to write option 2. However, option 2 is simply "is the input
> > > aligned? If so, use fast instruction for aligned read, else use slow instruction for unaligned
> > > read" - and I don't see why that burden can't be passed down to the hardware.
> > Here's the problem. In the pipeline, the decision of what to execute next happens several
> > cycles before the address is calculated, so you cannot really take advantage of random
> > speed-ups on low-cycle-count instructions, which a load with an L1 cache hit should be.
> > So well-pipelined hardware on modern processes cannot really make that decision efficiently.
> > Of course, it could assume the access is always aligned and reissue all the dependent
> > instructions when the guess is wrong, but that gives a multi-cycle penalty on every
> > unaligned access instead of a single cycle, and wastes execution slots that other
> > instructions could have used. A predictor is also a pretty costly solution for minimal
> > potential gain. In the end it's still slower than just putting everything through the
> > slow path, making that as fast as possible, and assuming for scheduling purposes that
> > every access is unaligned but doesn't cross cache lines.
> Of course hardware isn't perfect, and usually for good reasons. My argument is that these deficiencies
> (like no real world CPU offering sequential consistency) are just that - deficiencies that software
> has to live with because there are real world limits on the hardware, not hardware features that
> software should embrace. In a similar vein, it'd be nice if CPUs could offer zero-cost bounds
> checking (when accesses are in bounds - only pay a penalty if the bounds check fails); again,
> I understand why CPUs don't offer this, but it would still be nice to have.
> Equally, because these are things that would be nice to have, I can hope that a hardware
> designer has the sort of epiphany that (for example) enabled them to replace clustered message
> passing systems with NUMA, or that enabled Intel's SNB and later CPUs to have no penalty
> for unaligned accesses. Maybe one day, a designer will come up with an innovative way to
> have sequential consistency in a large system, or to get bounds checking at no penalty.
I would say that the penalty for unaligned accesses is still there, but aligned accesses eat it too. Intel has pretty high L1 D-cache latency for its size, and that's the cost of all the features Intel puts into its memory accesses.
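The software "option 2" from the quoted post can be sketched as follows. This is only an illustration of the branch the poster argues should be the hardware's job; `load_aligned` and `load_unaligned` are hypothetical stand-ins for the fast aligned and slow unaligned instruction sequences.

```c
#include <stdint.h>
#include <string.h>

/* Fast path: a single aligned load. Only called when the pointer is
 * known to be 4-byte aligned, so the cast is valid. */
static uint32_t load_aligned(const void *p) {
    return *(const uint32_t *)p;
}

/* Slow path: memcpy lets the compiler emit a safe unaligned sequence. */
static uint32_t load_unaligned(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* "Option 2": branch on alignment, then dispatch to the right path.
 * The quoted post argues this check should live in hardware instead. */
static uint32_t load32(const void *p) {
    if (((uintptr_t)p & 3u) == 0)
        return load_aligned(p);
    return load_unaligned(p);
}
```

The branch itself is the problem the post describes: in hardware, the equivalent decision would have to be made several cycles before the address is even known.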
Additional hardware is only "almost free" if it is small, runs in parallel with existing hardware, doesn't require additional inputs from registers, doesn't introduce a dependency between an earlier pipeline stage and a later one, and fits inside an existing execution unit.
I'd say that I'm no expert on the physical limitations of hardware, but most problems with software people's ideas are obvious from the logical design alone. The common pitfalls for pure software people are: forcing an O(n^2) algorithm to always take 2n instead of n as input because it makes a few things a percentage point or two cheaper in software; assuming an earlier pipeline stage has the same information as the execute stage; or just adding more sequential hardware operations to a common operation.
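The O(n^2) pitfall above is simple arithmetic: doubling the input quadruples the work, so a percentage-point software saving can cost 4x downstream. A toy illustration, with an operation counter standing in for the hardware-side cost:

```c
/* Counts the units of work an O(n^2) block performs on n inputs.
 * The double loop is deliberately naive: one unit per input pair. */
static long quadratic_ops(long n) {
    long ops = 0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            ops++;
    return ops;
}
```

Feeding 2n inputs yields exactly 4x the operations of n inputs, which dwarfs a one- or two-percent saving elsewhere.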
Another thing that gets forgotten is that hardware building blocks have to handle the worst-case situation, not the average case.
I believe Haswell made the best possible addition for speeding up bounds checking: increasing the number of branch units from one to two. Since compare-and-branch is already a fused operation, a bounds check costs no more than one additional operation at the execute stage.
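The bounds-check pattern in question can be sketched like this. The function names are illustrative; the point is that the `idx >= len` test typically compiles to a cmp+jcc pair, which macro-fuses into a single micro-op, and a second branch unit lets it issue alongside another branch.

```c
#include <stddef.h>

/* A bounds-checked array read. The comparison and conditional branch
 * compile to a cmp+jcc pair, which the hardware fuses into one
 * operation at the execute stage. Returns 1 on success, 0 if the
 * index is out of bounds. */
static int checked_read(const int *arr, size_t len, size_t idx, int *out) {
    if (idx >= len)       /* cmp idx, len ; jae out_of_bounds */
        return 0;
    *out = arr[idx];
    return 1;
}
```

With two branch ports, this check need not compete with the loop's own branch for an execution slot, which is the addition the post credits to Haswell.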