By: Ricardo B (ricardo.b.delete@this.xxxxx.xx), November 5, 2006 9:09 pm
Room: Moderated Discussions
Wilco (Wilco.Dijkstra@ntlworld.com) on 11/5/06 wrote:
---------------------------
>These complex cases often need handling anyway for other reasons and so this is not
>a problem created by unaligned access. For example many architectures have some
>form of multiple load/store for codesize and performance reasons. These can cross
>cachelines and page boundaries without any problems. They typically reexecute from scratch after faulting.
Which architectures have such cases and don't support unaligned loads?
>>MIPS' two instruction solution avoids that problem.
>
>I don't see how that helps. An unaligned load can be split internally into two
>independent accesses like the MIPS instructions. And that is how CPUs deal with the complex cases.
The problem is that a single instruction generates two accesses, yet it must be visible as a single operation (I don't mean atomic).
Consider a worst-case scenario: an unaligned store that crosses both a cache line boundary and a page boundary:
- None, one or both of the cache lines may be a store miss
- None, one or both of the pages may cause a trap
And the microarchitecture has to handle every possible case transparently.
You can't end up with a software visible state where one of the accesses has been performed but the other hasn't or has failed.
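To make the worst case concrete, here is a minimal sketch in C of the check that decides whether one access touches two cache lines or two pages. The 64-byte line and 4096-byte page sizes, and the names `crosses_boundary`, `crosses_cache_line` and `crosses_page`, are assumptions for illustration only; real sizes depend on the processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration; actual values are implementation specific. */
#define SKETCH_LINE_BYTES 64u
#define SKETCH_PAGE_BYTES 4096u

/* True if the bytes [addr, addr + size) fall into two different blocks
 * of block_size bytes. Ignores wrap-around at the top of the address space. */
static inline bool crosses_boundary(uintptr_t addr, size_t size,
                                    uintptr_t block_size)
{
    assert(size > 0 && size <= block_size);
    return (addr / block_size) != ((addr + size - 1) / block_size);
}

static inline bool crosses_cache_line(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, SKETCH_LINE_BYTES);
}

static inline bool crosses_page(uintptr_t addr, size_t size)
{
    return crosses_boundary(addr, size, SKETCH_PAGE_BYTES);
}
```

Under these assumed sizes, a 4-byte store at address 4094 crosses both a line and a page, so each half can independently hit or miss in the cache and independently trap.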
Of course, it's not that big of a burden. But it's a burden some consider excessive for simpler processors.
With MIPS' solution, the microarchitecture doesn't have to handle such cases, because the two accesses come from two different instructions and thus their consequences can be visible independently.
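The two-step idea can be sketched in C as follows. This is not the exact semantics of the MIPS instruction pair, only an illustration of building an unaligned 32-bit value from two separate aligned word reads, each of which could fault or miss on its own. The function name `load_u32_two_step` is hypothetical, and bytes are assumed to be numbered little-endian within each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit value starting at byte offset `off` in an array of words,
 * using only aligned word reads. Bytes are numbered little-endian within
 * each word (an assumption of this sketch). */
static inline uint32_t load_u32_two_step(const uint32_t *words, size_t nwords,
                                         size_t off)
{
    size_t idx = off / 4;
    unsigned shift = (unsigned)(off % 4) * 8;

    assert(idx < nwords);
    uint32_t lo = words[idx];      /* first aligned access */
    if (shift == 0)
        return lo;                 /* already aligned, one access suffices */

    assert(idx + 1 < nwords);
    uint32_t hi = words[idx + 1];  /* second, independent aligned access */

    /* Merge the upper bytes of the first word with the lower bytes of the second. */
    return (lo >> shift) | (hi << (32 - shift));
}
```

Because the two reads are separate steps, a fault on the second one can be taken with the first already complete, which is exactly the state a single unaligned instruction is not allowed to expose.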
>>Another solution is the POWER and IPF approach: support the simpler cases in hardware,
>>trap the complex ones in software.
>
>Trapping complex cases only works if they are extremely rare.
>Users have little
>control over data placement, so imagine crossing a page boundary inside a critical
>loop and suddenly get a 1000x slowdown...
Which cases are supported by hardware and which are trapped isn't cast in stone for either ISA -- designers are mostly free to make that tradeoff for each microarchitecture as they see fit.