By: Patrick Chase (patrickjchase.delete@this.gmail.com), February 4, 2013 4:43 pm
Room: Moderated Discussions
none (none.delete@this.none.com) on February 4, 2013 4:12 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 2:01 pm wrote:
> > Patrick Chase (patrickjchase.delete@this.gmail.com) on February 2, 2013 1:13 pm wrote:
> > > none (none.delete@this.none.com) on February 2, 2013 10:43 am wrote:
> > > > That 24 entries figure is wrong and anyway A9 has no ROB as found in other OoO CPU :)
> > >
> > > ARM describes the A9 as doing "out of order issue". The TRM further describes it as using register
> > > renaming to resolve WAW/WAR hazards without stalling, which implies a Tomasulo machine or similar.
> > > The ARM ISA requires precise exceptions. I've developed OS code including fault handlers and
> > > context-switching for A9, so I *know* it implements precise exceptions.
> > >
> > > I'm not aware of a means of implementing out-of-order issue with non-stalling resolution of WAR/WAW
> > > and precise exceptions without some structure equivalent to an ROB in function if not in name.
> > > I'm aware of the ARM slideset that says that A9 does OoO "without a power-hungry ROB" but I suspect
> > > that's misworded and simply means that they used a physical register file to avoid Tomasulo's
> > > reservation stations and common results bus (just like Sandy/Ivy Bridge and many other recent
> > > OoO microarchitectures). The ROB itself isn't particularly power-hungry.
> > >
> > > Can you explain how A9 achieves out-of-order issue with renaming
> > > and precise exceptions without an ROB or equivalent?
> >
> > Sorry to follow-up my own post again, but...
> >
> > I did some digging and found multiple sources that refer to a 24-entry "data-less
> > ROB" in A9. Unfortunately ARM has pulled down the original documents (particularly
> > the devcon 2007 A9 architecture slides) that those sources refer to.
>
> The 24-entry comes from people believing that 56 physical
> registers for 32 architectural ones means 24 entries :-)
>
> > That tends to reinforce what I hypothesized above: It uses a PRF instead of reservation stations and
> > a common results bus, so the ROB only needs to track instruction order and state (speculative or not,
> > written back to PRF or not) as opposed to instruction results. Hence "data-less", just like Sandy Bridge,
> > Bobcat, and a whole lot of other modern OoO microarchitectures :-). It's still an ROB, though.
>
> A very very simple one then, which is what I meant ;)
So then why did you say "the A9 has no ROB"?
There have been many microarchitectures with this scheme (PRF + dataless ROB), going back to the MIPS R10000 and Alpha 21264, and probably before that. The convention in the literature has always been to refer to those structures as "reorder buffers", and perhaps clarify that they are dataless. See for example figure 5 here:
http://www.ecs.umass.edu/ece/koren/ece568/papers/Pentium4.pdf
I suspect that the "ROB" terminology has been kept because reorder buffers were devised to implement precise exceptions in the basic Tomasulo architecture. Whether they also store results or not is secondary to that basic function. See for example the seminal paper on the topic:
http://dl.acm.org/citation.cfm?id=327125, also available for download here: http://lmi17.cnam.fr/~anceau/Documents/smith.pdf
While ARM may have claimed that the A9 was "ROB-less" to make it look new and revolutionary, they merely made themselves look silly. To my knowledge (which is quite fallible - I'd love to be corrected on this one) you can't efficiently do precise exceptions in an OoO machine [*] without an ordered list of pending instructions and their status, and that is by definition a reorder buffer.
So, now that we've established conclusively that A9 has an ROB, how many entries do you think it has?
[*] I can conceive of some very inefficient mechanisms, but those basically come down to reconstructing the equivalent of an ROB after the fact...
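To make "an ordered list of pending instructions and their status" concrete, here is a minimal Python sketch of a PRF + dataless ROB. It is purely illustrative and makes no claim about how A9 (or any real core) is built: the class and field names, the tiny register counts, and the walk-back recovery on a fault are all assumptions of mine. The point is only that entries hold rename bookkeeping and status bits, never result values, and that in-order retirement from the head is what makes the exception precise.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    # Hypothetical entry layout. No result value is stored here:
    # results live in the physical register file (PRF).
    arch_reg: int        # architectural destination register
    new_phys: int        # physical register allocated at rename
    old_phys: int        # previous mapping: freed at retire, restored on a flush
    done: bool = False   # result has been written back to the PRF
    fault: bool = False  # instruction raised an exception


class DatalessRob:
    def __init__(self, num_arch=4, num_phys=8, rob_size=4):
        # Sizes are arbitrary illustration values, not those of any real core.
        self.rename = list(range(num_arch))           # speculative arch -> phys map
        self.free = deque(range(num_arch, num_phys))  # free physical registers
        self.rob = deque()                            # program order, oldest first
        self.rob_size = rob_size

    def dispatch(self, arch_reg):
        """Rename a destination register and allocate an entry in program order."""
        if len(self.rob) == self.rob_size or not self.free:
            return None  # structural stall: no ROB entry or no free register
        entry = RobEntry(arch_reg, self.free.popleft(), self.rename[arch_reg])
        self.rename[arch_reg] = entry.new_phys
        self.rob.append(entry)
        return entry

    def complete(self, entry, fault=False):
        """Mark an entry finished. Completion may happen in any order."""
        entry.done = True
        entry.fault = fault

    def retire(self):
        """Retire in order from the head. On a fault at the head, undo the
        faulting and all younger renames, then report the faulting entry.
        Returns (number retired, faulting entry or None)."""
        retired = 0
        while self.rob and self.rob[0].done:
            head = self.rob[0]
            if head.fault:
                # Walk back from the youngest entry, restoring old mappings.
                while self.rob:
                    e = self.rob.pop()
                    self.rename[e.arch_reg] = e.old_phys
                    self.free.appendleft(e.new_phys)
                return retired, head
            self.rob.popleft()
            self.free.append(head.old_phys)  # old value can no longer be needed
            retired += 1
        return retired, None
```

For example, dispatch three instructions, complete the youngest first, and let the middle one fault: nothing retires until the oldest is done, and after the fault the rename map looks as if only the oldest instruction ever executed. Take away the ordered queue and there is nothing left that says which renames to undo, which is the sense in which the structure is an ROB whether or not it holds data.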