By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), May 7, 2013 7:17 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on May 7, 2013 5:49 am wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on May 7, 2013 5:37 am wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on May 7, 2013 12:49 am wrote:
> > > David Kanter (dkanter.delete@this.realworldtech.com) on May 6, 2013 5:48 pm wrote:
> > > > Michael S (already5chosen.delete@this.yahoo.com) on May 6, 2013 3:59 pm wrote:
> > > >
> > > > > Thank you, David, good article.
> > > > >
> > > > > Two questions:
> > > > > 1. How Silvermont handles load-op and load-op-store x86 instructions. Are they cracked
> > > > > before ROB, consuming multiple ROB entries, or after ROBs, consuming just one entry?
> > > >
> > > > Instructions only take a single ROB entry, there is no cracking.
> > >
> > > First, a short rant.
> > > [rant on]
> > > You say there is no cracking, so, may be, they decided to call it fracking. But the
> > > procedure, equivalent to cracking has to be here, we see the need for it in the structure
> > > of pipeline and in fact that there is no buddy-ALU hiding near AGU.
> > > I think, there are no reasons why we should not use established term, i.e. cracking.
> > > [/rant off]
> >
> > Indeed you may need cracking for complex operations, and most definitely for microcoded instructions.
> >
> > > With a single ROB entry serving a whole x86 instruction, I wonder how further co-ordination
> > > between memory part(s) and ALU part of complex instruction is going on. For example, we have
> > > "add EAX, [EBX]". EBX value is available as well as AGU resources, but EAX is still unknown.
> > > Will load uOP be issued for execution or will it wait for availability of EAX?
> >
> > If it isn't cracked in decode, it will receive a temporary register during renaming and cracked
> > into two uops, likely when dispatching to reservation stations. So the LS unit sees a mov
> > TMP, [EBX] and the integer unit gets add EAX, TMP. The OoO machinery does the rest.
> >
>
> That's how machines with cracking-before-ROB operate.
> But would it still work for cracking-after-ROB?
It is indeed similar. A load-op may have its load dispatched to the memory reservation station in one cycle (possibly alongside the previous instruction), and its ALU operation dispatched in the next cycle (possibly alongside the next instruction). That is simpler and more power-efficient than cracking into uops much earlier, which would require larger buffers throughout.
Whether you crack early or late, a load+op effectively uses 2 dispatch cycles, just as if you had written a separate load and ALU instruction. So it is silly to claim that macro instructions improve performance or make your CPU appear wider, as Anand did. Unlike on the old Atom, where load+op could actually improve performance, you now want to avoid them, as on all other x86 cores, unless the loaded value is single-use.
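To make the counting argument concrete, here is a minimal Python sketch. It is a toy model, not a description of Silvermont: the names `crack` and `dispatch_slots` are hypothetical, it only understands the two-operand "op dst, src" form, and it assumes every uop costs exactly one dispatch slot regardless of where the split happens.

```python
from typing import List, Tuple

Uop = Tuple[str, str, str]  # (unit, opcode, operands)

def crack(instr: str) -> List[Uop]:
    """Split 'op dst, [mem]' into a load uop plus an ALU uop via TMP.

    Plain loads and register-only instructions stay as a single uop.
    """
    opcode, operands = instr.split(None, 1)
    dst, src = [s.strip() for s in operands.split(",")]
    is_mem_src = src.startswith("[") and src.endswith("]")
    if is_mem_src and opcode != "mov":
        # load-op: the LS unit gets the load, the integer unit gets the op
        return [("mem", "mov", f"TMP, {src}"), ("alu", opcode, f"{dst}, TMP")]
    unit = "mem" if "[" in operands else "alu"
    return [(unit, opcode, f"{dst}, {src}")]

def dispatch_slots(program: List[str]) -> int:
    """Count dispatch slots, assuming one slot per uop (toy assumption)."""
    return sum(len(crack(instr)) for instr in program)

fused = ["add EAX, [EBX]"]
split = ["mov TMP, [EBX]", "add EAX, TMP"]
print(crack(fused[0]))
print(dispatch_slots(fused), dispatch_slots(split))
```

Under these assumptions the load-op form and the hand-split form consume the same number of slots, which is the point being made: the macro instruction saves code bytes and a ROB entry, not dispatch bandwidth.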
> Looks like you'll need bigger ROB - up to 3 register inputs + temporary that starts life as an output
> then became an input + one genuine output. Unless they found some simplifying trick it looks ugly.
Yes, it would be wasteful to go for 4 inputs and 2 outputs just to model the temporary, so it is likely handled specially. E.g., for load+op the temporary is always written once, then read once, and is dead after that.
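The write-once, read-once lifetime can be sketched as a tiny state machine in Python. This is purely illustrative: the `Temp` class and its state names are hypothetical, and nothing here is known about how the hardware actually tracks the temporary.

```python
class Temp:
    """Toy model of a load-op temporary: one write, one read, then dead."""

    def __init__(self) -> None:
        self.state = "free"
        self.value = None

    def write(self, value: int) -> None:
        # The load uop is the only producer.
        if self.state != "free":
            raise RuntimeError(f"write in state {self.state!r}")
        self.value = value
        self.state = "written"

    def read(self) -> int:
        # The ALU uop is the only consumer; the temporary dies on read.
        if self.state != "written":
            raise RuntimeError(f"read in state {self.state!r}")
        value = self.value
        self.value = None
        self.state = "dead"
        return value

tmp = Temp()
tmp.write(42)      # load uop completes
got = tmp.read()   # ALU uop consumes the value
print(got, tmp.state)
```

Because the lifetime is this constrained, such a temporary would not need to be modelled as a general extra input and output of the instruction, which is the simplification suggested above.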
Wilco