Article: AMD's Jaguar Microarchitecture
By: Brett (ggtgp.delete@this.yahoo.com), April 6, 2014 10:08 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on April 6, 2014 2:18 am wrote:
> Brett (ggtgp.delete@this.yahoo.com) on April 5, 2014 1:00 pm wrote:
> > EduardoS (no.delete@this.spam.com) on April 5, 2014 12:29 pm wrote:
> > > Brett (ggtgp.delete@this.yahoo.com) on April 5, 2014 12:12 pm wrote:
> > > > You have ~100 instructions in the speculation queue and just changed the FPU mode,
> > > > your FPU opcode picker is grabbing random instructions before and after the FPU
> > > > mode change and executing them as load dependencies become available.
> > > >
> > > > I would bet some (most?) CPUs just flush the queue, which is expensive.
> > >
> > > Internally make multiple copies of each FPU instruction,
> > > one for each mode. When decoding FPU instructions, look at
> > > the control register to issue the correct one; when reading
> > > a "change control register" instruction, halt the FPU
> > > decoder until that instruction is executed (even if speculatively).
> > > No pipeline/queue flush or whatever. OK, it
> > > is more expensive than a simple nop, but not as expensive as the 100 flushed instructions you implied.
> >
> > There are a lot of ways to fix this, including your partial fix.
> > Flushing the queue is easy and already implemented for other uses; doing things "right"
> > costs engineering time, verification time, transistors, heat, etc. Considering how rare
> > FPU mode changes are, I would argue flushing is the right thing to do on just about every
> > metric except the one where programmers want to switch FPU mode on every function call.
> >
> > Historical baggage is a killer of new better ideas.
> >
> > So what do modern Intel and IBM chips do on an FPU mode change?
> > Someone undoubtedly cares and has benchmarked it.
> >
>
> On big Intel cores the write to MXCSR (SSE status/control register) is serializing, which
> I interpret as a pipeline drain rather than the pipeline flush you implied in a post above.
> Reads are also slow, but not quite as bad. I don't know how slow exactly.
> On Bonnell/Saltwell writes to MXCSR are relatively quick (4 clocks), but reads are slow (15 clocks).
I clearly used the wrong word when I said "flush"; sorry, I meant that you have to wait for all previous FPU opcodes to finish, i.e. drain. That is probably not so bad: the instruction decoder, branch predictor, etc. can keep running. You are just waiting for in-order instruction commit to reach the MXCSR write before issuing the FPU instructions that come after it.
Four clocks to write MXCSR matches the FPU pipeline execute length (add/mul, but not divide), so it is actually waiting for the adder/multiplier to empty before setting state on all four stages at once.
This implies a separate check to see whether the divider is in use, with a longer wait if it is.
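(For concreteness, a write of that sort looks like this in C with the standard SSE intrinsics from <xmmintrin.h>. The FTZ/DAZ bit positions are architectural; the helper name and the choice of bits are just my illustration.)

    #include <xmmintrin.h>   /* _mm_getcsr / _mm_setcsr */

    #define MXCSR_FTZ 0x8000u   /* bit 15: flush denormal results to zero */
    #define MXCSR_DAZ 0x0040u   /* bit 6: treat denormal inputs as zero */

    /* One serializing MXCSR write of the kind discussed above.  Note the
       read-modify-write pattern also pays for the slow MXCSR read. */
    static void enable_fast_math(void)
    {
        _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
    }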
Fifteen clocks to read MXCSR would be for gathering FPU exception information from the last operation, I would guess: not just the simple state info, which would take one cycle, but the six sticky flags.
This matches the divide instruction length, which might be the most likely use: checking whether the divide results were valid.
It could return early if there is no divide in the pipe.
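That use would look something like this sketch. The six sticky-flag bit positions are architectural; checking only IE and ZE after a divide is my assumption about the use case.

    #include <xmmintrin.h>

    #define MXCSR_IE 0x01u   /* bit 0: invalid operation (e.g. 0.0/0.0) */
    #define MXCSR_ZE 0x04u   /* bit 2: divide-by-zero */

    /* Returns nonzero if the preceding SSE divides produced valid results.
       The _mm_getcsr() is the ~15-clock read discussed above: it has to wait
       for any in-flight divide to deliver its sticky-flag updates. */
    static int divide_results_valid(void)
    {
        return (_mm_getcsr() & (MXCSR_IE | MXCSR_ZE)) == 0;
    }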
Overall I would say that as long as you are executing at least 100 instructions between MXCSR writes, the overhead is in the noise. I am hard pressed to think of real pathological cases where things go bad and performance suffers.
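If anyone wants to check that, a crude __rdtsc loop along these lines would give a ballpark number. This is a sketch, not a careful benchmark: GCC/Clang on x86 assumed, no serializing fences around the timestamp reads, no control of frequency scaling.

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc, _mm_getcsr, _mm_setcsr */

    int main(void)
    {
        const int iters = 1000000;
        unsigned int csr = _mm_getcsr();   /* rewrite the current value */

        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < iters; i++)
            _mm_setcsr(csr);               /* back-to-back MXCSR writes */
        unsigned long long t1 = __rdtsc();

        printf("~%.1f cycles per MXCSR write\n", (double)(t1 - t0) / iters);
        return 0;
    }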
> On the other hand, write to x87 control word (FCW) on big Intel cores can
> be rather fast if you follow certain rules (oscillate between two constant
> FCW values, which happens naturally in our case of library entry/exit).
Which brings up the question of whether every math library call sets MXCSR. (My math libs never touch MXCSR; the hardware I used defaulted to bad-but-fast math.)
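The entry/exit trick Michael S describes would look roughly like this sketch (GCC-style inline asm assumed; 0x027F is the standard x87 control-word encoding for 53-bit precision with all exceptions masked, but the convention itself is just illustrative):

    /* Save the caller's x87 control word on entry, load the library's
       constant FCW, restore on exit.  The core then only ever sees the
       same two constant values alternating, the fast case on big cores. */
    static const unsigned short LIB_FCW = 0x027F;

    double lib_function(double x)
    {
        unsigned short saved_fcw;
        __asm__ volatile ("fnstcw %0" : "=m"(saved_fcw));  /* store caller's FCW */
        __asm__ volatile ("fldcw %0"  : : "m"(LIB_FCW));   /* load the library's FCW */

        double result = x * 0.5;                           /* ... library body ... */

        __asm__ volatile ("fldcw %0"  : : "m"(saved_fcw)); /* restore on exit */
        return result;
    }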
Intel really should have put the sticky bits in another register.
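Because the sticky flags share the register with the control bits, even just clearing them forces a full read-modify-write, paying both the slow read and the serializing write. A one-liner shows the problem:

    #include <xmmintrin.h>

    /* Clearing only the six sticky exception flags (bits 0-5) still costs a
       slow MXCSR read plus a serializing write, because status and control
       bits live in one register.  Separate registers would avoid this. */
    static void clear_sse_exception_flags(void)
    {
        _mm_setcsr(_mm_getcsr() & ~0x3Fu);
    }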
> I don't know how fast/slow it works on AMD, Intel Silvermont and on various IBM Power processors.