Article: AMD's Jaguar Microarchitecture
By: Michael S (already5chosen.delete@this.yahoo.com), April 6, 2014 1:18 am
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on April 5, 2014 1:00 pm wrote:
> EduardoS (no.delete@this.spam.com) on April 5, 2014 12:29 pm wrote:
> > Brett (ggtgp.delete@this.yahoo.com) on April 5, 2014 12:12 pm wrote:
> > > You have ~100 instructions in the speculation queue and just changed the FPU mode;
> > > your FPU opcode picker is grabbing random instructions before and after the FPU
> > > mode change and executing them as load dependencies become available.
> > >
> > > I would bet some (most?) CPUs just flush the queue, which is expensive.
> >
> > Internally, make multiple copies of each FPU instruction,
> > one for each mode. When decoding FPU instructions, look at
> > the control register to issue the correct one. When reading
> > a "change control register" instruction, halt the FPU
> > decoder until that instruction is executed (even if speculatively).
> > No pipeline/queue flush or whatever. OK, it
> > is more expensive than a simple nop, but not as expensive as the 100 flushed instructions you implied.
>
> There are a lot of ways to fix this, including your partial fix.
> Flushing the queue is easy and already implemented for other uses; doing things "right"
> costs engineering time, verification time, transistors, heat, etc. Considering how rare
> FPU mode changes are, I would argue flushing is the right thing to do on just about every
> metric except the one where programmers want to switch FPU mode on every function call.
>
> Historical baggage is a killer of new better ideas.
>
> So what do modern Intel and IBM chips do on an FPU mode change?
> Someone undoubtedly cares and has benchmarked it.
>
On big Intel cores, a write to MXCSR (the SSE status/control register) is serializing, which I interpret as a pipeline drain rather than the pipeline flush you implied in the post above. Reads are also slow, but not quite as bad; I don't know exactly how slow.
On Bonnell/Saltwell, writes to MXCSR are relatively quick (4 clocks), but reads are slow (15 clocks).
On the other hand, a write to the x87 control word (FCW) on big Intel cores can be rather fast if you follow certain rules (oscillate between two constant FCW values, which happens naturally in our case of library entry/exit).
I don't know how fast or slow it is on AMD, Intel Silvermont, or the various IBM Power processors.
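For concreteness, here is a minimal sketch of the library entry/exit pattern I have in mind, written against MXCSR rather than FCW because the SSE intrinsics are easier to show. The names lib_scale_truncating, read_mxcsr and write_mxcsr are made up for illustration, skipping the write when the mode already matches is my own choice, and the non-x86 fallback is only there so the sketch compiles everywhere. Treat it as a sketch under those assumptions, not as what any particular library does.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
static uint32_t read_mxcsr(void)        { return _mm_getcsr(); }
static void     write_mxcsr(uint32_t v) { _mm_setcsr(v); }
#else
/* Assumption: on non-x86 builds a plain variable stands in for MXCSR. */
static uint32_t fake_mxcsr = 0x1F80u;   /* x86 reset value of MXCSR */
static uint32_t read_mxcsr(void)        { return fake_mxcsr; }
static void     write_mxcsr(uint32_t v) { fake_mxcsr = v; }
#endif

#define MXCSR_RC_MASK 0x6000u  /* rounding-control field, bits 13-14 */
#define MXCSR_RC_ZERO 0x6000u  /* RC = 11b: round toward zero */

/* Hypothetical library routine that wants truncation internally. */
float lib_scale_truncating(float x, float k)
{
    uint32_t caller = read_mxcsr();   /* entry: save the caller's mode */
    uint32_t mine   = (caller & ~MXCSR_RC_MASK) | MXCSR_RC_ZERO;
    assert((mine & MXCSR_RC_MASK) == MXCSR_RC_ZERO);

    if (mine != caller)
        write_mxcsr(mine);            /* first mode write */

    /* volatile discourages the compiler from hoisting the multiply
       across the mode writes; it is not a formal guarantee. */
    volatile float r = x * k;

    if (mine != caller)
        write_mxcsr(caller);          /* exit: second mode write; also drops
                                         any flags raised by the multiply */
    return r;
}
```

Every call costs one read and up to two writes, and the two values written are always the same pair (the caller's word and the library's word), which is the "two constant values" situation described above for FCW.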