New article: AMD's Jaguar Microarchitecture

Article: AMD's Jaguar Microarchitecture
By: Wilco (Wilco.Dijkstra.delete@this.ntlworld.com), April 7, 2014 4:01 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on April 7, 2014 2:52 pm wrote:
> Wilco (Wilco.Dijkstra.delete@this.ntlworld.com) on April 7, 2014 10:47 am wrote:
> > Maynard Handley (name99.delete@this.name99.org) on April 7, 2014 10:38 am wrote:
> > > Michael S (already5chosen.delete@this.yahoo.com) on April 7, 2014 2:20 am wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on April 6, 2014 3:52 pm wrote:
> > > > > Michael S (already5chosen.delete@this.yahoo.com) on April 6, 2014 1:50 am wrote:
> > > > > > Maynard Handley (name99.delete@this.name99.org) on April 5, 2014 6:01 pm wrote:
> > > > > > > Michael S (already5chosen.delete@this.yahoo.com) on April 5, 2014 10:59 am wrote:
> > > > > > > >
> > > > > > > > I don't follow.
> > > > > > > > Why should one save/restore the whole FPU state, why not only the FP control register? Saving/restoring
> > > > > > > > only the FPU control register is not slow relative to the rest of the library call overhead
> > > > > > > > and it *should* be done by any library that wants to use non-default control bits. Anything
> > > > > > > > less is a bug and should not be tolerated by library users.
> > > > > > > >
> > > > > > >
> > > > > > > By FPU state I mean the stuff that would be in an FP control register. I don't
> > > > > > > mean the state of standard FP registers. Sorry, I guess that was not clear.
> > > > > > >
> > > > > > > The point is
> > > > > > > (a) put ALL CONTROL state in ONE register. And don't make the problem artificially
> > > > > > > more complicated by mixing up CONTROL bits with STATUS bits.
> > > > > > > (b) allow rapid reading and writing of that control state register with a single user level
> > > > > > > read/write operation, instead of a twiddle one bit at a time with high latency model.
> > > > > >
> > > > > > Intel/AMD MXCSR indeed combines control and status in the
> > > > > > same register. The same goes for the Power FPSCR. And, indeed,
> > > > > > I don't like it from a purely theoretical point of view. But in practice it is not a significant problem.
> > > > >
> > > > > Reading and writing control both make sense, writing status does not make sense.
> > > >
> > > > Writing status does not make sense, but it does not hurt. At least I don't see why it could possibly hurt.
> > > >
> > > > >
> > > > > Reading control can be as fast as reading any other register, reading status
> > > > > means you have to delay the read until the status is available.
> > > > >
> > > >
> > > > In theory that makes sense.
> > > > Trying to compare the theory with practice:
> > > > x87 has separate control and status words (=good), while SSE has them intermixed (=bad).
> > > > I tried to compare the speed of the FNSTCW instruction (x87 Control Word read=good) vs STMXCSR (SSE
> > > > Control/Status Word read=bad) in Agner's tables. Unfortunately, he provides very little information
> > > > about the latency of these instructions, and even the available information is unlikely to be reliable.
> > > > That lack of reliable information sort of makes sense, since the latency in this case is not
> > > > only hard to measure, it's even hard to define what exactly would be considered latency.
> > > > Agner has throughput numbers, and for the majority of cores the throughput of FNSTCW is, indeed,
> > > > significantly higher than that of STMXCSR, but, first, in this case throughput is not very interesting
> > > > and second, on Nehalem and SB/EB the throughput is the same for both - 1 per clock.
> > > >
> > > >
> > > > > Writing control (in a model where control is separate from status) is also easy and fast because now "all"
> > > > > you do is attach the control bits (as they are defined at
> > > > > the point of instruction execution) to each instruction,
> > > > > and they then flow along with instruction execution. At
> > > > > the point where you're making each decision (rounding,
> > > > > denorm, etc), you consult the attached bit, not the global register. The main additional feature you need
> > > > > to add is that FP instructions later than a change of FP control don't move past an FP control write. (Or
> > > > > you can speculate on this, but an explicit test is probably easier and more performant.)
> > > >
> > > > It's not clear to me where exactly you can do this "easy" attachment on an OoO machine.
> > >
> > > At register rename time. In this model the control bits live in the FP register bank. Rename (which
> > > occurs in order) would read them from the bank for every FP instruction, just adding them to the
> > > instruction as a few more bits at the same time that rename info is being added to an instruction.
> > > Setting the control state would correspond to writing these bits in the FP register bank.
> > > This may require a one cycle hiccup at rename, something like: a write of FP control state must be the
> > > only FP rename processed in that cycle. But that's much cheaper than what we have
> > > today, and will likely occur during function entry/exit where there's a cycle free anyway.
> > >
> > > This implementation means that my previous concerns about FP instructions sliding past the FP control write
> > > now become unimportant. After rename the FP control write is basically a NOP; we only need to keep tracking
> > > it in the case that we want to unwind after a mispredicted branch, so while we need to slot it into the
> > > ROB (probably carrying as freight the PREVIOUS value of the control bits), we can either never even put
> > > it in the dispatch queue, or we can drop it/send it to be (NOP) "executed" as soon as convenient.
> >
> > So how do you know what value you need to attach to each FP instruction, given that the actual
> > value of the control register write will be computed maybe 20 cycles after rename?
> >
> > Wilco
> >
>
> Read what I said.
> The control register write will occur AT the point the "rename" is executed. The rename
> stage for FP-control-write will "perform" by swapping the old FP-control bits (which
> then get attached to the FP-control-write for unwind) and the new FP-control bits,
> which are now the canonical bits, ready for every subsequent instruction.
> This is an unorthodox execution flow, but it's perfectly feasible, and no different in
> principle from things like detecting NOPs and squelching them early in the pipeline.

I read what you wrote, and you didn't take into account that on an OoO machine rename does not execute instructions, so register values are not available at that time. Consider the case where you load the control register value and the load misses the cache. It can easily take 20 or more cycles before the loaded value is available. So how does it get to rename unless you stall the whole pipeline until the load has completed?
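To put rough numbers on that objection, here is a toy cycle model in C. It is only a sketch under stated assumptions: rename is strictly in order, a control write "performed" at rename cannot proceed until its source value exists, and the function names and cycle counts are hypothetical rather than taken from any real core.

```c
#include <assert.h>

/* Toy cycle model; the names and numbers are illustrative assumptions. */

/* Cycles an in-order rename stage has to wait if a control write is
   "performed" at rename and therefore needs its source value there. */
int rename_stall_cycles(int rename_cycle, int value_ready_cycle)
{
    assert(rename_cycle >= 0 && value_ready_cycle >= 0);
    int wait = value_ready_cycle - rename_cycle;
    return wait > 0 ? wait : 0;
}

/* Rename is in order, so every younger instruction (FP or not) slips by
   the same number of cycles. */
int delayed_rename_cycle(int original_cycle, int stall)
{
    return original_cycle + stall;
}
```

For a control write that reaches rename at cycle 10 while its load only returns at cycle 30, rename_stall_cycles(10, 30) gives 20, and an unrelated instruction that would have renamed at cycle 11 now renames at cycle 31.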

The fastest possible implementation is to do the same as you do with the flags register: rename it. Every FP instruction gets the control register as a renamed register source. This means you can write the control register in quick succession without any penalty, and FP operations based on different control register values can be executed OoO. You can get fast status register reads in a similar way. The only question is whether it is worth the cost in complexity and power, as you shouldn't really be reading/writing status registers that often.
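A minimal sketch of that renaming idea, again as a toy model in C rather than a description of real hardware. The structure, the tag count and all function names are assumptions made up for illustration: each control write gets a fresh tag at rename without waiting for its value, and each FP op records whichever tag is current as an extra source.

```c
#include <assert.h>

/* Toy model of renaming the FP control register. All names, sizes and
   latencies are illustrative assumptions, not any real design. */
enum { MAX_TAGS = 8 };

typedef struct {
    int ready_cycle[MAX_TAGS]; /* cycle each physical control register becomes valid */
    int current;               /* tag the control register is currently mapped to */
    int allocated;             /* number of tags handed out so far */
} ctrl_rename;

void ctrl_init(ctrl_rename *r)
{
    r->ready_cycle[0] = 0; /* the initial control value is valid from cycle 0 */
    r->current = 0;
    r->allocated = 1;
}

/* Rename a control write: allocate a fresh tag without waiting for the value.
   value_ready_cycle is when the write's source data (e.g. a load) arrives. */
int rename_ctrl_write(ctrl_rename *r, int value_ready_cycle)
{
    assert(r->allocated < MAX_TAGS);
    int tag = r->allocated++;
    r->ready_cycle[tag] = value_ready_cycle;
    r->current = tag;
    return tag;
}

/* Rename an FP op: it simply records the current tag as an extra source. */
int rename_fp_op(const ctrl_rename *r)
{
    return r->current;
}

/* Earliest cycle an FP op can issue: both its data operands and its
   control tag must be ready. */
int fp_issue_cycle(const ctrl_rename *r, int ctrl_tag, int operands_ready_cycle)
{
    int c = r->ready_cycle[ctrl_tag];
    return c > operands_ready_cycle ? c : operands_ready_cycle;
}
```

In this model an FP op renamed before a slow control write still issues as soon as its operands are ready, an op renamed after it waits only for that write's value, and a second control write whose value arrives early lets its dependents issue ahead of the ops still waiting on the first. Rename itself never stalls. The model ignores tag reclamation and the extra wakeup port per FP op, which is where the complexity and power cost would show up.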

Wilco