Cracking is not free

By: anon.1, December 25, 2018 8:14 pm
Room: Moderated Discussions
Travis Downs on December 25, 2018 3:41 pm wrote:
> Wilco on December 22, 2018 10:24 am wrote:
> > > The result of the load doesn't need to participate in renaming
> > > in the same way because it is not architecturally
> > > visible and doesn't persist beyond the instruction. It just needs to get from the load to the single op
> > > that will ever consume it, which is a different problem and likely off the critical path.
> >
> > You could special case it since no other instruction needs to read the renamed register.
> > But other than that it's like any other destination that needs to be renamed.
> I don't agree. In fact, it occurs to me that you don't even need the temporary register:
> both halves of the operation can use the same physical register: the load goes into the
> register, and the operation executes against it and updates it. There are no intermediate
> operations so nothing needs to see the state after the load but before the op.
> However it is implemented, it is fairly clear to me that in practice on x86
> (and I suspect other archs as well although I am less familiar) the lack of
> a second output register lets these pairs be renamed efficiently as a unit.
> The alternative view is that somehow the renamer is capable of renaming two-destination
> operations, but only in the special case of fused load ops and they somehow don't expand
> that functionality to other two-destination instructions. That seems ... unlikely.
> > As is the flags register.
> Yes, flags have to be renamed, but does it matter here?
> In any case, one strategy to rename flags, which I believe is used on modern Intel x86, is simply to
> extend each physical register to hold the set of flags bits in addition to the reg data, so any instruction
> which has a destination register (almost all of them) uses the same physical register to hold the flags
> which result from the operation as well. The renamer tracks in-order which flags map to which physical
> register (it's often more than one on x86 because of instructions which write to a subset of flags),
> and adds as input the correct reg(s) for any instruction which consumes flags.
> So using that method supporting flags isn't much harder than regular renaming
> on the write side (since you are already generally allocating a physical register
> for the destination), and acts like another input on the read side.
> > All instructions must be renamed either way. Early cracking means more micro ops to
> > rename, so throughput is simply lower. Now you could widen rename, but that has a
> > high cost. A cheaper approach is to add support for a 2nd destination register.
> I'm not really following. The cost to rename isn't counted in instructions, I don't think,
> it's largely counted in terms of operations, input registers and output registers. You can't
> get around renaming limits simply by fusing 10 instructions into one big operation with
> 10 outputs and then pretend your renamer treats that as a single instruction!
> Said another way, if you can build a 6-wide single destination renamer, I don't think that implies you
> can "easily" build a 6-wide double destination renamer on the same technology with the same people. Maybe
> you can build a 3, 4 or 5 wide one though. How that all pans out depends on fraction of actually fused
> operations: if you have an instruction set that inherently has lots of double destination instructions
> (e.g., auto-increment) then maybe you favor narrower crack-at-rename. If the majority of instructions are
> single destination (see x86), maybe you favor a wider single-destination renamer and early cracking.
> Out of curiosity, any idea what Apple is doing in the A-series?
> >
> > > Note that I agree with you that rename bandwidth is saved
> > > in the case of micro-fused/uncracked-until-after
> > > rename ops like load-op on x86 or call/store on ARM. It's only for 2-output
> > > ops like auto-increment this doesn't apply.
> >
> > A modern Arm core can execute 2 loads with auto-increment every cycle using just 2 rename slots.
> > The other slots are still free for other instructions, so yes it saves rename bandwidth.
> As above, I don't think this is how it works. It's not an apples to apples comparison to compare
> a late-cracking renamer capable of handling N multi-destination instructions, with an early
> cracking one. The latter is simpler so can be wider at the same design point. I'm not saying
> one is inherently better than the other, just that it's not obvious that you can cram a ton
> of complexity into one op and then build a renamer that supports this efficiently.

I agree with Travis here. There are two pieces to renaming: source renaming and destination renaming. Source renaming (mapping source arch regs to physical regs) is mostly a read-port limitation in the register alias table (not entirely, as we'll see below). This is somewhat "easy", since you can get more read ports by duplicating state if that's the bottleneck.

The harder problem is renaming destination arch regs to physical regs and forwarding the results within the same dispatch packet. In a single cycle, you need to pop N entries off the PRF freelist (N being the number of destinations being renamed), assign them to the destination arch registers, and bypass these assignments to later ops in the same dispatch group that source the same arch reg. For example, add r0, r1, r2; mul r3, r0, r0 needs to rename r0 and forward the new physical reg to both sources of the mul. That's a lot of bypassing, and it grows non-linearly with destination rename width: the dest of op 1 may be sourced by the 5 other ops in a 6-wide dispatch, and each op may take 2 or even 3 sources.

The op format doesn't really matter at that point; what matters is how many destinations you can rename and forward to how many sources. If you rename more destinations, you have to pop more entries off the freelist and forward more of them to sources. It's more useful to think of the problem in terms of the total number of dests and sources renamed per cycle. ARM's ldp does nothing to simplify this problem. ARM's fused operations do help (add with shift, 'complex' addressing modes), as do x86's complex addressing modes and ld-op, ld-op-st, etc. Whether these are cracked into separate ops before dispatch or after dispatch is an implementation detail, and I assume it is subject to careful tradeoff studies.
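To make the bookkeeping concrete, here is a rough Python sketch of renaming one dispatch group. It is a toy model, not a description of any real core: the names rename_group and worst_case_comparators, the dict-based alias table, and the physical register numbers are all made up for illustration, and the comparator count is only a first-order estimate that assumes every source is compared against every destination of every older op in the group.

```python
from collections import deque

def rename_group(ops, rat, freelist):
    """Rename one dispatch group in program order (toy model).

    ops:      list of (dests, srcs) tuples of architectural register names
    rat:      dict arch reg -> physical reg (alias table), updated in place
    freelist: deque of free physical register ids
    Returns (renamed_ops, bypass_count).
    """
    group_map = {}  # arch reg -> physical reg allocated earlier in this group
    renamed = []
    bypasses = 0
    for dests, srcs in ops:
        phys_srcs = []
        for s in srcs:
            if s in group_map:
                # Source was written by an older op in the same group:
                # the new mapping must be bypassed, not read from the table.
                phys_srcs.append(group_map[s])
                bypasses += 1
            else:
                phys_srcs.append(rat[s])
        phys_dests = []
        for d in dests:
            p = freelist.popleft()  # one freelist pop per destination
            group_map[d] = p
            phys_dests.append(p)
        renamed.append((phys_dests, phys_srcs))
    rat.update(group_map)  # commit the group's mappings to the alias table
    return renamed, bypasses

def worst_case_comparators(width, dests_per_op, srcs_per_op):
    """Assumed cost model: each source of op i is compared against
    every destination of ops 0..i-1 in the same group."""
    return sum(i * dests_per_op * srcs_per_op for i in range(width))

# The example from the post: add r0, r1, r2; mul r3, r0, r0
rat = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}
freelist = deque([32, 33])
group = [(["r0"], ["r1", "r2"]), (["r3"], ["r0", "r0"])]
renamed, bypasses = rename_group(group, rat, freelist)
print(renamed, bypasses)
print(worst_case_comparators(6, 1, 2), worst_case_comparators(6, 2, 2))
```

Under this model both sources of the mul pick up the physical register just allocated for r0, and a 6-wide group with 2 sources per op needs 30 comparisons with one destination per op versus 60 with two. Real renamers differ in many ways, but the count depends on the total dests and sources in the group, not on how they are packaged into instructions.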