By: anon.1 (abc.delete@this.def.com), December 25, 2018 8:14 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on December 25, 2018 3:41 pm wrote:
> Wilco (Wilco.dijkstra.delete@this.ntlworld.com) on December 22, 2018 10:24 am wrote:
> > > The result of the load doesn't need to participate in renaming
> > > in the same way because it is not architecturally
> > > visible and doesn't persist beyond the instruction. It just needs to get from the load to the single op
> > > that will ever consume it, which is a different problem and likely off the critical path.
> >
> > You could special case it since no other instruction needs to read the renamed register.
> > But other than that it's like any other destination that needs to be renamed.
>
> I don't agree. In fact, it occurs to me that you don't even need the temporary register:
> both halves of the operation can use the same physical register: the load goes into the
> register, and the operation executes against it and updates it. There are no intermediate
> operations so nothing needs to see the state after the load but before the op.
>
> However it is implemented, it is fairly clear to me that in practice on x86
> (and I suspect other archs as well although I am less familiar) the lack of
> a second output register lets these pairs be renamed efficiently as a unit.
>
>
> The alternative view is that somehow the renamer is capable of renaming two-destination
> operations, but only in the special case of fused load ops and they somehow don't expand
> that functionality to other two-destination instructions. That seems ... unlikely.
>
> > As is the flags register.
>
> Yes, flags have to be renamed, but does it matter here?
>
> In any case, one strategy to rename flags, which I believe is used on modern Intel x86, is simply to
> extend each physical register to hold the set of flags bits in addition to the reg data, so any instruction
> which has a destination register (almost all of them) uses the same physical register to hold the flags
> which result from the operation as well. The renamer tracks in-order which flags map to which physical
> register (it's often more than one on x86 because of instructions which write to a subset of flags),
> and adds as input the correct reg(s) for any instruction which consumes flags.
>
> So using that method supporting flags isn't much harder than regular renaming
> on the write side (since you are already generally allocating a physical register
> for the destination), and acts like another input on the read side.
>
>
> > All instructions must be renamed either way. Early cracking means more micro ops to
> > rename, so throughput is simply lower. Now you could widen rename, but that has a
> > high cost. A cheaper approach is to add support for a 2nd destination register.
>
> I'm not really following. The cost to rename isn't counted in instructions, I don't think,
> it's largely counted in terms of operations, input registers and output registers. You can't
> get around renaming limits simply by fusing 10 instructions into one big operation with
> 10 outputs and then pretend your renamer treats that as a single instruction!
>
> Said another way, if you can build a 6-wide single destination renamer, I don't think that implies you
> can "easily" build a 6-wide double destination renamer on the same technology with the same people. Maybe
> you can build a 3, 4 or 5 wide one though. How that all pans out depends on fraction of actually fused
> operations: if you have an instruction set that inherently has lots of double destination instructions
> (e.g., auto-increment) then maybe you favor narrower crack-at-rename. If the majority of instructions are
> single destination (see x86), maybe you favor a wider single-destination renamer and early cracking.
>
> Out of curiosity, any idea what Apple is doing in the A-series?
>
> >
> > > Note that I agree with you that rename bandwidth is saved
> > > in the case of micro-fused/uncracked-until-after
> > > rename ops like load-op on x86 or call/store on ARM. It's only for 2-output
> > > ops like auto-increment this doesn't apply.
> >
> > A modern Arm core can execute 2 loads with auto-increment every cycle using just 2 rename slots.
> > The other slots are still free for other instructions, so yes it saves rename bandwidth.
>
>
> As above, I don't think this is how it works. It's not an apples to apples comparison to compare
> a late-cracking renamer capable of handling N multi-destination instructions, with an early
> cracking one. The latter is simpler so can be wider at the same design point. I'm not saying
> one is inherently better than the other, just that it's not obvious that you can cram a ton
> of complexity into one op and then build a renamer that supports this efficiently.
I agree with Travis here. There are two pieces to renaming: source renaming and destination renaming. Source renaming (mapping source arch regs to physical regs) is mostly a read-port limitation on the alias table (not entirely, as we'll see below). This is somewhat "easy", since you can get more read ports by duplicating state if that's the bottleneck.

The harder problem is renaming destination arch regs to physical regs and forwarding the results within the same dispatch packet. In a single cycle, you need to pop N entries off the PRF freelist (N being the number of destinations being renamed), assign them to destination arch registers, and bypass these assignments to any ops in the same dispatch group that source the same arch reg. For example, add r0, r1, r2; mul r3, r0, r0 needs to rename r0 and forward the new physical reg to both sources of the mul. That's a lot of bypassing, and it grows non-linearly with destination rename width: in a 6-wide dispatch, the dest of op 1 may be sourced by 5 other ops, and each op may take 2 or even 3 sources.

The op format doesn't really matter at that point; what matters is how many destinations you can rename and forward to how many sources. If you rename more destinations, you have to pop more entries off the freelist and forward more of them to sources. It's more useful to think of the problem in terms of the total number of dests and sources renamed per cycle.

ARM's ldp does nothing to simplify this problem. ARM's fused arithmetic instructions do help (add with shift, 'complex' addressing modes), as do x86's complex addressing modes and ld-op, ld-op-st, etc. Whether these are cracked before dispatch or post-dispatch is an implementation detail, and I assume it is subject to careful tradeoff studies.
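To make the destination-rename-plus-bypass step concrete, here is a minimal Python sketch of renaming one dispatch group. It is a sequential software model of what hardware would do in parallel, not a description of any real core. The names `rename_group` and `bypass_comparators` are hypothetical, as are the dict-based alias table and the physical register numbers. The comparator count is only a crude first-order proxy for bypass cost (each source compared against every destination of an older op in the same group) and ignores everything else a real renamer has to do.

```python
from collections import deque

def rename_group(ops, rat, freelist):
    """Rename one dispatch group in program order.

    ops:      list of (dests, srcs), each a list of arch register names
    rat:      dict arch reg -> physical reg (alias table), updated in place
    freelist: deque of free physical register numbers
    Returns (renamed ops, number of sources satisfied by intra-group bypass).
    """
    group_map = {}   # arch reg -> physical reg allocated earlier in this group
    renamed = []
    bypasses = 0
    for dests, srcs in ops:
        phys_srcs = []
        for s in srcs:
            if s in group_map:
                # Source was written by an older op in the same group:
                # the fresh mapping must be forwarded, not read from the table.
                phys_srcs.append(group_map[s])
                bypasses += 1
            else:
                phys_srcs.append(rat[s])
        phys_dests = []
        for d in dests:
            p = freelist.popleft()   # one freelist pop per destination
            group_map[d] = p
            phys_dests.append(p)
        renamed.append((phys_dests, phys_srcs))
    rat.update(group_map)            # commit the group's mappings to the table
    return renamed, bypasses

def bypass_comparators(ops):
    """Count source-vs-older-destination comparisons within one group."""
    total = 0
    dests_before = 0
    for dests, srcs in ops:
        total += len(srcs) * dests_before
        dests_before += len(dests)
    return total

# add r0, r1, r2 ; mul r3, r0, r0  (physical register numbers are made up)
rat = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}
group = [(["r0"], ["r1", "r2"]), (["r3"], ["r0", "r0"])]
renamed, n = rename_group(group, rat, deque([32, 33]))
print(renamed, n)   # [([32], [1, 2]), ([33], [32, 32])] 2

# Same width, one vs. two destinations per op, two sources each
print(bypass_comparators([(["d"], ["a", "b"])] * 6))        # 30
print(bypass_comparators([(["d", "e"], ["a", "b"])] * 6))   # 60
```

Under this toy model, going from one to two destinations per op at the same 6-wide group doubles both the freelist pops and the comparisons, which is the sense in which the total number of dests and sources per cycle matters more than the op count.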