By: Exophase (exophase.delete@this.gmail.com), April 26, 2012 4:59 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 4/26/12 wrote:
---------------------------
>I think the first question to ask is whether it matters. In many respects, this
>arrangement seems to mirror the design of the K8. Each 'lane' has an AGU and ALU,
>to handle load+op. Forwarding flows out of the lane from the ALU.
>
>Looking from a SW perspective, what is the case that AGU0-->AGU1 or AGU0-->ALU1 handles?
>
>If you fire off a load from AGU0, you get the result in a register. If some other
>ALU/AGU needs that, it can probably get the result from the L1D forwarding or from the register file directly.
>
Yes, I was just looking at load to load or multiple ALU accesses off of the same load. I took "bypass" here to mean register forwarding (i.e., bypassing the register file) but with the result still ending up in a register. Should I have taken it to refer to intermediate results only, like those from load + op?
>Hrmmm, why? Intel's AGUs do nothing but address generation IIRC.
>
>DK
Would appear so. Somehow I was under the impression that ports 2 through 4 could handle something outside of loads and stores, but I was mistaken. I suppose there's less need when they have three other ports. It does appear that AMD is at least trying to use the AGLUs for something outside of address generation. Using inc/dec in the post-adjust sense threw a lot of people off but makes much more sense. BTW, does this mean that the pop instruction only issues to the AGUs?
I think I've been thrown off by some misinformation here and have only been gradually getting it straight. Months before BD was released, JF-AMD made the claim that K10 had 3 execution ports, each of which can do either an AGU or an ALU op, while BD can do 2x ALU + 2x AGU simultaneously and therefore had higher peak throughput. I was pretty sure this was nonsense (i.e., that K8/K10 could in fact do all 6 + 3 FPU ops), but I figured that the AGUs would at least be somewhat less coupled to the ALUs.
The other misleading part is the rhetoric about how "that third ALU" was almost never used. In reality, the EX ports on BD are issued to by everything but loads. I'm sure that ALU frequency analysis involves actual ALU operations only, while the EX units will face additional contention from branches and stores, which comprise a large number of instructions.
It may be the same execution unit pair arrangement as K8, but missing that third part is pretty costly (minus the useless AGU).
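
To put rough shape on the contention argument, here is a quick back-of-the-envelope sketch. The instruction mix is entirely made up for illustration (not measured from any workload), the function name `ex_port_pressure` is hypothetical, and the model is deliberately crude: every instruction is exactly one kind, and anything that isn't a pure load takes one EX port slot.

```python
def ex_port_pressure(mix, ex_ports):
    """Return (fraction of instructions needing an EX port,
    minimum cycles per instruction imposed by the EX ports alone).

    Assumption: everything except a pure load occupies one EX port slot.
    """
    ex_fraction = sum(frac for kind, frac in mix.items() if kind != "load")
    return ex_fraction, ex_fraction / ex_ports


# Hypothetical instruction mix, fractions summing to 1.0. Not real data.
mix = {"alu": 0.45, "load": 0.25, "store": 0.10, "branch": 0.20}

alu_only = mix["alu"]  # what an "ALU frequency" count would see
ex_fraction, cycles_per_instr = ex_port_pressure(mix, ex_ports=2)

print(f"ALU ops only:        {alu_only:.2f} per instruction")
print(f"EX port demand:      {ex_fraction:.2f} per instruction")
print(f"EX-bound cycles/ins: {cycles_per_instr:.3f} with 2 EX ports")
```

The point is only that counting ALU ops understates the load on the EX ports once branches and stores are included; the actual gap depends on the real mix.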