By: David Kanter (dkanter.delete@this.realworldtech.com), April 26, 2012 4:03 pm
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 4/26/12 wrote:
---------------------------
>David Kanter (dkanter@realworldtech.com) on 4/25/12 wrote:
>---------------------------
>>Just a few quick notes from the Bulldozer ISSCC presentations:
>>
>>1. AGUs do INC/DEC for PUSH/POP only
>>2. Single cycle data and flag bypass from any of the 4 units
>>3. Replicated PRF for 8 reads and 4 writes (each unit can write 1 result)
>>4. ALUs bypass to all ALUs and AGUs
>>5. AGU0 bypasses to ALU0 and AGU1 to ALU1. AGUs do not bypass to each other or to the other ALU.
>>6. Only ALUs produce and consume flags
>>7. INC was replicated to maintain single cycle latency on certain paths
>>8. The L2 cache array has a 6-cycle internal pipeline. I suspect that means the
>>remaining cycles are for L1D check (4 cycles) and then 10 cycles for sending the
>>request, returning the data, checking the L2 tags, etc.
>>
>>Hopefully this sheds a bit more light on Bulldozer.
>>
>>
>>David
>>
>>
>
>Thank you, this is very helpful information. In this case the person who claimed
>to have tested inc/dec + alu throughput must have been mistaken. Do you know if
>the AGUs handle both 32-bit and 64-bit push/pop stack adjustment?
I don't *know*. But I don't see why you wouldn't do 64-bit...you have the hardware for it.
>I wonder how much the lack of bypassing from AGU to the opposite ALU and AGU hurts
>performance. What kind of penalty does this incur; does latency increase entirely
>until a PRF writeback stage or does it take a longer bypass path (i.e., AG0->EX0->EX1->AG1 to cover the entire distance)?
I think the first question to ask is whether it matters. In many respects, this arrangement seems to mirror the design of the K8. Each 'lane' has an AGU and ALU, to handle load+op. Forwarding flows out of the lane from the ALU.
Looking from a SW perspective, what is the case that AGU0-->AGU1 or AGU0-->ALU1 handles?
If you fire off a load from AGU0, you get the result in a register. If some other ALU/AGU needs that, it can probably get the result from the L1D forwarding or from the register file directly.
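To make the question concrete, points 4 and 5 of the notes can be written out as a small connectivity table. This is only a sketch in Python: the names `UNITS`, `BYPASS`, `has_bypass`, and `missing_paths` are made up for illustration, and it assumes the notes list every forwarding path. The notes do not say whether an AGU forwards to itself, so same-unit pairs are skipped rather than reported.

```python
# Execution units in one Bulldozer integer core, named as in the thread.
UNITS = ["ALU0", "ALU1", "AGU0", "AGU1"]

# producer -> set of consumers it can bypass to, per points 4 and 5.
# Assumption: the quoted notes are a complete list of bypass paths.
BYPASS = {
    "ALU0": {"ALU0", "ALU1", "AGU0", "AGU1"},  # point 4: ALUs bypass to all ALUs and AGUs
    "ALU1": {"ALU0", "ALU1", "AGU0", "AGU1"},
    "AGU0": {"ALU0"},                          # point 5: AGU0 bypasses to ALU0 only
    "AGU1": {"ALU1"},                          # point 5: AGU1 bypasses to ALU1 only
}


def has_bypass(producer, consumer):
    """Return True if the table lists a bypass path from producer to consumer."""
    return consumer in BYPASS[producer]


def missing_paths():
    """List cross-unit (producer, consumer) pairs with no listed bypass path.

    Same-unit pairs are skipped because the notes do not state whether an
    AGU can forward to itself.
    """
    return [
        (p, c)
        for p in UNITS
        for c in UNITS
        if p != c and not has_bypass(p, c)
    ]


if __name__ == "__main__":
    for producer, consumer in missing_paths():
        print(f"{producer} --> {consumer}: no bypass listed")
```

Under those assumptions the only cross-unit gaps are AGU0-->ALU1, AGU0-->AGU1, AGU1-->ALU0, and AGU1-->AGU0, which are exactly the cases in question.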
>Do you have any opinion on the claims that the SOG makes regarding Piledriver,
>that it adds mov, xchg, bextr, and xadd capability to the AGUs? mov/xchg make sense,
>but xadd/bextr modify the flags. Whether this means the AGUs can now modify flags
>or that flag modification can be suppressed like Linus was describing, it doesn't
>make sense that these obscure instructions would be supported while the more common ones wouldn't be.
No idea.
>Still seems like the AG units having full access to the PRF but only being able
>to do the operations they can do is kind of a waste.
Hrmmm, why? Intel's AGUs do nothing but address generation IIRC.
DK