By: Axel (no.delete@this.spam.org), January 20, 2011 2:18 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 1/20/11 wrote:
---------------------------
>>AVX != AVX
>>Bulldozer's problem (if it is a problem at all) is only AVX code with 256bit.
>>However, this is only a fraction of all instructions.
>
>Yes, that's correct.
>
>>Biggest advantage of AVX is that it supplies 3operand instruction formats for *all*
>>previous SSE1,2,3,4.1/.2/AES instructions.
>
>I'd also add the microarchitecture for the loads and stores is significant.
Hmm, yes, but that is Sandy Bridge's microarchitecture, as you correctly stated; it is not due to the AVX x86 ISA extension.
>There are two issues. While the overall program may only have a small number of
>256-bit operations, they may be issued closely together in time and thus stall/bottleneck
>on the FPU. Worse yet, both threads could end up contributing 256-bit instructions.
>
>Equally important, the load datapath to the FPU is 2x128-bit wide, which is definitely
>a bottleneck for two cores to use.
Ahh, yes, thanks, I forgot the L2. I first thought that the two L1D$ would be sufficient for 2x2x128-bit loads, but they are only 16 kB each. Thus most of the traffic will go to the L2$, and the interface to the L2 won't be 512 bits wide.
Furthermore, I wonder whether a 256-bit AVX instruction from core 0 could use the L1D$ of core 1 at all. Do you know if something like that is possible?
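To put rough numbers on the load-path point, here is a back-of-the-envelope sketch in Python. The 2x128-bit FPU load datapath and the 2x2x128-bit figure come from the posts above; reading the latter as "two cores, two 128-bit loads per core per cycle" is my assumption, and the variable names are just for illustration.

```python
# Rough per-cycle load bandwidth for one two-core module.
# Figures are from the discussion above; the per-core load rate is an assumption.

LOAD_WIDTH_BITS = 128   # width of a single load path
FPU_LOAD_PATHS = 2      # shared FPU load datapath: 2 x 128 bit (from the quoted post)
CORES = 2               # cores sharing the FPU
LOADS_PER_CORE = 2      # assumption: each core could issue two 128-bit loads per cycle


def bits_per_cycle(paths, width_bits):
    """Total bits that `paths` load paths of `width_bits` each can move per cycle."""
    return paths * width_bits


supply = bits_per_cycle(FPU_LOAD_PATHS, LOAD_WIDTH_BITS)          # what the FPU can accept
demand = bits_per_cycle(CORES * LOADS_PER_CORE, LOAD_WIDTH_BITS)  # what both cores could ask for
avx256_loads_per_cycle = supply // 256                            # full 256-bit loads per cycle

print("FPU load datapath:", supply, "bits/cycle")
print("Peak demand from both cores:", demand, "bits/cycle")
print("256-bit loads the datapath can feed per cycle:", avx256_loads_per_cycle)
```

Under these assumptions the shared datapath covers half of the combined peak demand, which is the bottleneck David describes; it says nothing about how often real code actually reaches that peak.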
regards,
Axel