By: hcl64 (mario.smarq.delete@this.gmail.com), April 30, 2012 7:24 pm
Room: Moderated Discussions
David Kanter (dkanter@realworldtech.com) on 4/30/12 wrote:
---------------------------
>
>>Also in that light SMT can be as efficiently implemented in a narrower
>>core, cause is not for heavy computations. Also if memory access will be more important
>>"a run-ahead scheme", that is, continue executing speculatively based on predicted
>>and stored values and addresses, after a L1 or L2 miss(cache miss mitigation), seems
>>better than SMT or spMT and so efficiently done at a narrower core also.
>
>Run-ahead is a much simpler version of out-of-order. Judging by the results of
>the POWER6, it's not a particularly attractive design choice. IBM went back to OOOE pretty quickly.
>
Yes, it is somewhat related to OOOE, but what I had in mind is based more on the work of James David Dundas. It is more of a dynamic pre-processing of the code stream: upon an L1 or L2 *data* miss (L1, L2 or L3, depending on how aggressive you want to be), the processor *doesn't stall waiting* but continues executing instructions based on predicted, speculative *data* addresses and values, without waiting for the correct addresses and values to arrive or for dependencies to be resolved (it can take a very lax approach to dependencies).
Contrary to forms of spMT or the Scout threads of the Rock chip, this doesn't incur the penalty of a context switch (OTOH there are limits on how many instructions can be processed this way: usually fewer than a hundred per run, depending on the size of the additional buffers etc.).
I call it "pre-processing" because many instructions can be wrongly speculated and must be re-executed, but many can also be correctly speculated (the large majority on average, if the prediction mechanisms are very good), so it can be a very good pre-fetch and pre-execution method, warming all the caches and accelerating execution.
In this light, I think "run-ahead" and SMT can't happen on the same core at the same time... but there can be a switch between modes on the same core, in a conditional, subordinated way.
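To make the idea a bit more concrete, here is a toy sketch of why running ahead under a miss can pay off. It is only an illustration under loud assumptions, not Dundas's mechanism or any real core: the loads are independent, every run-ahead address is predicted correctly, and the lines touched during run-ahead arrive by the time the original miss resolves. The names `run_stream`, `MISS_LATENCY`, `HIT_LATENCY` and `RUNAHEAD_WINDOW` and all the numbers are made up for the example.

```python
# Toy cycle-count model of run-ahead under a data-cache miss.
# All constants are illustrative assumptions, not measurements.
MISS_LATENCY = 100      # cycles the core waits on a miss (assumed)
HIT_LATENCY = 1         # cycles for a cache hit (assumed)
RUNAHEAD_WINDOW = 64    # max loads pre-executed per miss (assumed, below 100)


def run_stream(addresses, runahead):
    """Return the total cycles needed to execute a stream of independent loads.

    With runahead=False the core simply pays the miss latency for each
    uncached address. With runahead=True it uses the miss shadow to
    pre-execute up to RUNAHEAD_WINDOW following loads; their results are
    thrown away, but the lines they touch end up in the cache.
    """
    cached = set()
    cycles = 0
    for i, addr in enumerate(addresses):
        if addr in cached:
            cycles += HIT_LATENCY
            continue
        # Demand miss: the real (non-speculative) load pays full latency.
        cycles += MISS_LATENCY
        cached.add(addr)
        if runahead:
            # Assumption: perfect address prediction, and the prefetched
            # lines overlap completely with the outstanding miss.
            window = addresses[i + 1 : i + 1 + RUNAHEAD_WINDOW]
            cached.update(window)
    return cycles


if __name__ == "__main__":
    stream = list(range(200))   # 200 loads, each to a distinct line
    print("stall on every miss:", run_stream(stream, runahead=False))
    print("with run-ahead:     ", run_stream(stream, runahead=True))
```

With wrong predictions the window would warm fewer lines, and the pre-executed instructions still have to be re-executed afterwards, so the gain in a real design would be smaller than this model suggests.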
>
>>I'm not implying about which is better, to me none is better at all things, not
>>even now... but i'm very curious about AMD approach alright, that is why i inquired
>>about fusing on-the-fly 2 Integer macro-ops into one XOP before
>>really curious if even Integer processing can be done in good extent at a co-processor
>>( FlexFPU *is* a co-processor).
>
>1. What do you mean by a coprocessor? To me the FPU is just an FPU shared by two
>cores. It's no more a coprocessor than the FPUs in Sandy Bridge are coprocessors.
>
Ummm... no, not like SNB at all.
Co-processor organization:
http://techreport.com/r.x/bulldozer-uarch/bulldozer-fpu.jpg
So *perhaps* (I don't know any details) that is one of the beauties of a *module*. Co-processors can sit not only on different PCBs, on different dies interposed in an MCM, or on the same die on different parts of an integrated xbar, but now also in much closer proximity...
>2. Speculatively fusing x86 instructions is challenging. What if they aren't adjacent
>in the code stream? Macro-fusion requires this, and it's common for CMP+JMP. But
>that's not necessarily true for integer adds.
>
>3. How do you handle load/store alignment?
>
>4. How do you handle exceptions that occur between the two instructions? You'd have to do a partial register rollback.
>
>5. How many x86 integer instructions have been extended with XOP? It's a relatively
>small number (add, multiply-add, compare).
>
>Honestly, I think it would be more productive to try and speculatively fuse FP
>MUL and ADD, if you had an FMA unit with intermediate rounding.
>
Good questions, which perhaps you could pose to an AMD representative if you have the chance, Mr Kanter.
If I'm not mistaken, the majority of XOP is integer (3-operand); the total pack has more than one hundred instructions defined, I think.
Perhaps integer fusing could also be a good thing.
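On the adjacency issue from question 2, a toy sketch of what a fusing decoder has to check may help. This is only an illustration working on mnemonic strings, not how any real decoder is built; `fuse_adjacent`, `is_cond_jump` and `FUSIBLE_FIRST` are hypothetical names. It fuses a compare or test only when a conditional jump follows it immediately, which is the case Mr Kanter describes as common.

```python
# Toy model of adjacency-based macro-fusion on a list of mnemonics.
# Hypothetical helper names; real decoders have many more restrictions.
FUSIBLE_FIRST = {"cmp", "test"}


def is_cond_jump(op):
    """True for conditional jumps such as jne/jz, false for plain jmp."""
    return op.startswith("j") and op != "jmp"


def fuse_adjacent(ops):
    """Fuse cmp/test with an immediately following conditional jump."""
    fused = []
    i = 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i] in FUSIBLE_FIRST
                and is_cond_jump(ops[i + 1])):
            # The pair is adjacent, so it can issue as one macro-op.
            fused.append(ops[i] + "+" + ops[i + 1])
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


if __name__ == "__main__":
    print(fuse_adjacent(["cmp", "jne", "add", "add"]))
    print(fuse_adjacent(["cmp", "mov", "jne"]))
```

In the second stream the `mov` sits between the compare and the jump, so nothing is fused. Two integer adds that could be packed into one XOP-style op would often be separated in the same way, and a scheme like this one would miss them.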