By: hcl64 (mario.smarq.delete@this.gmail.com), April 28, 2012 3:23 pm
Room: Moderated Discussions
anon (anon@anon.com) on 4/28/12 wrote:
---------------------------
>hcl64 (mario.smarq@gmail.com) on 4/28/12 wrote:
>---------------------------
>>Paul A. Clayton (paaronclayton@gmail.com) on 4/20/12 wrote:
>>---------------------------
>>>Kira (kirsc@aeterna.ru) on 4/20/12 wrote:
>>>---------------------------
>>>[snip]
>>>>What was the purpose of using a shared decoder even
>>>>supposed to be? Is the size/power overhead of a pair of
>>>>4-wide decoders really that large in a modern
>>>>desktop/server CPU?
>>>>
>>
>>If i understand this correctly, the *decoder* is NOT shared in the sense that
>>it only crunches from 1 thread at a time.
>>
>>http://www.realworldtech.com/forums/index.cfm?action=detail&id=128835&threadid=128602&roomid=2
>>
>>I believe it uses a scheme of interleaving multithreading(1 inst from each thread
>>but in consecutive cycles) mix with block or switch-on-event multithreading(several
>>insts from one thread before switching to the other). Not in any occasion is it executing from more than 1 thread.
>
>The decoder is shared. There is no nitpicking of semantics that will allow you
>to say decoder is not shared. And definitely not, within context of the thread you are replying to.
>
What I was trying to say is that it is not SMT in any case... and if I read and heard correctly, on no occasion does the decoder work on more than 1 thread at a time.
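
To make the distinction concrete, here is a toy sketch of the scheme I described: the decoder alternates between the two threads each cycle (interleaving) and hands the slot to the other thread when the preferred one has nothing to decode (switch-on-event). The function name, the `stalled` input and the arbitration rule are my own assumptions for illustration, not AMD's documented policy. The only point is that each cycle has at most one owner, which is what separates this from SMT.

```python
# Toy model only: NOT the documented Bulldozer arbitration policy.
def decode_schedule(cycles, stalled=frozenset()):
    """Return the thread (0 or 1) that owns the shared decoder in each cycle.

    stalled: set of (cycle, thread) pairs in which that thread has nothing
    to decode (hypothetical input, e.g. an instruction fetch miss).
    """
    schedule = []
    turn = 0  # thread preferred in the current cycle
    for cycle in range(cycles):
        preferred, other = turn, 1 - turn
        if (cycle, preferred) not in stalled:
            schedule.append(preferred)   # plain interleaving
        elif (cycle, other) not in stalled:
            schedule.append(other)       # switch-on-event: other thread takes the slot
        else:
            schedule.append(None)        # both stalled, decoder idles
        turn = 1 - turn
    return schedule


if __name__ == "__main__":
    # Thread 1 stalls from cycle 1 onwards, so thread 0 gets a block of cycles.
    print(decode_schedule(4, stalled={(1, 1), (2, 1), (3, 1)}))
```

The schedule holds a single thread id per cycle by construction, so the decoder is shared over time but never works on two threads in the same cycle.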
>The answer is yes: fast, wide, low latency x86 decoders are power hungry. Intel
>has been trying to reduce / remove the decoder from the critical paths since the
>Pentium 4, with significant additional complexity. Also AMD and Intel both share
>decoders among threads/cores. So we have empirical evidence.
>
>>
>>>>Perhaps a single beefy 4-issue or 6-issue core with SMT
>>>>would have been a smarter move.
>>>
>>
>>A 6 wide issue processor for x86 is simply a pipe dream... until there will be
>>ways to considerably break the "strong dependency model" of x86 it will be out of
>>reach. 4 may be already too much (BD is a false 4 wide issue), since even the strongest
>>Intel u-arch doesn't pass in average the 2 IPC(instructions per clock)...
>
>There is a grain of truth to that, but "average IPC" is largely parroted for the wrong reasons.
>
>When there *is* parallelism, you want to take advantage of it. If you can execute
>33% of the time at 2 IPC, and 67% of the time at 0.5 IPC, then you're averaging
>1 IPC. But it does not mean your decoders are a waste of space.
>
Nor did I imply that. Actually, to run 2 threads well on the same front end, be it with Intel's SMT or AMD's scheme, I think a 5th pipe would be welcome. They haven't done it yet, perhaps because it might be very hard to accomplish efficiently.
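
For reference, the arithmetic in the quoted example is just a time-weighted mean. A minimal sketch (the function name and the list-of-phases layout are mine, the numbers are the ones quoted above):

```python
def average_ipc(phases):
    """Time-weighted mean IPC.

    phases: list of (fraction_of_time, ipc) pairs.
    """
    total_time = sum(fraction for fraction, _ in phases)
    return sum(fraction * ipc for fraction, ipc in phases) / total_time


# 33% of the time at 2 IPC, 67% of the time at 0.5 IPC
phases = [(0.33, 2.0), (0.67, 0.5)]
avg = average_ipc(phases)                 # 0.66 + 0.335, i.e. roughly 1 IPC
peak = max(ipc for _, ipc in phases)      # the burst rate the front end must sustain

if __name__ == "__main__":
    print(f"average IPC ~ {avg:.3f}, peak IPC = {peak}")
```

The average comes out near 1 while the peak phase still needs 2 per clock, so the average alone does not tell you how wide the front end has to be.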
>Nobody should use the "average IPC" statistic without deeply understanding what they are talking about.
>
>> makes
>>wonder under the law of exponential diminishing returns if even a *3 wide issue*
>>like NH/SNB/IB makes sense(perhaps that is why Intel have
>
>Core2, NH, WM, SNB, and IB are 4-wide issue.
>
>>SIMD and FP to the same ports of INT... and SMT)
>>http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6
>>
>>And no NH/SNB/IB are NOT true 6 wide issue u-archs, it only dispatches 4 uops cycle
>>so 4 is the theoretical max sustainable but under certain
>
>Ah, you are using IBM terminology of dispatch to back end, and issue to execution units. Fair enough.
>
:)
>I guess people try to claim they are wider than they are because of instruction
>fusion and such, but of course that does not change the actual width, only perhaps the effective width.
>
>>conditions, because it
>>only has 3 uop exec ports and there are considerable dependencies to attend(its always much much less average).
>
>Issue/dispatch width has nothing to do with what the microarchitecture can execute
>on average, of course (as I said above).
>
>>http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5
>>
>>
>>>As discussed here earlier, the motive seems to have been
>>>to allow substantial sharing between threads in a high
>>>frequency design without the data cache issues that the
>>>early Pentium4 SMT suffered.
>>
>>As above the only things that shares threads in BD are the FlexFPU and the L2..
>
>Are we looking at the same Bulldozer? BD shares the entire front end, L1I, branch
>predictors, ITLBs, fetch, decode and issuing logic. As well as FPU and L2.
>
Well, it is a matter of semantics to establish what is SMT and what is not. Actually, when it comes to BD it is very hard to pinpoint a specific multithreading scheme for the shared parts. BD uses almost all the multithreading schemes I can think of (SMP, SMT, CMT, interleaving, block, switch-on-event).
>>not even a remote resemblance with P4, more so because BD pipeline length is only
>>15 stages, and the first 2 or 3 stages are decoupled and can run ahead, so like
>>SNB L0, it has the propriety of pipeline stage compression, in this case to 13 or 12.
>
>How does this pipeline stage compression work?
>
The same way as with the L0, but from the IBBs. If fetch is decoupled and can run ahead, chances are that when an instruction is needed it is already in the IBB.
The L0 has the advantage of being much larger, of being a cache, and of sitting closer to the execution pipes in the pipeline, so its stage compression (or contraction) effect is larger and lasts longer in terms of consecutive clock cycles.
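
A back-of-the-envelope way to picture it, as a sketch under my own simplifying assumption that the hidden stages scale linearly with how often the needed instruction is already waiting in the buffer (the function and parameter names are made up for illustration):

```python
def effective_depth(nominal_stages, decoupled_stages, buffer_hit_rate):
    """Average pipeline depth seen by an instruction.

    Assumption (mine): when the instruction is already in the buffer
    (IBB or L0), the decoupled front-end stages are fully hidden;
    otherwise the full nominal depth is paid.
    """
    hidden = decoupled_stages * buffer_hit_rate
    return nominal_stages - hidden


if __name__ == "__main__":
    # 15 nominal stages with 2 or 3 decoupled fetch stages, as stated above
    for decoupled in (2, 3):
        for hit_rate in (0.0, 0.5, 1.0):
            print(decoupled, hit_rate, effective_depth(15, decoupled, hit_rate))
```

With a hit rate of 1.0 this gives the 13 or 12 stages mentioned above, and with a hit rate of 0.0 it falls back to the full 15.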