By: anon (anon.delete@this.anon.com), April 28, 2012 6:19 pm
Room: Moderated Discussions
hcl64 (mario.smarq@gmail.com) on 4/28/12 wrote:
---------------------------
>anon (anon@anon.com) on 4/28/12 wrote:
>---------------------------
>>hcl64 (mario.smarq@gmail.com) on 4/28/12 wrote:
>>---------------------------
>>>Paul A. Clayton (paaronclayton@gmail.com) on 4/20/12 wrote:
>>>---------------------------
>>>>Kira (kirsc@aeterna.ru) on 4/20/12 wrote:
>>>>---------------------------
>>>>[snip]
>>>>>What was the purpose of using a shared decoder even
>>>>>supposed to be? Is the size/power overhead of a pair of
>>>>>4-wide decoders really that large in a modern
>>>>>desktop/server CPU?
>>>>>
>>>
>>>If i understand this correctly, the *decoder* is NOT shared in the sense that
>>>it only crunches from 1 thread at a time.
>>>
>>>http://www.realworldtech.com/forums/index.cfm?action=detail&id=128835&threadid=128602&roomid=2
>>>
>>>I believe it uses a scheme of interleaving multithreading(1 inst from each thread
>>>but in consecutive cycles) mix with block or switch-on-event multithreading(several
>>>insts from one thread before switching to the other). Not in any occasion is it executing from more than 1 thread.
>>
>>The decoder is shared. There is no nitpicking of semantics that will allow you
>>to say decoder is not shared. And definitely not, within context of the thread you are replying to.
>>
>
>What I was trying to say is that it is not SMT in any case... and if I read and hear
>correctly, on no occasion does the decoder execute from more than 1 thread at a time.
It is shared, of course. Saying it is not shared because it is not SMT confuses two different things: whether a resource is shared, and how it is shared.
>
>>The answer is yes: fast, wide, low latency x86 decoders are power hungry. Intel
>>has been trying to reduce / remove the decoder from the critical paths since the
>>Pentium 4, with significant additional complexity. Also AMD and Intel both share
>>decoders among threads/cores. So we have empirical evidence.
>>
>>>
>>>>>Perhaps a single beefy 4-issue or 6-issue core with SMT
>>>>>would have been a smarter move.
>>>>
>>>
>>>A 6 wide issue processor for x86 is simply a pipe dream... until there will be
>>>ways to considerably break the "strong dependency model" of x86 it will be out of
>>>reach. 4 may be already too much (BD is a false 4 wide issue), since even the strongest
>>>Intel u-arch doesn't pass in average the 2 IPC(instructions per clock)...
>>
>>There is a grain of truth to that, but "average IPC" is largely parroted for the wrong reasons.
>>
>>When there *is* parallelism, you want to take advantage of it. If you can execute
>>33% of the time at 2 IPC, and 67% of the time at 0.5 IPC, then you're averaging
>>1 IPC. But it does not mean your decoders are a waste of space.
>>
>
>Neither did I imply that. Actually, to run 2 threads well on the same front-end,
>be it on Intel SMT or the AMD scheme, I think a 5th pipe would be welcome. They haven't
>done it yet, perhaps because it might be very hard to accomplish efficiently.
You did not just imply that, you said it pretty explicitly:
"4 may be already too much (BD is a false 4 wide issue), since even the strongest Intel u-arch doesn't pass in average the 2 IPC(instructions per clock)..."
You are concluding that an N-wide part of the pipeline may be too much if the *average* IPC is < N.
To repeat: average IPC has virtually zero bearing on how wide a pipeline (any stage) ought to be.
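A minimal sketch of the arithmetic behind that point, using the 33% / 67% figures from the quoted example above. The function names and the width-capping model are assumptions made up for illustration, not measurements of any real core.

```python
def average_ipc(phases):
    """Time-weighted average IPC over (fraction_of_time, ipc) phases."""
    total_time = sum(frac for frac, _ in phases)
    return sum(frac * ipc for frac, ipc in phases) / total_time


def cycles_with_width_cap(phases, width):
    """Relative cycles needed for the same work if no phase may exceed `width` IPC.

    Hypothetical model: a phase that ran at ipc > width is simply slowed to `width`.
    """
    cycles = 0.0
    for frac, ipc in phases:
        work = frac * ipc                 # instructions completed in this phase
        cycles += work / min(ipc, width)  # time to redo that work under the cap
    return cycles


# 33% of the time at 2 IPC, 67% of the time at 0.5 IPC (figures from the post).
phases = [(0.33, 2.0), (0.67, 0.5)]

avg = average_ipc(phases)
slowdown = cycles_with_width_cap(phases, width=1) / cycles_with_width_cap(phases, width=2)

print(f"average IPC: {avg:.3f}")
print(f"relative run time with a 1-wide cap: {slowdown:.2f}x")
```

Under these assumptions the average comes out at roughly 1 IPC, yet capping the width at that average makes the same work take about a third longer, because the 2 IPC phase is the part that gets throttled.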
>
>>Nobody should use the "average IPC" statistic without deeply understanding what they are talking about.
>>
>>> makes
>>>wonder under the law of exponential diminishing returns if even a *3 wide issue*
>>>like NH/SNB/IB makes sense(perhaps that is why Intel have
>>
>>Core2, NH, WM, SNB, and IB are 4-wide issue.
>>
>>>SIMD and FP to the same ports of INT... and SMT)
>>>http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=6
>>>
>>>And no NH/SNB/IB are NOT true 6 wide issue u-archs, it only dispatches 4 uops cycle
>>>so 4 is the theoretical max sustainable but under certain
>>
>>Ah, you are using IBM terminology of dispatch to back end, and issue to execution units. Fair enough.
>>
>
>:)
>
>>I guess people try to claim they are wider than they are because of instruction
>>fusion and such, but of course that does not change the actual width, only perhaps the effective width.
>>
>>>conditions, because it
>>>only has 3 uop exec ports and there are considerable dependencies to attend(its always much much less average).
>>
>>Issue/dispatch width has nothing to do with what the microarchitecture can execute
>>on average, of course (as I said above).
>>
>>>http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=5
>>
>>>
>>>
>>>>As discussed here earlier, the motive seems to have been
>>>>to allow substantial sharing between threads in a high
>>>>frequency design without the data cache issues that the
>>>>early Pentium4 SMT suffered.
>>>
>>>As above the only things that shares threads in BD are the FlexFPU and the L2..
>>
>>Are we looking at the same Bulldozer? BD shares the entire front end, L1I, branch
>>predictors, ITLBs, fetch, decode and issuing logic. As well as FPU and L2.
>>
>
>Well, it's semantics to establish what is SMT and what is not... actually, when it comes
>to BD it's very hard to pinpoint a specific multithreading scheme for sharing stuff.
>BD uses almost all multithreading schemes I can think of (SMP, SMT, CMT, interleaving, block, switch-on-event)
It really is *not* semantics. It is all shared.
You can only claim to be arguing semantics if there is room to argue semantics. At some point, it is simply called incorrect.
The threads perhaps do not occupy the fetch and decode resources at the same time, but that does *not* equate to "it is not shared"!
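The difference between sharing a resource and occupying it simultaneously can be sketched with a toy model. This is an assumption-laden illustration only: the round-robin policy, the `shared_decoder` name and the instruction labels are hypothetical and are not Bulldozer's actual arbitration logic.

```python
from collections import deque


def shared_decoder(threads, width=4):
    """Toy decoder that serves exactly ONE thread per cycle, round-robin.

    Returns one (thread_id, decoded_instructions) entry per cycle.
    """
    queues = [deque(t) for t in threads]
    schedule = []
    turn = 0
    while any(queues):
        # Skip threads that have nothing left to decode.
        while not queues[turn]:
            turn = (turn + 1) % len(queues)
        q = queues[turn]
        decoded = [q.popleft() for _ in range(min(width, len(q)))]
        schedule.append((turn, decoded))
        turn = (turn + 1) % len(queues)
    return schedule


thread_a = ["a0", "a1", "a2", "a3", "a4", "a5"]
thread_b = ["b0", "b1"]

alone = shared_decoder([thread_a])              # thread A has the decoder to itself
shared = shared_decoder([thread_a, thread_b])   # A and B take turns on one decoder

for cycle, (tid, insts) in enumerate(shared):
    print(f"cycle {cycle}: thread {tid} decodes {insts}")
```

In this model no cycle ever mixes instructions from both threads, yet thread A needs three cycles to get through the decoder instead of the two it needs alone. The contention is what makes the resource shared, whether or not the threads are in it at the same moment.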



