By: EduardoS (no.delete@this.spam.com), April 26, 2012 6:03 pm
Room: Moderated Discussions
Exophase (exophase@gmail.com) on 4/26/12 wrote:
---------------------------
>The manual doesn't mention the role the AGUs play in mem + op at all, not even
>as a footnote. It also indicates that loads use the EX ports when they're supposed to be AG port only.
OK, AGUs calculate addresses and loads are listed as using EX units; the conclusion is quite obvious, don't you think?
>The obvious difference between K10 and BD is one less execution port. Does that
>not constitute as a significant difference to you?
Not in that context. The discussion isn't exactly about how many units BD has but about how they are arranged and how they work, and they seem to work exactly like in the previous uarch.
>Of course it's not like K10 was
>the performance leader (nor was K8 in the end) so remaining the same wouldn't exactly
>be anything to brag about.
Maybe this was the marketing reason for not saying how the units would work, a poor reason IMHO.
>But BD manages to take a step back in some regards.
Yep.
>Still it'd seem to you that the execution arrangement per core is fine
If you had read my posts you would know that I think the number of units is too low and isn't even theoretically enough to match SB single-thread IPC, which is above 2 in many workloads, but you preferred to ignore the context and reach another conclusion.
BTW, if the AGUs were able to perform loads, and also a few simple instructions, without needing the ALUs, the execution width would be enough and Bulldozer would exceed Greyhound in IPC. This independence of units was repeated a lot by John Fruehe and many others, with some claiming BD could sustain four instructions per clock on a single thread; more and more, those claims look like myths.
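To put rough numbers on that, here is a toy port-count bound in Python. It is only a sketch of the argument above, not a model of the real scheduler: the function name `peak_ipc`, the 50% load mix, and the port and issue-width figures (2 EX + 2 AG and 4-wide for BD) are assumptions chosen for illustration, and dependencies, latencies, and the memory subsystem are ignored entirely.

```python
def peak_ipc(ex_ports, ag_ports, load_fraction, loads_need_ex, issue_width=4):
    """Upper bound on sustained IPC from port counts alone (toy model).

    load_fraction: share of instructions that are loads (assumed mix).
    loads_need_ex: True if a load also occupies an EX port, False if the
                   AGU can complete it on its own.
    """
    bounds = [issue_width]

    # Share of instructions that occupy an EX port.
    ex_demand = 1.0 if loads_need_ex else 1.0 - load_fraction
    if ex_demand > 0:
        bounds.append(ex_ports / ex_demand)

    # Every load occupies an AG port in both scenarios.
    if load_fraction > 0:
        bounds.append(ag_ports / load_fraction)

    return min(bounds)


# Assumed BD core: 2 EX + 2 AG ports, half the instructions are loads.
coupled = peak_ipc(2, 2, 0.5, loads_need_ex=True)       # loads tie up an EX port
independent = peak_ipc(2, 2, 0.5, loads_need_ex=False)  # AGUs work alone
print(coupled, independent)
```

With loads tied to the EX ports the bound is the number of EX ports, whatever the mix; only with independent AGUs does a load-heavy mix get anywhere near four per clock.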
Also, independence between ALUs and AGUs would make the forwarding network more complex, reducing the clock speed; maybe even more complex and slower than the "oversized" 3 ALUs + 3 AGUs of the previous uarch.
And finally, compared to the previous uarch, BD's IPC only dropped a little, even increasing in some workloads (which happen to be memory-subsystem intensive; are you the annoying anon who requires every detail to be explicit in every sentence?), while execution resources dropped a lot and latencies increased a lot. One possible reason is a huge improvement in the memory subsystem.
>and everyone who disagrees is just testing it wrong
The only guy who did a detailed enough analysis agrees that the memory subsystem is fine. Everyone else saw little performance difference, made no effort to find bottlenecks, and concluded that "it must be because of L2 latency, since the average IPC is about one so it doesn't matter if you have two or three units"...
PS: How long has it been since I first heard "average IPC is about"? Maybe ten years?