By: juanrga (nospam.delete@this.juanrga.com), November 2, 2015 3:47 am
Room: Moderated Discussions
Poindexter (cherullo.delete@this.gmail.com) on November 1, 2015 9:25 am wrote:
> juanrga (nospam.delete@this.juanrga.com) on November 1, 2015 7:00 am wrote:
> > Poindexter (cherullo.delete@this.gmail.com) on October 31, 2015 2:47 pm wrote:
> > > > I am sure the reason for 4ALU+2AGU and 128bit FP pipes is not because it is the best possible
> > > > configuration. The real reason? We only can speculate at this time. Maybe a cache bottleneck did
> > > > make adding a third AGU useless, maybe the fourth ALU is here for symmetry reasons, maybe...
> > >
> > > I find it funny that you like to tout pipe numbers, but you never discuss
> > > other architectural features that have direct impact in this discussion:
> > > - MOV elimination
> > > - Store-to-load forwarding
> > > - Memory reordering and memory disambiguation
> > > - Instruction fusing
> >
> > I would like to know how you think they affect the discussion. E.g., how read-modify
> > or read-modify-write fusion reduce the number of loads and stores?
>
> It may reduce the number of required uops, and thus change
> the optimal (whatever that means) pipelines ratio.
Macro-op fusion doesn't reduce the required number of uops. It joins uops to simplify the SS/OOOE logic. E.g., instead of a separate entry in the ROB for each uop, you have one entry for the macro-op.
Once the macro-op has been scheduled to the corresponding place, it is broken into uops and the execution units execute each uop.
The number of uops is not reduced.
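To make the bookkeeping concrete, here is a toy sketch in Python. It is not a model of any real core: the three instructions, their uop breakdown, and the names `Instr`, `tracked_entries`, and `executed_uops` are all made up for illustration. It only counts tracking entries versus uops that reach the execution units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    uops: tuple  # uops the execution units must run for this instruction


# Hypothetical decode of three x86-style instructions (illustrative only).
PROGRAM = [
    Instr("add rax, [rbx]", ("load", "alu")),           # read-modify
    Instr("add [rcx], rdx", ("load", "alu", "store")),  # read-modify-write
    Instr("sub rsi, rdi", ("alu",)),
]


def tracked_entries(program, fused):
    """ROB-style entries needed to track the program.

    Assumption: with fusion, all uops of one instruction share one entry;
    without it, every uop gets its own entry.
    """
    if fused:
        return len(program)
    return sum(len(instr.uops) for instr in program)


def executed_uops(program):
    """Uops seen by the execution units; the same with or without fusion."""
    return [uop for instr in program for uop in instr.uops]


if __name__ == "__main__":
    uops = executed_uops(PROGRAM)
    print("entries without fusion:", tracked_entries(PROGRAM, fused=False))
    print("entries with fusion:   ", tracked_entries(PROGRAM, fused=True))
    print("uops executed:         ", len(uops))
    print("loads / stores:        ", uops.count("load"), "/", uops.count("store"))
```

In this toy, fusion halves the number of tracked entries, but the AGUs and ALUs still see the same loads, stores, and ALU uops.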
> > You would also read my analysis of the four half-pipes on Zen and why I expect
> > performance on non-FMA code to be more close to Bulldozer than to IvyBridge.
>
> Please, where can I find this analysis?
>
It is in the same forum where I gave you instruction counts for mobile, server, and HPC workloads.