By: Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com), November 16, 2012 1:24 pm
Room: Moderated Discussions
Felid (Felid.delete@this.mailinator.com) on November 16, 2012 12:46 pm wrote:
> Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on November 16, 2012 9:47 am wrote:
> > > I think it was only 128-bit in the first place (I mean the internal datapath here).
> >
> > My understanding is that VEX.256 packed divides and square roots are processed as 2 x 128-bit operations
> > on Sandy Bridge (there is a single division unit in a single 128-bit stack, unlike FP ADD/FP MUL) but as
> > a single operation on Ivy Bridge (there is 2 division units in two 128-bit stacks, like FP ADD/FP MUL)
>
> Wrong. Divider-rooter is the only large 128-bit unit left in vector datapath. This one is just
> «kind of» pipelined. SP, DP & EP are now divided in 7, 14 & 18 clks (SB had — 14, 22 & 24).
Since you mention "EP", and judging from your timings, it looks like you are talking about scalar x87 code.
Anyway, I came to my conclusion from the doubled throughput for the packed SP case and the nearly doubled throughput for the packed DP case (see the reciprocal throughputs below). I wasn't aware this was due to improved pipelining; do you have a source for this?
                 SNB  IVB
VDIVPS/VSQRTPS    28   14
VDIVPD/VSQRTPD    44   28
source : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026
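
As a quick sanity check on the figures above, here is a minimal Python sketch that turns the reciprocal throughputs into speedup ratios. The function names are made up for illustration, and the lane counts (8 SP or 4 DP elements per instruction) assume the numbers refer to the VEX.256 forms discussed in this thread.

```python
# Reciprocal throughput in clocks per instruction, copied from the table above.
# "lanes" is an assumption: 8 floats / 4 doubles per 256-bit instruction.
RCP_THROUGHPUT = {
    "VDIVPS/VSQRTPS": {"SNB": 28, "IVB": 14, "lanes": 8},
    "VDIVPD/VSQRTPD": {"SNB": 44, "IVB": 28, "lanes": 4},
}


def speedup(snb_clks, ivb_clks):
    """Throughput gain of IVB over SNB (ratio of reciprocal throughputs)."""
    return snb_clks / ivb_clks


def clks_per_element(clks, lanes):
    """Average clocks spent per vector element at peak throughput."""
    return clks / lanes


if __name__ == "__main__":
    for name, row in RCP_THROUGHPUT.items():
        gain = speedup(row["SNB"], row["IVB"])
        per_elem = clks_per_element(row["IVB"], row["lanes"])
        print(f"{name}: {gain:.2f}x faster on IVB, {per_elem:.2f} clks/element")
```

This gives exactly 2x for the packed SP case and roughly 1.57x for the packed DP case.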



