By: Michael S (already5chosen.delete@this.yahoo.com), November 17, 2012 12:50 pm
Room: Moderated Discussions
Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on November 16, 2012 1:24 pm wrote:
> Felid (Felid.delete@this.mailinator.com) on November 16, 2012 12:46 pm wrote:
> > Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on November 16, 2012 9:47 am wrote:
> > > > I think it was only 128-bit in the first place (I mean the internal datapath here).
> > >
> > > My understanding is that VEX.256 packed divides and square roots are processed as 2 x 128-bit operations
> > > on Sandy Bridge (there is a single division unit in a single 128-bit stack, unlike FP ADD/FP MUL) but as
> > > a single operation on Ivy Bridge (there are 2 division units in two 128-bit stacks, like FP ADD/FP MUL)
> >
> > Wrong. Divider-rooter is the only large 128-bit unit left in vector datapath. This one is just
> > «kind of» pipelined. SP, DP & EP are now divided in 7, 14 & 18 clks (SB had — 14, 22 & 24).
>
> Since you mention "EP" and from your timings it looks like you talk about scalar x87 code
>
> Anyway, I came to my conclusion from the doubled throughput for the packed SP case
> and nearly doubled for the packed DP case (see rcp throughputs below). I wasn't aware
> this is due to improved pipelining; do you have a source for this?
>
>
>                  SNB   IVB
> VDIVPS/VSQRTPS   28    14
> VDIVPD/VSQRTPD   44    28
>
> source : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026
>
>
>
The very same manual:
Latency/Reciprocal Throughput
              SNB     IVB
DIVSS/DIVPS   13/14   7/14
DIVSD/DIVPD   22/22   20/14
So, it's pretty obvious that both SNB and IVB have 128-bit division units, but on IVB they are partially pipelined.
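The arithmetic behind that conclusion can be spelled out with the double-precision figures quoted in this thread. This is only a sketch: the dictionary and function names are made up for illustration, and it simply assumes a 256-bit divide is fed through a 128-bit unit in back-to-back halves.

```python
# Cycle counts as quoted in this thread (double-precision rows only).
# (latency, reciprocal throughput) for the 128-bit DIVPD:
DIVPD_128 = {"SNB": (22, 22), "IVB": (20, 14)}
# Reciprocal throughput for the 256-bit VDIVPD:
VDIVPD_256_RTHROUGHPUT = {"SNB": 44, "IVB": 28}


def passes_through_128bit_unit(cpu):
    """How many 128-bit issue slots one 256-bit divide appears to occupy."""
    _, rthroughput_128 = DIVPD_128[cpu]
    return VDIVPD_256_RTHROUGHPUT[cpu] / rthroughput_128


def is_partially_pipelined(cpu):
    """True if a new divide can start before the previous one has finished."""
    latency, rthroughput = DIVPD_128[cpu]
    return rthroughput < latency


for cpu in ("SNB", "IVB"):
    print(cpu, passes_through_128bit_unit(cpu), is_partially_pipelined(cpu))
```

On both chips the 256-bit form costs two 128-bit slots, which is what a single 128-bit divider would give. Only on IVB is the reciprocal throughput lower than the latency.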
DIV r64 is even more interesting:
Latency/Reciprocal Throughput
          SNB         IVB
DIV r64   80-90/???   35-45/23
So, long integer division on IVB is not just partially pipelined; they also somehow managed to cut the worst-case latency in half.
Looks like they now apply a different algorithm. Or maybe they just extended to 128b/64b the old "two bits at a time" algorithm that has been in use for 64b/32b division since P5.
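To show why retiring two quotient bits per step would roughly halve the iteration count, here is a toy restoring divider. It is a sketch only: the function name and parameters are invented for this example, and nothing here claims to match what the hardware actually does (a real divider would more likely pick quotient digits with an SRT-style lookup than with a compare loop).

```python
def restoring_divide(dividend, divisor, width, bits_per_step):
    """Toy restoring division that retires `bits_per_step` quotient bits per step.

    Returns (quotient, remainder, steps). Illustrative model only.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in `width` bits")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")

    radix = 1 << bits_per_step
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        digit_in = (dividend >> shift) & (radix - 1)
        remainder = (remainder << bits_per_step) | digit_in
        # Pick the largest digit q with q * divisor <= remainder.
        q = 0
        while q + 1 < radix and (q + 1) * divisor <= remainder:
            q += 1
        remainder -= q * divisor
        quotient = (quotient << bits_per_step) | q
        steps += 1
    return quotient, remainder, steps


# A 128-bit dividend with a 64-bit divisor, one bit vs. two bits per step.
n = (1 << 100) + 12345
d = (1 << 63) + 5
print(restoring_divide(n, d, 128, 1)[2], restoring_divide(n, d, 128, 2)[2])
```

Both calls return the same quotient and remainder; the two-bit version just gets there in 64 steps instead of 128.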



