By: Eugene Nalimov (enalimov.delete@this.at.contextrelevant.dot.com), August 11, 2014 3:07 pm
Room: Moderated Discussions
1. For Visual C, active development of major new x87-related optimizations stopped around 2001-2003. It was assumed that customers who absolutely need FP performance would move to 64-bit land (first Itanium, later x64). There was some work done by developers from CPU makers, but not by the "mainline" team. I came up with some major improvement ideas in 2004 (or was it 2005?), but it was decided that the results would not be worth the cost. I don't know the current status -- Microsoft and I parted ways in 2009 -- but I would be surprised if they spend any resources on improving 32-bit x87 code generation. At least I'd recommend them not to.
2. Early versions of Windows for x64 did not save the x87 context during a context switch, so we could not use x87 in the generated code. Later versions of Windows changed that, and I suggested generating x87 code for x64 in some cases: x87 instructions are shorter than SSE2 ones, so that definitely makes sense when optimizing for size and can make sense when optimizing for speed as well. Unfortunately we were very seriously undermanned, so we could not work on it. I think that may be a good optimization even now, but I do not work on MSVC anymore...
Thanks,
Eugene
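
To put a rough number on the size argument in point 2, here is a minimal sketch that compares the encoded length of a few register-to-register instructions. The byte values are the standard encodings for these forms; the helper names (`x87_pair_bytes`, `sse2_pair_bytes`) and the idea of counting "add+multiply pairs" are made up for illustration and are not anything MSVC does.

```c
#include <assert.h>
#include <stddef.h>

/* Register-to-register forms; byte values as listed in the Intel SDM. */
static const unsigned char X87_FADD[]   = {0xD8, 0xC1};             /* fadd  st(0), st(1) */
static const unsigned char X87_FMUL[]   = {0xD8, 0xC9};             /* fmul  st(0), st(1) */
static const unsigned char X87_FXCH[]   = {0xD9, 0xC9};             /* fxch  st(1)        */
static const unsigned char SSE2_ADDSD[] = {0xF2, 0x0F, 0x58, 0xC1}; /* addsd xmm0, xmm1   */
static const unsigned char SSE2_MULSD[] = {0xF2, 0x0F, 0x59, 0xC1}; /* mulsd xmm0, xmm1   */

/* Hypothetical helper: code bytes for n add+multiply pairs on x87,
   assuming fxch_per_pair extra stack exchanges are needed per pair. */
size_t x87_pair_bytes(size_t n, size_t fxch_per_pair)
{
    assert(fxch_per_pair <= 8); /* illustrative sanity bound only */
    return n * (sizeof X87_FADD + sizeof X87_FMUL
                + fxch_per_pair * sizeof X87_FXCH);
}

/* Hypothetical helper: code bytes for n add+multiply pairs in scalar SSE2. */
size_t sse2_pair_bytes(size_t n)
{
    return n * (sizeof SSE2_ADDSD + sizeof SSE2_MULSD);
}
```

With no exchanges the x87 sequence is half the size (4 bytes per pair against 8), but every extra `fxch` costs 2 bytes, so the advantage depends on how well the register stack is managed.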
Klimax (danklima.delete@this.gmail.com) on August 10, 2014 8:48 am wrote:
> Michael S (already5chosen.delete@this.yahoo.com) on August 9, 2014 2:37 pm wrote:
> > Klimax (danklima.delete@this.gmail.com) on August 9, 2014 2:10 pm wrote:
> >
> > >
> > > It seems per Optimization manual that there are no longer significant
> > > performance differences between x87 instructions and scalar SSEx.
> > >
> > > (latency/throughput)
> > > FADD 3/1 vs. ADDSS (SSE1) 3/1
> > > FMUL 5/2 vs. MULSS (SSE1) 5/1
> > >
> >
> > The [scalar] x87 tax is not in instruction latency/throughput and never was (in fact,
> > on P4 x87 FADD throughput was better than scalar SSE2 throughput). The tax is
> > 1) in the need to use more regmoves or exchanges to achieve the same result.
> > And no, despite what the opt. manual may claim, they are never 100% free.
> > 2) in sw-visible register starvation. 8 visible registers are often enough for GPRs, but rarely enough
> > for inner FP loops on a wide machine. That's true even on SB/IB, which theoretically can do 2 loads per clock,
> > but even more so on previous Intel Core CPUs that, despite having the same scalar FPU width as SB,
> > can only do 1 load per clock. Of course, the same problem applies to 32-bit SSE/AVX as well.
> > 3) in the hard-to-understand but very real fact that after 35 years of trying the 2 popular compilers,
> > i.e. MSVC and gcc, still suck at x87 register allocation and associated stuff. I can still relatively
> > easily beat either of the two in x87. Of course, sometimes I can beat them in [scalar] 32-bit SSE/AVX
> > or even (much, much more rarely) in 64-bit SSE/AVX, but never by the same margin as in x87.
> > 4) x87 also sucks in rarer but not really exotic areas such as fp-integer conversions
> > and moving data to/from GPRs. The original x87 also sucked at delivering condition
> > codes to the main execution engine, but that was fixed ~20 years ago.
> >
> >
> > > Note: Visual Studio, at least, will by default use scalar SSEx instructions
> > > for x64 and when /arch:SSE or higher is enabled (or when targeting Vista+)
> > >
> > > Since 2010 IIRC.
> >
> > As far as I recollect, my copy of VS2010 at work can't generate x87 code on x64 at all, not just by default.
> >
> >
>
> For the first part: OK, might be. I never got around to doing any comparison.
> Some did, but IIRC it was on CPUs older than SB.
>
> Interesting, just tested it. Can't force x87 for x64.
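
For anyone who wants to run the comparison discussed in the quoted posts, a small kernel like the following may be a reasonable starting point. It is only a sketch: the function name `dot4` and the choice of four accumulators are my own assumptions, picked so that the inner loop keeps several FP values live at once and register allocation on the 8-entry x87 stack actually matters.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent accumulators (hypothetical test kernel).
   Build it twice for a 32-bit target and compare the generated assembly, e.g.
     gcc -m32 -O2 -S -mfpmath=387 dot4.c
     gcc -m32 -O2 -S -msse2 -mfpmath=sse dot4.c
   With 32-bit MSVC the comparable switches are /arch:IA32 and /arch:SSE2,
   plus /FA for an assembly listing. */
double dot4(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: four products per iteration, each with its own accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Counting the `fxch`, `fld` and `fstp` instructions in the x87 listing against the plain `movsd` traffic in the SSE2 listing gives a first impression of the register-management overhead described in points 1 and 2 of the quoted list; timing both builds on the same machine is needed for anything stronger.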