Barcelona optimization guide

By: Vincent Diepeveen (diep.delete@this.xs4all.nl), May 16, 2007 1:36 am
Room: Moderated Discussions
Dresdenboy (M.Waldhauer@gmx.de) on 5/14/07 wrote:
---------------------------
>Vincent Diepeveen (diep@xs4all.nl) on 5/13/07 wrote:
>---------------------------
>>EduardoS (no@spam.com) on 5/13/07 wrote:
>>---------------------------
>>>The latency is 4 cycles for the lower 64 bits, 5 for the upper, comparing that upper to other processors:
>>>K-8: 5 cycles
>>>C2D: 7 cycles
>>>P4E: 11 cycles
>>>Multiply is a complex instruction, 5 cycles is ok.
>>
>>So if your car can drive at most at 100 KM/h,
>>then it is an ok driving speed considering
>>T-ford getting a speed at most of 40 KM/h ??
>
>But throughput is still 1/cycle for 64 bit muls and 0.5/cycle for 128 bit muls.
>Problem is only, that the result isn't available that quickly, but OOO execution
>will manage to hide that to some extent.

Let's split the question up by specific instruction:

First, the unsigned integer multiply of 64 bits x 64 bits, delivering a 128-bit unsigned integer result.
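
For reference, the operation under discussion can be sketched in a few lines of Python. This only illustrates the arithmetic, not how the hardware computes it; the helper name `mul_64x64_128` is made up for this example. The two returned halves correspond to the "lower 64 bits" and "upper 64 bits" whose latencies are quoted above.

```python
MASK64 = (1 << 64) - 1  # largest unsigned 64-bit value


def mul_64x64_128(a, b):
    """Return (low, high): the two 64-bit halves of the unsigned product a * b."""
    if not (0 <= a <= MASK64 and 0 <= b <= MASK64):
        raise ValueError("operands must fit in 64 unsigned bits")
    product = a * b               # Python ints are arbitrary precision
    low = product & MASK64        # lower 64 bits of the 128-bit result
    high = product >> 64          # upper 64 bits of the 128-bit result
    return low, high
```

For example, `mul_64x64_128(1 << 63, 2)` overflows the lower half entirely and returns `(0, 1)`.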

A throughput of 1: how must I interpret that?

Does it mean that while those 5 cycles of the multiply are being spent, the CPU CAN in practice execute 4 x 2 = 8 uops of 64-bit unsigned integer bit-fiddling in the other 2 execution units?

So, in short, the improvement in Barcelona over the currently sold K8 is that the multiply no longer blocks the other execution units during part of its execution?

Is it correct that Barcelona improved here over the current K8?
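
The latency/throughput distinction in the quoted reply can be made concrete with a toy model. This is a sketch under strong assumptions: an idealized pipelined multiplier, perfect out-of-order scheduling, and no other bottleneck. The function names are hypothetical, and the figures in the usage note (5 cycles latency, one 128-bit multiply started every 2 cycles) are simply the numbers quoted in this thread, not measurements.

```python
def cycles_dependent(n, latency):
    """n multiplies where each needs the previous result: latency is fully exposed."""
    return n * latency


def cycles_independent(n, latency, issue_interval):
    """n independent multiplies: a new one starts every `issue_interval` cycles,
    and the last one still needs `latency` cycles to finish."""
    if n == 0:
        return 0
    return (n - 1) * issue_interval + latency
```

With the quoted numbers, 10 dependent 128-bit multiplies would take `cycles_dependent(10, 5)` = 50 cycles in this model, while 10 independent ones would take `cycles_independent(10, 5, 2)` = 23 cycles. That gap is what "OOO execution will manage to hide that to some extent" refers to.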

Secondly, there are the SIMD instructions.
Take, for example, the one that multiplies 2 doubles * 2 doubles == 2 doubles (a double being a 64-bit floating-point number with a 52+ bit mantissa), which is so crucial for high-end calculations such as research to find new medicines (eating an impressive 0.5% of system time on big supercomputers).

That multiplication eats 5 cycles, and for 2 of those 5 cycles the CPU totally blocks other instructions from being executed, so here we can execute 3 x 2 = 6 uops of integer instructions while that SIMD multiply occurs.
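
The slot counting used in these questions can be written out explicitly. The sketch below only encodes the arithmetic as posed here (free cycles times the number of other execution units); whether either chip actually behaves this way is exactly the open question. The name `free_issue_slots` and its parameters are hypothetical.

```python
def free_issue_slots(latency, blocked_cycles, other_units):
    """Upper bound on uops the other units can run while one multiply is in flight,
    assuming they are idle-able only during the non-blocked cycles."""
    if not 0 <= blocked_cycles <= latency:
        raise ValueError("blocked_cycles must lie between 0 and latency")
    return (latency - blocked_cycles) * other_units
```

Under this reading, the SIMD case above is `free_issue_slots(5, 2, 2)` = 6, and the earlier integer figure of 4 x 2 = 8 corresponds to `free_issue_slots(5, 1, 2)`, i.e. assuming one blocked cycle.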

Am I formulating it correctly like this?

Thanks,
Vincent

>>
>>>>I was under the probably wrong misconception that it would be able to execute 4 instructions a cycle at integer area.
>>>>
>>>>When i read manual correct i deduce it's executing 3 instructions maximum a cycle.
>>>>
>>>>So Intel has bigger surplus potential of 33% for good programmers, did i read that
>>>>correct, or am i wrong there and is core2 not having 4 integer units either?
>>>
>>>Core 2 have 4 decoders (up to 5 instructions per clock with macro-fusion) but only
>>>3 integer units and can retire only 4 uOPs per cycle, Barcelona isn't too far behind, if any.
>>
>>If core2 can retire 4 uops per cycle and barcelona can retire 3 uops a cycle i
>>understand, then core2 can blow that barcelona core completely away. That's 33% faster speed.
>
>As already said, µOps/MacroOps are different. If an x86 instruction contains complex
>memory operand addresses then C2D has to create at least 2 µops, while already K8
>created only 1 MacroOp then. So 3 MacroOps/cycle mean up to 6 µOps/cycle, but without any explicitly fused ops.
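
The two ratios traded in the quoted exchange can be checked with a few lines of arithmetic. This is only a sketch of the peak numbers as stated in the thread: the assumption of at most 2 µops per MacroOp is taken from the "3 MacroOps/cycle mean up to 6 µOps/cycle" statement above, and the function names are made up for illustration. Peak rates say nothing about sustained performance.

```python
from fractions import Fraction


def peak_uops_per_cycle(macro_ops_per_cycle, uops_per_macro_op):
    """Peak µop rate if every MacroOp carries the maximum number of µops."""
    return macro_ops_per_cycle * uops_per_macro_op


def relative_advantage(a, b):
    """How much larger a is than b, as an exact fraction of b."""
    return Fraction(a, b) - 1
```

Here `peak_uops_per_cycle(3, 2)` gives 6, and `relative_advantage(4, 3)` gives 1/3, the "33%" figure, which compares 4 µops against 3 MacroOps and therefore mixes two different units.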