# Barcelona optimization guide

By: Vincent Diepeveen (diep.delete@this.xs4all.nl), May 16, 2007 9:51 am
EduardoS (no@spam.com) on 5/16/07 wrote:
---------------------------
>>Let's distinguish the questions to the specific instructions:
>>
>>Unsigned integer multiply, 64 bits x 64 bits, delivering a 128-bit unsigned integer result.
>>
>>A throughput of 1: how must I interpret that?
>>
>>Does it mean that while those 5 cycles of the multiply are being spent, the CPU can
>>in practice execute 4 x 2 = 8 uops of 64-bit unsigned integer bit-fiddling in the other 2 execution units?
>>
>>So in short, the improvement in Barcelona over the currently sold K8 is that it
>>no longer blocks other execution units during part of its execution?
>>
>>Is it correct that Barcelona improved here over current K8?
>
>Let's say... The full multiplication is issued in the first ALU, and that
>ALU (and only that ALU) will be "locked" for 2 clocks (0.5 throughput); after that
>it can accept a new one. So, for example, a multiplication on the first clock, a new one on the third, one more on the fifth, etc.
>
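
The schedule described above can be written down as a small sketch. It is only a toy model of the quoted numbers, not a statement about the real scheduler: the function names are made up for illustration, the 2-clock lock comes from the quoted answer, and the count of 3 ALUs is taken from the "3xALU" mentioned further down in the same reply.

```python
def mul_issue_clocks(n_muls, lock_clocks=2, first_clock=1):
    """Clocks on which back-to-back full multiplies can issue to one ALU.

    Assumes each multiply locks that ALU for `lock_clocks` clocks
    (2 clocks, i.e. 0.5 throughput, per the quoted explanation).
    """
    return [first_clock + i * lock_clocks for i in range(n_muls)]


def free_alu_slots(n_muls, lock_clocks=2, n_alus=3):
    """Issue slots left on the *other* ALUs while one ALU is locked.

    Assumes only the multiplying ALU is blocked and every other ALU
    can take one op per clock (an idealised upper bound).
    """
    locked_clocks = n_muls * lock_clocks
    return locked_clocks * (n_alus - 1)


print(mul_issue_clocks(3))  # [1, 3, 5]
print(free_alu_slots(1))    # 4
```

Under these assumptions one multiply leaves 4 ALU slots free on the other two ALUs, and a stream of multiplies issues on clocks 1, 3, 5, and so on.
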
>>Secondly, there are the different SIMD instructions.
>>Take, for example, the SIMD instruction that multiplies 2 doubles
>>* 2 doubles == 2 doubles (a double being a 64-bit floating point value with a 52+ bit
>>mantissa), which is so crucial for high-end calculations, such as research to find
>>new medicines (eating an impressive 0.5% of system time on big supercomputers).
>>
>>That multiplication eats 5 cycles, and for 2 of those 5 cycles the CPU totally
>>blocks other instructions from being executed, so here we can execute 3 x 2 = 6 uops
>>of integer instructions while that SIMD multiply occurs.
>>
>>Do I formulate it correctly like this?
>
>During this period (2 clocks) other units will be able to execute... 4 uops...
>or 16 uops (2 in each ALU, AGU, FADD and FMISC, 3xALU + 3xAGU + 1FADD + 1FMISC) if you ignore decoding and retirement.
>The packed floating point multiplication generates 2 macro-ops on K8 (one on Barcelona)
>and the floating point multiplication unit can receive a new macro-op every clock,
>although the result will only be ready 4 clocks later.
>
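
As a sanity check on the arithmetic quoted above, here is a minimal sketch. The unit counts (3 ALU, 3 AGU, 1 FADD, 1 FMISC) and the 4-clock delay are taken from the quoted reply; the function names, and the reading that "4 clocks after" is counted from the issue clock, are my own assumptions.

```python
# Execution units other than the FP multiplier, as listed in the quoted reply.
UNITS = {"ALU": 3, "AGU": 3, "FADD": 1, "FMISC": 1}


def peak_other_uops(clocks, units=UNITS):
    """Upper bound on uops the other units can execute in `clocks` clocks.

    Ignores decoding and retirement limits, as the quoted reply does.
    """
    return clocks * sum(units.values())


def fmul_schedule(n_ops, latency=4, first_clock=1):
    """(issue_clock, ready_clock) pairs for a fully pipelined multiplier.

    Assumes one new macro-op is accepted every clock and each result
    is ready `latency` clocks after its own issue clock.
    """
    return [(first_clock + i, first_clock + i + latency) for i in range(n_ops)]


print(peak_other_uops(2))  # 16
print(fmul_schedule(3))    # [(1, 5), (2, 6), (3, 7)]
```

So 2 clocks x 8 units gives the 16 uops, and a pipelined multiplier keeps accepting work each clock even though every individual result still arrives 4 clocks after it was issued.
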

This is very clear!

Many thanks for the explanation,
Vincent