Barcelona optimization guide

By: Vincent Diepeveen (diep.delete@this.xs4all.nl), May 13, 2007 8:18 am
Room: Moderated Discussions
EduardoS (no@spam.com) on 5/13/07 wrote:
---------------------------
>You didn't understand it correctly...
>
>>It mentions for example latency of multiplication instruction is 5 cycles (!!).
>>Even an instruction like PMULUDQ is 5 cycles whereas i don't see why this can't get done faster.
>>
>>Integer multiply 64 x 64 bits unsigned == 128 bits also seemingly has a latency
>>of 5 cycles i read at appendix A page 219.
>>
>>Now i am not so good in counting and speaking for myself here, as i'm not very
>>good in math, so i might have missed it from a previous generation K8, but i tend
>>to remember that it was 4 cycles there, from an email i had.
>
>The latency is 4 cycles for the lower 64 bits, 5 for the upper, comparing that upper to other processors:
>K-8: 5 cycles
>C2D: 7 cycles
>P4E: 11 cycles
>Multiply is a complex instruction, 5 cycles is ok.

So if your car can do at most 100 km/h,
then that is an OK speed because a
Model T Ford does at most 40 km/h??
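
For reference, the operation under discussion takes two 64-bit unsigned operands and produces a full 128-bit product, and the 5-cycle figure quoted above is for the upper 64 bits. Below is a portable C sketch of the same computation, built from four 32-bit partial products. The name mul64x64 is made up for illustration, and this says nothing about how the hardware multiplier is actually organized; it only shows what result is being computed.

```c
#include <stdint.h>

/* Hypothetical helper: full 64 x 64 -> 128 bit unsigned multiply.
   Splits each operand into 32-bit halves and combines the four
   partial products, the way one would do it by hand. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Middle column: sum of three 32-bit values, cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On x86-64 a compiler would normally emit a single MUL for this, so the sketch is only meant to show which halves of the result the quoted latencies refer to.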

>>I was under the probably wrong misconception that it would be able to execute 4 instructions a cycle at integer area.
>>
>>When i read manual correct i deduce it's executing 3 instructions maximum a cycle.
>>
>>So Intel has bigger surplus potential of 33% for good programmers, did i read that
>>correct, or am i wrong there and is core2 not having 4 integer units either?
>
>Core 2 have 4 decoders (up to 5 instructions per clock with macro-fusion) but only
>3 integer units and can retire only 4 uOPs per cycle, Barcelona isn't too far behind, if any.

If Core 2 can retire 4 uops per cycle and Barcelona, as I understand it, can retire 3 uops per cycle, then Core 2 can blow that Barcelona core completely away. 4 versus 3 is a 33% higher peak rate.

>>In the optimization manual it prefers to replace short loops of 4x4 already by
>>hand written out code for those 16 cases.
>>
>>Why can't the processor perfectly predict such loops just like intel seemingly is doing?
>
>You must have a history to predict a branch, the fetcher must go to somewhere else
>(throwing away some instructions fetched), and you must have an extra instruction,
>(a Jcc), so hand written still faster.

No need for a history to predict the next branch:

for( i = 0 ; i < 4 ; i++ ) {
    for( j = 0 ; j < 4 ; j++ ) {
        ...
    }
}

If K8-barcelona can't perfectly predict that, then AMD has some work to do.
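
To make the 4x4 case concrete, here is a minimal sketch in C of the two forms being argued about: the nested loop, and the same work written out by hand as 16 cases. The function names and the choice of summing a 16-element array are made up for illustration; the manual's own example is not reproduced here.

```c
#include <stdint.h>

/* Rolled form: two nested loops with fixed trip counts of 4.
   Every iteration ends in a conditional loop branch. */
static uint32_t sum4x4_loop(const uint32_t m[16])
{
    uint32_t s = 0;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            s += m[4 * i + j];
        }
    }
    return s;
}

/* Hand-unrolled form: all 16 cases written out, no loop branches. */
static uint32_t sum4x4_unrolled(const uint32_t m[16])
{
    return m[0]  + m[1]  + m[2]  + m[3]
         + m[4]  + m[5]  + m[6]  + m[7]
         + m[8]  + m[9]  + m[10] + m[11]
         + m[12] + m[13] + m[14] + m[15];
}
```

Both functions return the same value. Which one is faster on a given core is exactly the open question here, and a compiler may well unroll the loop version by itself when optimizing.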

>>It is 2007 now, no longer 1997.
>>
>>Seems their aim is to serve floating point.
>>
>>If a problem in its fundament is integer oriented and if you are not capable of
>>rewriting an algorithm to integers, then you need to rethink your solution. Floating
>>point is completely overrated IMHO.
>
>IMHO floating point is a lazy way to handle non-integers numbers, but a lazy way
>is also faster, faster is also cheaper, and cost is one of most important factors in software development ;)
>
>>SSEx is not paying enough attention to allow to vectorize integer codes. Putting
>>all cards at floating point AMD obviously tries to conquer the highend market.
>
>True, they could do a better work at integer SSEx.
>
>>It is interesting therefore that AMD already includes popcount into its processor.
>
>The easiest way to know if a integer is a power of 2.

Actually it is meant to count the total number of set bits. For detecting whether just 1 bit is set there are a zillion ways to do that faster than with a popcount function, as most likely it is a slow function.

Assuming a is non-zero, then:
a & (a-1)

will give zero if a is a power of 2.
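
A minimal sketch in C of the two approaches, for comparison. The function names are made up for illustration, and the bit count is a plain software loop standing in for a popcount instruction, so it says nothing about how fast the hardware instruction is.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask test: a & (a - 1) clears the lowest set bit, so the result is
   zero exactly when at most one bit was set. The a != 0 check handles
   the zero case, which the mask alone would wrongly accept. */
static bool is_pow2_mask(uint64_t a)
{
    return a != 0 && (a & (a - 1)) == 0;
}

/* Software bit count: each iteration clears the lowest set bit,
   so the loop runs once per set bit. */
static unsigned popcount64(uint64_t a)
{
    unsigned n = 0;
    while (a != 0) {
        a &= a - 1;
        n++;
    }
    return n;
}

/* Popcount test: a power of 2 has exactly one bit set. */
static bool is_pow2_popcount(uint64_t a)
{
    return popcount64(a) == 1;
}
```

The two tests agree for every input, including zero; the mask version just needs one subtract, one AND and a compare.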

Vincent
