TOP3 blunders

By: Jouni Osmala, September 26, 2009 8:52 pm
Room: Moderated Discussions
>Now that we established that we do not need a GPU nor a video decoding hardware,

There is generally a multiple-orders-of-magnitude difference in perf/power between special-purpose hardware and software-configurable hardware. The big difference is that instead of spending a huge number of transistors selecting operations, decoding instructions and looking for dependencies, you just have small units that just do the work.

Byte addition costs ~200 transistors, 32-bit addition ~1000 transistors, and AND and OR are 4 transistors per bit.
Shifts by a constant known beforehand are almost free.
A multiplier takes approximately one adder, sized to the first operand, per bit of the second operand.

And then you compare those costs to the tens of millions of logic transistors in modern microprocessors, which are the programmable method of doing the same thing. Tens of millions of active transistors work to execute a small sequence of instructions that would end up as just a couple of thousand transistors if done in hardware without all the baggage of programmability. With the exception of the pipelined multiplier and divider, execution units are essentially free.
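To make the arithmetic above concrete, here is a minimal sketch that plugs in the rough figures from this post. The 20 million figure for CPU logic is an assumption picked from "tens of millions", and the function names are made up for illustration; none of this is data for a real process or cell library.

```python
# Rough transistor-count estimates using the ballpark figures from the post.
# Order-of-magnitude numbers only.

ADDER_TRANSISTORS = {8: 200, 32: 1000}  # adder width in bits -> transistors (from the post)
LOGIC_TRANSISTORS_PER_BIT = 4           # AND / OR, per bit (from the post)
CPU_LOGIC_TRANSISTORS = 20_000_000      # ASSUMPTION: one value within "tens of millions"


def adder_cost(bits):
    """Return the post's figure for an adder of the given width (8 or 32 bits)."""
    return ADDER_TRANSISTORS[bits]


def logic_cost(bits):
    """Bitwise AND or OR: about 4 transistors per bit."""
    return LOGIC_TRANSISTORS_PER_BIT * bits


def multiplier_cost(a_bits, b_bits):
    """One adder sized to the first operand per bit of the second operand."""
    return adder_cost(a_bits) * b_bits


def programmability_overhead(cpu_logic, fixed_function):
    """How many times more transistors the programmable route keeps busy."""
    return cpu_logic / fixed_function


if __name__ == "__main__":
    print("32-bit add:      ", adder_cost(32))
    print("32-bit AND/OR:   ", logic_cost(32))
    print("32x32 multiplier:", multiplier_cost(32, 32))
    print("overhead ratio:  ", programmability_overhead(CPU_LOGIC_TRANSISTORS, 2_000))
```

With these inputs the ratio comes out at about four orders of magnitude, which is the "multiple orders of magnitude" gap, and only the multiplier stands out among the execution units.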

What costs is the generic routing networks, instruction decoding, instruction selection, exception handling, TLB handling, branch prediction and ... Basically everything that makes it programmable and makes it easier for programmers to run their programs fast.

Now you want to add second-order programmability on top of the millions of transistors already required for first-order programmability, by turning their special-purpose functions into generic functions.

>(EAX = 1st number)
>(EBX = 2nd number)

>TEST ECX,(1<<31)
>JZ failed
>JO failed
>OR EAX,(1<<31)
>JMP done
>(EAX = sum-or-zero)
>If I implemented it correctly, it ideally takes some 8-9 cycles to execute this
>on an OoO x86 processor.

The code you posted should take 4 cycles on a modern x86 processor if the failed condition doesn't happen and the branch predictor works. Even code like that gets down to 4 cycles with OoO logic; you could get similar from an in-order RISC with proper scheduling, but not from an in-order x86.
And 15 cycles more if the branch predictor was wrong.
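As a rough illustration of how those two numbers combine, the sketch below computes the average cost for a given misprediction rate. The 4 and 15 come from the estimate above; the 10% rate and the function name are assumptions for the example, not measurements.

```python
def expected_cycles(base_cycles, mispredict_penalty, mispredict_rate):
    """Average cycles per execution when a fraction of runs pays the penalty."""
    return base_cycles + mispredict_penalty * mispredict_rate


if __name__ == "__main__":
    # 4 cycles when predicted correctly, 15 more on a mispredict (from the post).
    # The rates are hypothetical.
    for rate in (0.0, 0.1, 1.0):
        print(rate, expected_cycles(4, 15, rate))
```

So the sequence only stays near 4 cycles if the branch is well predicted; at a 10% miss rate the average is already 5.5 cycles.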

I didn't check whether your code works; I just scheduled it the way a CPU does with its OoO logic, or the way a compiler does for in-order machines.
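For readers who want to see what "just scheduling" means, here is a minimal dataflow-scheduling sketch. It assumes an idealized machine: unlimited issue width, full register renaming (so only true dependencies matter) and perfect branch prediction. The instruction list is hypothetical and is not the quoted code.

```python
def schedule(instructions):
    """Earliest-start scheduling limited only by true (read-after-write) dependencies.

    instructions: list of (name, dests, sources, latency) in program order.
    Returns (issue, total): issue maps name -> start cycle, total is the
    cycle in which the last result becomes available.
    """
    ready = {}   # register -> cycle when its latest value is available
    issue = {}
    total = 0
    for name, dests, sources, latency in instructions:
        # An instruction can start once all of its inputs are ready.
        start = max((ready.get(reg, 0) for reg in sources), default=0)
        issue[name] = start
        for reg in dests:
            ready[reg] = start + latency
        total = max(total, start + latency)
    return issue, total


# Hypothetical 4-instruction sequence, all single-cycle operations.
PROGRAM = [
    ("i0", ["r1"], ["r1", "r2"], 1),  # r1 = r1 op r2
    ("i1", ["f"],  ["r3"],       1),  # independent of i0, can start alongside it
    ("i2", ["r4"], ["r1"],       1),  # needs the result of i0
    ("i3", ["r5"], ["r4", "f"],  1),  # needs i2 and i1
]

if __name__ == "__main__":
    issue, total = schedule(PROGRAM)
    print(issue, total)
```

The result is the length of the longest dependency chain, which is what an OoO core approaches when nothing else gets in the way, and what a compiler tries to approach when ordering instructions for an in-order machine.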

>But I think your question is of a minor importance. The major question is: How
>to implement such a generic mechanism so that it is scalable?

The generic mechanism doesn't really work. OoO scheduling becomes impossible in your scheme. The extra transistors in critical paths take power and either slow down the clock speed or make EVERY instruction have multi-cycle latency.

The ancient method of providing ISA programmability to programmers was that they could turn a couple of instructions into multiple instructions that would be issued over multiple cycles, but those instructions were normal instructions and nothing fancy. But a JMP instruction plus the instruction cache does the same job in a more generic way, and you don't need the extra costs in the decode phase when you use those.