Jouni Osmala on 9/26/09 wrote:
>>Now that we established that we do not need a GPU nor a video decoding hardware,
>There is generally a difference of multiple orders of magnitude in perf/power between
>special-purpose hardware and software-configurable hardware. The big difference
>is that instead of spending a huge number of transistors on selecting operations,
>decoding instructions and looking for dependencies, you just have small units that do the work.
>Byte addition costs ~200 transistors, 32-bit addition ~1000 transistors. AND and OR are 4 transistors per bit.
>Shifts by a constant known in advance are almost free.
>A multiplier takes approximately one adder the size of the first operand per bit of the second operand.
>Then compare those costs with the tens of millions of logic transistors in modern
>microprocessors, which are a programmable method of doing the same thing. Tens of millions
>of active transistors work to execute a small sequence of instructions that would
>come down to just a couple of thousand transistors if done in hardware without all the baggage
>of programmability. With the exception of the pipelined multiplier and divider, execution units are free.
>What costs is the generic routing networks, instruction decoding, instruction
>selection, exception handling, TLB handling, branch prediction and ...
>Basically everything that makes it programmable and makes it easier for programmers to run their programs fast.
>Now you want to add a second order of programmability on top of those millions of transistors
>already required to provide the first order of programmability, by turning
>their special-purpose function into a generic function.

I generally agree with your assessment. But I am more interested in how to make it work, and not in how to make it fail.
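
To put the quoted figures side by side, here is a rough back-of-the-envelope sketch. It uses only the per-unit costs from the quote above (~200 transistors for a byte adder, ~1000 for a 32-bit adder, 4 per bit for AND/OR, one adder per bit of the second operand for a multiplier). The function names are mine, and the 20 million and 2,000 in the last line are illustrative picks for "tens of millions" and "a couple of thousand", not measured numbers.

```python
# Rough transistor-count estimates based on the per-unit costs quoted above.
# All function names are hypothetical helpers for illustration only.

ADDER_8BIT = 200     # ~transistors for a byte adder (from the quote)
ADDER_32BIT = 1000   # ~transistors for a 32-bit adder (from the quote)
LOGIC_PER_BIT = 4    # AND / OR cost per bit (from the quote)


def logic_op_transistors(bits):
    """Cost of a bitwise AND or OR over `bits` bits."""
    return LOGIC_PER_BIT * bits


def multiplier_transistors(adder_cost, second_operand_bits):
    """One adder the size of the first operand per bit of the second operand."""
    return adder_cost * second_operand_bits


def programmability_ratio(core_logic_transistors, fixed_function_transistors):
    """How many times more transistors the programmable version keeps busy."""
    return core_logic_transistors / fixed_function_transistors


if __name__ == "__main__":
    print(logic_op_transistors(32))                      # 32-bit AND/OR
    print(multiplier_transistors(ADDER_32BIT, 32))       # 32x32 multiplier
    # Assumption: 20 million stands in for "tens of millions",
    # 2,000 for "a couple of thousand".
    print(programmability_ratio(20_000_000, 2_000))
```

With those assumed inputs the ratio comes out at four orders of magnitude, which is in line with the "multiple orders of magnitude" claim, though the real number depends heavily on which core and which function you pick.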

>>(EAX = 1st number)
>>(EBX = 2nd number)
>>TEST ECX,(1<<31)
>>JZ failed
>>JO failed
>>OR EAX,(1<<31)
>>JMP done
>>(EAX = sum-or-zero)
>>If I implemented it correctly, it ideally takes some 8-9 cycles to execute this
>>on an OoO x86 processor.
>The code you posted should take 4 cycles on a modern x86 processor if the failed condition
>doesn't happen and the branch predictor works. Even code like that gets 4 cycles
>from OoO logic; you could get similar from an in-order RISC with proper scheduling, but not from an in-order x86.
>And 15 cycles more if the branch predictor was wrong.

I put the code into a loop and got an IPC of 1.8 on my notebook CPU. This suggests it takes (11/1.8) = ~6 cycles to execute.
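
The arithmetic behind that estimate, as a small sketch. The first function is just instructions divided by IPC, using my 11 instructions and measured IPC of 1.8. The second combines the two figures from the quoted reply (4 cycles when predicted correctly, 15 more on a mispredict) into an expected cost; the misprediction rate is a free parameter I am assuming for illustration, I did not measure it.

```python
# Cycle estimates from the numbers discussed above.
# Function names and the misprediction rate are illustrative assumptions.


def cycles_per_iteration(instructions, ipc):
    """Average cycles per loop iteration given instruction count and measured IPC."""
    return instructions / ipc


def expected_cycles(base_cycles, mispredict_penalty, mispredict_rate):
    """Expected cost when a fraction `mispredict_rate` of runs pays the penalty."""
    return base_cycles + mispredict_rate * mispredict_penalty


if __name__ == "__main__":
    # 11 instructions at IPC 1.8, as measured in the loop
    print(round(cycles_per_iteration(11, 1.8), 2))
    # 4 cycles base, 15 cycles penalty, hypothetical 10% misprediction rate
    print(expected_cycles(4, 15, 0.10))
```

Note that the measured figure is an average over the whole loop, so it says nothing by itself about why it is above the 4-cycle ideal.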

>I didn't check whether your code works; I just scheduled it the way a CPU does
>with its OoO logic, or a compiler does for in-order machines.
>>But I think your question is of a minor importance. The major question is: How
>>to implement such a generic mechanism so that it is scalable?
>The generic mechanism doesn't really work. OoO scheduling becomes impossible
>in your scheme. The extra transistors in critical paths take power and either slow down
>the clock speed or make EVERY instruction have multicycle latency.
>The ancient method of providing ISA programmability to programmers was that they
>could turn a couple of instructions into multiple instructions that would be issued over
>multiple cycles, but those instructions were normal instructions and nothing fancy.
>A JMP instruction plus the instruction cache performs the same function in a more generic
>way, and with those you don't need the extra costs in the decoding phase.

... I agree with most of what you wrote. The only point where I disagree is that I think it can be implemented and should be implemented.

Your argument is, basically, that current x86 CPUs exhibit the right balance between complexity, usefulness, power and performance.
