Jouni Osmala on 9/26/09 wrote:
>>Now that we established that we do not need a GPU nor a video decoding hardware,
>There are multiple orders of magnitude of difference between special-purpose hardware
>and software-configurable hardware in terms of perf/power, generally. The big difference
>is that instead of spending a huge number of transistors selecting operations,
>decoding instructions, and looking for dependencies, you just have small units that just do the work.
>Byte addition costs ~200 transistors; 32-bit addition costs ~1000 transistors. AND and OR are 4 transistors per bit.
>Shifts by a constant known beforehand are almost free.
>A multiplier takes approximately one adder the size of the first operand per bit of the second operand.
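
To put a number on that multiplier rule of thumb, here is a rough Python sketch. It only uses the adder costs quoted above; the function name and the lookup table are my own, and real multipliers are built in other ways, so treat the result as an order-of-magnitude estimate and nothing more.

```python
# Adder costs in transistors, taken from the quoted post (approximate figures).
ADDER_TRANSISTORS = {8: 200, 32: 1000}

def multiplier_transistors(first_bits, second_bits):
    """Quoted rule of thumb: one adder sized for the first operand
    per bit of the second operand. Hypothetical helper, not a real design tool."""
    return ADDER_TRANSISTORS[first_bits] * second_bits

print(multiplier_transistors(8, 8))    # 1600
print(multiplier_transistors(32, 32))  # 32000
```

So by this estimate a 32x32 multiplier is on the order of tens of thousands of transistors, still tiny next to the logic that fetches, decodes and schedules instructions.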

This may be helpful for perspective:

Interesting comparison of MPU energy cost today

64 bit multiply-add - 200 pJ
read 64 bits from cache - 800 pJ
move 64 bits across chip - 2000 pJ
execute an instruction - 7500 pJ
read 64 bits from DRAM - 12000 pJ

Notice it costs 15X more energy to go to DRAM than to
read the data from cache. Executing a multiply-add
instruction burns 97% of the energy in overhead
and only 3% in the arithmetic circuits.
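
Both ratios fall straight out of the figures listed above. A minimal Python sketch of the arithmetic, where the dictionary keys are labels I made up and the 7500 pJ instruction figure is assumed to include the 200 pJ of arithmetic:

```python
# Energy per operation in picojoules, as listed from the slide.
# Key names are my own labels, not terms from the slide.
ENERGY_PJ = {
    "fma64": 200,           # 64 bit multiply-add
    "cache_read64": 800,    # read 64 bits from cache
    "cross_chip64": 2000,   # move 64 bits across chip
    "instruction": 7500,    # execute an instruction
    "dram_read64": 12000,   # read 64 bits from DRAM
}

def dram_vs_cache_ratio(e=ENERGY_PJ):
    """How many times more energy a DRAM read costs than a cache read."""
    return e["dram_read64"] / e["cache_read64"]

def arithmetic_fraction(e=ENERGY_PJ):
    """Share of an instruction's energy spent in the arithmetic itself
    (assumes the instruction figure includes the multiply-add)."""
    return e["fma64"] / e["instruction"]

print(dram_vs_cache_ratio())                       # 15.0
print(round(100 * arithmetic_fraction(), 1))       # 2.7
print(round(100 * (1 - arithmetic_fraction()), 1)) # 97.3
```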

Also notice it is more energy efficient to redundantly
perform 9 separate multiply-add operations in different
locations across a chip than to do it once in one location
and broadcast the result across the chip! (This presumes
the operands were already widely distributed.)
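
The slide does not spell out the accounting behind that comparison, so here is one possible reading as a small Python sketch: the redundant case pays one multiply-add per location, and the centralized case pays one multiply-add plus one cross-chip move of the 64-bit result. The function names and the single-move assumption are mine.

```python
FMA_PJ = 200          # 64 bit multiply-add, from the list above
CROSS_CHIP_PJ = 2000  # move 64 bits across chip, from the list above

def redundant_cost(n_sites):
    """Energy to recompute the multiply-add locally at n_sites locations."""
    return n_sites * FMA_PJ

def broadcast_cost(n_moves=1):
    """Energy to compute once and move the result across the chip.
    n_moves=1 is an assumption; a real broadcast may need more moves."""
    return FMA_PJ + n_moves * CROSS_CHIP_PJ

print(redundant_cost(9), broadcast_cost(1))  # 1800 2200
```

Under that reading the 9 redundant operations cost 1800 pJ against 2200 pJ for compute-and-move, and every extra cross-chip move widens the gap.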

from IDF paper TCIS001, slide 19
