Jouni Osmala on 9/30/09 wrote:
>>>Now you want to add a second order programmability over those millions of transistors
>>>already required to do the work of making first order programmability, by turning
>>>their special purpose function to a generic function.
>>> . . .
>>>The ancient method of providing ISA programmability to programmers was that they
>>>could turn couple of instruction to multiple instructions that would be issued over
>>>multiple cycles, but those instructions where normal instructions and nothing fancy.
>>>But JMP instruction +instruction cache does the same function but in a more generic
>>>way. And you don't need the extra costs in decoding phase with use of those.
>>So, IOW, with current ISA(s), providing a "second order programmability" is nothing
>>more than a bandwidth/latency problem?
>Actually that was my way of saying that ancient methods of rewriting ISA, doesn't
>really help. Since they are pretty much more restricted way of doing same thing
>as a normal function call. What was proposed in this board was something that (you
>couldn't really combine with OoOE and normal programming model )OR( would hamper
>instruction latencies so badly that it would be huge net loss).

Hmm, I agree with:

- that it would disrupt the normal programming model, mostly from the perspective of compilers. - OK, but this is what compiler research is for, including research into programming models.

- latencies of operations performed by generalized computing elements (CEs) would differ from the latencies of the more hard-wired units present in today's CPUs. If not the latencies, then the clock frequency. - OK, but: they can be pipelined, there might be many more instances of a particular CE type, and a more general CE can be *directly* used in more situations than a less general one.

- the amount of [transistors dedicated to routing packets/messages/data among CEs] is non-trivial. - OK, but: in worst-case scenarios, the routing itself would take multiple cycles. In best-case scenarios, the routing could probably be implemented in such a way that it takes 1 cycle, or effectively no cycles at all (via pipelining). The *average* number of packets that can be routed per cycle depends on the probability of the worst case vs. the best case. If, say, ~95% of actual code falls into the best case, then you do not have a problem.
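
To put a rough number on the routing point: below is a back-of-the-envelope sketch in Python. The ~95% best-case share is the figure from above; the 1-cycle best case is also from above; the 4-cycle worst case is purely my assumption for illustration, as is the function name. It just computes the weighted average, nothing more.

```python
def expected_routing_cycles(p_best, best_cycles, worst_cycles):
    """Average routing cost per packet for a simple two-case mix."""
    if not 0.0 <= p_best <= 1.0:
        raise ValueError("p_best must be a probability between 0 and 1")
    # Weighted average: best case with probability p_best, worst case otherwise.
    return p_best * best_cycles + (1.0 - p_best) * worst_cycles


# 95% of packets route in 1 cycle; the remaining 5% take an assumed 4 cycles.
avg = expected_routing_cycles(p_best=0.95, best_cycles=1, worst_cycles=4)
print(f"{avg:.2f} cycles per routed packet on average")
```

Under those assumed numbers the average stays close to the best case, which is the sense in which a rare worst case is "not a problem". A much more expensive worst case, or a lower best-case share, would of course change the picture.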

In addition, may I point out that in the suggested low-level programming model (for the sake of simplicity, you can think of it as TTA, or data-flow, or a user-visible OoO engine) you (or the compiler) are able to express that some low-level operations A, B and C should be executed in parallel, whereas in today's x86 code you cannot express such a thing.
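
A toy sketch of what "expressing that A, B and C run in parallel" could look like, written in Python only to make the idea concrete. The bundle format, the register names and `run_bundles` are all hypothetical; this is not any real TTA or data-flow encoding. Every operation in a bundle reads the register state from before the bundle, so operations in the same bundle cannot depend on each other.

```python
import operator


def run_bundles(bundles, regs):
    """Run bundles in order; ops inside one bundle all see the same input state."""
    regs = dict(regs)  # work on a copy, leave the caller's registers alone
    for bundle in bundles:
        # Read phase: every op reads the pre-bundle values, so the ops in a
        # bundle are independent and could be issued in parallel.
        results = [(dst, fn(regs[a], regs[b])) for dst, fn, a, b in bundle]
        dsts = [dst for dst, _ in results]
        if len(set(dsts)) != len(dsts):
            raise ValueError("two ops in one bundle write the same register")
        # Write phase: commit all results of the bundle at once.
        regs.update(results)
    return regs


program = [
    [("t0", operator.add, "x", "y"),    # A
     ("t1", operator.mul, "x", "y"),    # B
     ("t2", operator.sub, "x", "y")],   # C  (A, B, C declared parallel)
    [("t3", operator.add, "t0", "t1")],   # depends on A and B
    [("out", operator.mul, "t3", "t2")],  # depends on the previous bundle and C
]

final = run_bundles(program, {"x": 5, "y": 3})
print(final["out"])
```

The point of the sketch is only that the parallelism is stated in the program itself, instead of being rediscovered by the hardware at run-time.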

I agree that the overall concept is more complex (in multiple non-trivial respects) than what can be found in today's x86 CPUs - but I think you (all) can agree with me that it is a legitimate alternative to what is happening today: spending the transistor budget on adding more cores and/or on doing more sophisticated run-time code analysis in hardware. Imagine an alternative universe where the transistor budget is being spent in a different way ...
