Observations, Notes and Critical Analysis of Transmeta’s Work
The scarcity of ALUs (only two are provided) appears to be a major impediment to the performance of both CMS itself and the generated code. Since all address generation is performed in these ALUs before the final virtual address is forwarded to the LSU, the TM5xxx is at a clear disadvantage relative to the P6 and K7 microarchitectures (which have dedicated AGUs for each LSU). This appears to have been remedied in the just-released TM8000 design, however.
Transmeta clearly also made major performance sacrifices in exchange for low power. For instance, as described in the ALU section, apparently only ALU1 can do shift operations. This can seriously impact some x86 code, not to mention the CMS translation code itself.
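To get a feel for how much a single shift-capable ALU can cost, here is a minimal sketch of a lower bound on issue cycles. It assumes a purely hypothetical model: two ALU slots per cycle, all operations independent, and at most one shift per cycle when only one ALU has a shifter. The function name and the operation labels are mine, not anything taken from CMS or Transmeta documentation.

```python
def min_issue_cycles(ops, single_shifter=True, alu_count=2):
    """Lower bound on cycles to issue independent ALU ops (hypothetical model).

    ops            -- list of operation names; "shift" marks a shift operation
    single_shifter -- if True, only one ALU can execute shifts (one per cycle)
    alu_count      -- number of ALU issue slots per cycle
    """
    total = len(ops)
    shifts = sum(1 for op in ops if op == "shift")
    # Ceiling division: every cycle can retire at most alu_count ops.
    bound_by_width = -(-total // alu_count)
    if single_shifter:
        # Shifts serialize on the one shift-capable ALU.
        return max(bound_by_width, shifts)
    return bound_by_width


shift_heavy = ["shift", "shift", "shift", "shift"]
print(min_issue_cycles(shift_heavy, single_shifter=True))
print(min_issue_cycles(shift_heavy, single_shifter=False))
```

Under these assumptions a run of four independent shifts needs at least four cycles with one shifter but only two if both ALUs could shift. Real code has dependencies and other slot constraints, so this is only an illustration of the direction of the effect.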
The instruction encoding also incurs a large number of wasted no-op slots: out of the roughly 231313 instructions in the CMS version examined, 70363 (over 30%) were nops. This could easily be remedied by using a stop-bit-based format, as in TI DSPs or IA-64; I suspect the TM8000 has switched to such a format.
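The nop figure above, and the reason a stop-bit format helps, can be sketched in a few lines. The first function simply recomputes the ratio from the two counts quoted in the text. The other two compare slot usage under assumed encodings: a fixed-width format that pads every issue group with nops, and a stop-bit format that stores only real operations and marks the end of each group with a bit. The group contents and the width of 4 are hypothetical and chosen only for illustration.

```python
def nop_fraction(total_instructions, nops):
    """Fraction of instruction slots occupied by nops."""
    return nops / total_instructions


def fixed_width_slots(groups, width):
    """Slots used when each issue group is padded with nops up to a
    multiple of `width` (assumed fixed-width bundle format)."""
    total = 0
    for group in groups:
        bundles = -(-len(group) // width)  # ceiling division
        total += bundles * width
    return total


def stop_bit_slots(groups):
    """Slots used when only real operations are stored and a stop bit
    on the last operation marks the end of each issue group."""
    return sum(len(group) for group in groups)


# Counts quoted in the text for the CMS version examined.
print(round(nop_fraction(231313, 70363), 3))

# Hypothetical issue groups: operations that can issue together.
groups = [["add", "load"], ["sub"], ["add", "shift", "load", "branch"]]
print(fixed_width_slots(groups, width=4), stop_bit_slots(groups))
```

With these made-up groups the padded format spends 12 slots on 7 real operations, while the stop-bit format spends 7. The saving in a real image depends on how often groups fall short of the bundle width.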
Most of the non-time-critical parts of CMS itself appear to be written in C and compiled with a version of gcc and binutils hacked to generate TM5xxx code. This is evident from the relatively poor scheduling of certain code sequences, compared to what a good trace scheduler or programmer could do.
The released CMS image also appears to contain external debugger support, possibly accessible through the JTAG port. It is known that major hardware vendors have been given the appropriate debugging kits, so it may be possible to activate the CMS debug code externally.
Even though the processor is fully interlocked via register operand scoreboarding, the disassembled code contains delay slots in many places where they are not strictly necessary, while they are absent in other places. This is most likely the result of hardware bugs in the pipeline logic, which Transmeta has in fact admitted are masked by CMS in shipping hardware.
There are also a number of cases where instructions which could normally be paired together are not scheduled this way. It has been documented that CMS uses a different instruction mix at different clock speeds such that none of the critical circuit timing paths used by a given bundle will exceed the clock period.
The processor is typically clocked at the maximum frequency while running code within CMS itself, since such execution is infrequent yet obviously must be done as fast as possible. However, this does appear to limit the valid instruction combinations found in the CMS image to a subset of those possible when running at slower clock speeds.
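The clock-dependent pairing rule can be illustrated with a small sketch. It assumes each pair of operations issued together exercises a critical path with some delay, and that a pair is legal only if that delay fits within one clock period. The delay table, the operation names and the frequencies are all invented for the example; none of these numbers come from Transmeta.

```python
# Hypothetical critical-path delays, in nanoseconds, for operation pairs
# issued in the same bundle. These values are illustrative only.
PAIR_DELAY_NS = {
    frozenset(["add", "load"]): 1.2,
    frozenset(["add", "shift"]): 1.6,
    frozenset(["shift", "load"]): 1.9,
}


def clock_period_ns(mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / mhz


def legal_pairs(mhz):
    """Pairs whose assumed critical path fits within one clock period."""
    period = clock_period_ns(mhz)
    return {pair for pair, delay in PAIR_DELAY_NS.items() if delay <= period}


fast = legal_pairs(700)  # period of about 1.43 ns
slow = legal_pairs(400)  # period of 2.5 ns
print(len(fast), len(slow))
print(fast <= slow)  # the fast-clock set is a subset of the slow-clock set
```

In this model, code scheduled for the highest clock may only use the pairs that are legal there, which is consistent with the CMS image using a restricted subset of the pairings available at lower speeds.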