TOP3 blunders

By: ? (0xe2.0x9a.0x9b.delete@this.gmail.com), September 26, 2009 10:49 am
Room: Moderated Discussions
anon (anon@anons.org) on 9/26/09 wrote:
---------------------------
>? (0xe2.0x9a.0x9b@gmail.com) on 9/26/09 wrote:
>---------------------------
>>anon (no@mail.biz) on 9/25/09 wrote:
>>---------------------------
>>>In your opinion, what are the TOP 3 (or more) blunders in the history of microcomputers?
>>>
>>>IMHO:
>>>
>>>1) Xerox failing to develop and market the Alto as quickly as possible;
>>>
>>>2) IBM creating the PC in such a hurry that it not only sucked, but in the end,
>>>instead of IBM-PC, it became MS/Intel-PC;
>>>
>>>3) IBM's refusal to properly market OS/2, the last chance they had to kill MS and
>>>own the dominant OS for the PC after the end of the MS-DOS era;
>>>
>>>4) Commodore's "self destruction".
>>
>>I have only one, though it is not historical per se:
>>
>>1) The fact that we are slaves to microchip companies in the matter of application-specific
>>instruction set extensions. In my opinion, it would be more logical to develop a
>>partially configurable architecture (something between a normal CPU and an FPGA)
>>so that an application can adapt the CPU to suit its needs. The "blunder" is that
>>software developers are not demanding this feature, instead they are just politely
>>waiting for the microchip companies to introduce a couple of not-that-useful new instructions.
>>
>
>What kind of extensions?

I don't know if this question can be answered without actually concretely specifying the extension mechanism itself. Anyway, let's suppose that:

1. A CPU contains a bunch of computing elements (=CE) (e.g: multipliers, adders, shifters, comparators, etc). The CPU has multiple instances of each CE type.

2. Each CE has a specific identifier (which is a simple number) and a number of inputs.

3. Application code can (re)define the meaning of certain opcodes by specifying how to route data among CEs.

Now, imagine it is the year ~1995 and Intel just introduced MMX extensions. However, in this alternate reality people realized that instead of hard-wiring new instructions into the CPU it is better if they design a generic mechanism for implementing extensions in application code. So now the question is, assuming we have such a generic mechanism, how do we implement SSE extensions? Lets see:

The SSE instruction "MULPS xmm1, xmm2/mem128" can be broken down into the following commands:

1.1) Send bytes xmm1[0..3] and xmm2/mem128[0..3] to FP multiplier FP-MUL-0. Tag the result with "R-1.1". (In real implementation "TAG-1.1" would be a concrete number, I am using human-readable tags here.)
1.2) Send bytes xmm1[4..7] and xmm2/mem128[4..7] to FP-MUL-1, result has name R-1.2
1.3) Send bytes xmm1[8..11] and xmm2/mem128[8..11] to FP-MUL-2, result has name R-1.3
1.4) Send bytes xmm1[12..15] and xmm2/mem128[12..15] to FP-MUL-3, result has name R-1.4

2.1) Write data tagged as "R-1.1" to bytes xmm[0..3]
2.2) Write data tagged as "R-1.2" to bytes xmm[4..7]
2.3) Write data tagged as "R-1.3" to bytes xmm[8..11]
2.4) Write data tagged as "R-1.4" to bytes xmm[12..15]

Assuming the CPU has more than 4 multipliers, this computation can have throughput of 1 and latency (lat_FMUL + lat_WRITE).

Alternatively, step 1.X can be replaced with two smaller steps: 1st would read data from the XMM register and tag the data pieces, 2nd would be doing the multiplication. Like read-execute-write.

The instruction "MINPS xmm1, xmm2/mem128" can be implemented in a similar way, provided we have:
- a number of floating-point comparators
- CEs for choosing between 2 data packets based on a condition. Note that I did *not* say that we are choosing between 2 floating-point numbers! This is because I am explaining/talking about a *generic* mechanism for routing *data* among CEs. Anyway, lets call this computing element "COND". One step of MINPS can look like this:

X.1) Send data with tag "operand1[0..3]", data with tag "operand2[0..3]", and data with tag "comparison-result-0" to the computing element COND-0. The result has tag "smaller-0".
X.2) Send "operand1[4..7]", "operand2[4..7]", "comparison-result-1" to COND-1, result is "smaller-1"
X.3) Send "operand1[8..11]", "operand2[8..11]", "comparison-result-2" to COND-3, result is "smaller-2"
X.4) Send "operand1[12..15]", "operand2[12..15]", "comparison-result-3" to COND-3, result is "smaller-3"

----

Most SSE1/SSE2/SSE3/SSE4 instructions can be implemented like this. So, we have established that SSEx can be implemented "easily" - the question is what else can we implement (beyond SSE)? The following comes to my mind:

- Integer vector instructions

- Instructions with operands which have arbitrary data layout. For example, suppose you have determined that the optimal bit-length you need is 11 bits per input and 12 bits per output. No problem: you can extract the 5 11-bit numbers from a 64-bit register in parallel, do some 5-way computation and then pack the 5 12-bit results into a 64-bit register.

- Let the inputs and outputs of CEs performing floating-point computations have 3 separate parts per operand: sign, mantissa, exponent. Assuming we have such CEs, we can now try using them in video encoding/decoding applications, games and rendering. No freaking GP-GPU necessary.

----

Now that we established that we do not need a GPU nor a video decoding hardware, what else beyond this can we implement? Let's try to speed up interpreted languages which are using arbitrary-precision integer numbers (Smalltalk, Python, etc):

An integer will be encoded as (BNF):
INTEGER   ::= X IS_DIRECT
IS_DIRECT ::= true | false (1 bit)
X ::= 31_BIT_VALUE | 31_BIT_POINTER (31 bits)


Small integers which fit into 31-bits are encoded with 32-bits of data. The field IS_DIRECT is in bit 31 (the highest bit).

An instruction to ADD such numbers, optimistically assuming that both operands are direct integers, may look like this:

1.1) Read the 1st operand from a 32-bit register. Tag the data as "operand-1".
1.2) Read the 2nd operand from a 32-bit register. Tag the data as "operand-2".

2.1) Send "operand-1" to computing element EXTRACTOR-0 and command it to extract the bit 31. Tag the result as "isDirect-1".
2.2) Send "operand-2" to computing element EXTRACTOR-1 and command it to extract the bit 31. Tag the result as "isDirect-2".

2.4) Send "operand-1" to computing element EXTRACTOR-2 and command it to extract the bits 0..30. Tag the result as "value-1".
2.4) Send "operand-2" to computing element EXTRACTOR-3 and command it to extract the bits 0..30. Tag the result as "value-2".

3.1) Send "isDirect-1" and "isDirect-2" to AND-0. Result is "both-are-direct".
3.2) Send "value-1" and "value-2" to ADDER-0. Results are "sum" and "overflow-through-31-bits".
3.3) Create a constant named "failed" with a 32-bit value of 0.

4) Send "sum", "failed" and "both-are-direct" to COND-0. The result has name "sum-or-zero-1".

5) Send "sum-or-zero-1", "failed" and "overflow-through-31-bits" to COND-0. The result has name "sum-or-zero-2".

6) Write "sum-or-zero-2" into a 32-bit register.

The EXTRACTOR needs to implement bit-shifting, bit-masking and possibly sign extension. Notice that the whole addition takes 6 cycles, in case both inputs as well as the output fit into 31-bits (this is the most common case).

Now, let's implement the same thing in x86 assembly:


(EAX = 1st number)
(EBX = 2nd number)

MOV ECX,EAX

AND ECX,EBX
SHL EAX,1
SHL EBX,1

TEST ECX,(1<<31)
JZ failed

ADD EAX,EBX
JO failed

SHR EAX,1
OR EAX,(1<<31)
JMP done

failed:
MOV EAX,0

done:

(EAX = sum-or-zero)


If I implemented it correctly, it ideally takes some 8-9 cycles to execute this on an OoO x86 processor. Comparing this to the 1+6 cycles for the previous version we can see that the previous version was a little bit faster. The 1 additional cycle is needed for the CPU to realize that we are going to execute an application-specific opcode. Note that the 6 cycles also include load/store from/to general-purpose registers (I am mentioning this because it may not actually be needed - it depends on the implementation of the generic extension mechanism).

---

... does this answer your question? (The question was "What kind of extensions?")

But I think your question is of a minor importance. The major question is: How to implement such a generic mechanism so that it is scalable?
< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
TOP3 blundersanon2009/09/25 07:38 PM
  TOP3 blundersrwessel2009/09/25 07:58 PM
  TOP3 blundersPotatoswatter2009/09/26 02:12 AM
    TOP3 blundersJouni Osmala2009/09/26 08:47 AM
    TOP3 blundersWilco2009/09/26 10:51 AM
      TOP3 blundersPotatoswatter2009/09/26 01:04 PM
        TOP3 blundersRagingDragon2009/09/28 02:58 PM
  TOP3 blunders?2009/09/26 03:44 AM
    TOP3 blundersanon2009/09/26 06:55 AM
      TOP3 blunders?2009/09/26 10:49 AM
        TOP3 blundersPotatoswatter2009/09/26 01:38 PM
        TOP3 blundersJouni Osmala2009/09/26 09:52 PM
          TOP3 blunders?2009/09/27 01:08 AM
            implementationAM2009/09/27 01:31 AM
          the generic mechanismAM2009/09/27 01:15 AM
            the generic mechanism?2009/09/27 02:30 AM
          TOP3 blunderssomeone2009/09/27 07:49 AM
            TOP3 blunders?2009/09/27 08:33 AM
              TOP3 blundersBlistering blue barnacles2009/09/28 08:44 PM
              TOP3 blundersslacker2009/09/29 07:28 AM
                (12nJ/cycle)*3GHz = 36W (NT)Michael S2009/09/29 08:31 AM
                  That's what I get for posting so early. (NT)slacker2009/09/29 06:17 PM
          TOP3 blunderskoby m.2009/09/29 07:26 AM
            TOP3 blundersJouni Osmala2009/09/29 11:05 PM
              TOP3 blunders?2009/09/30 02:26 AM
    TOP3 blundersPotatoswatter2009/09/26 07:11 AM
      TOP3 blunderssomeone2009/09/26 09:26 AM
    ISA extensions - tensilica?David Kanter2009/09/26 10:17 AM
      ISA extensions - StretchWes Felter2009/09/26 11:24 AM
        ISA extensions - Stretch?2009/09/26 01:14 PM
          ISA extensions - StretchGabriele Svelto2009/09/28 03:14 AM
  TOP3 blundersanonymous2009/09/26 06:23 AM
  TOP3 smart movesanonib2009/09/26 07:11 PM
    TOP3 smart movesDavid W. Hess2009/09/26 07:55 PM
      TOP3 smart movesRichard Cownie2009/09/27 01:57 PM
        TOP3 smart movesRagingDragon2009/09/28 03:46 PM
    TOP3 smart movesa reader2009/09/27 08:44 AM
    TOP3 smart movesRichard Cownie2009/09/27 12:12 PM
      TOP2 probably illegal smart movesRichard Cownie2009/09/27 01:43 PM
        TOP2 probably illegal smart movesnn2009/09/27 07:27 PM
        TOP2 probably illegal smart movesDavid Kanter2009/09/29 01:48 AM
          TOP2 probably illegal smart movesRichard Cownie2009/09/29 10:31 AM
            TOP2 probably illegal smart movesMichael S2009/09/29 11:04 AM
              TOP2 probably illegal smart movesDavid Kanter2009/09/29 01:35 PM
              TOP2 probably illegal smart movesRichard Cownie2009/09/29 04:28 PM
              TOP2 probably illegal smart movesJouni Osmala2009/09/29 11:08 PM
              Link to good analysis on the matter.Jouni Osmala2009/09/29 11:22 PM
                Link to good analysis on the matter.Richard Cownie2009/09/30 09:59 AM
  TOP3 blunder of todayMichael S2009/09/29 04:31 AM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell green?