By: Ronald Maas (rmaas.delete@this.wiwo.nl), November 29, 2014 9:58 am
Room: Moderated Discussions
Jouni Osmala (josmala.delete@this.cc.hut.fi) on November 25, 2014 4:40 am wrote:
> Ronald Maas (rmaas.delete@this.wiwo.nl) on November 24, 2014 7:13 pm wrote:
> > Michael S (already5chosen.delete@this.yahoo.com) on November 23, 2014 11:24 am wrote:
> > > Apart from their disadvantages, both 68K and VAX shared one advantage over x86 - 2-byte granularity of
> > > instructions. A P6-style brute-force approach to parsing and early decoding would take relatively
> > > fewer hardware resources. I don't believe it could have helped the VAX, but it could make a 3-way
> > > 68K feasible even in a transistor budget that does not allow a decent decoded instruction cache.
> > >
> > >
> >
> > A huge benefit of the x86 instruction encoding scheme is that it allows the instruction length
> > to be determined by inspecting only the first 1, 2 or 3 bytes of the instruction (not counting
> > any prefixes). The only exception is when length-changing prefixes are used, such as the
> > address-size and operand-size prefixes. When the processor encounters these prefixes in the
> > instruction stream, it can no longer decode those instructions in a single cycle. Search for
> > LCP in the Intel Optimization Reference Manual http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
> >
> > With the 68K and the VAX, often the whole instruction must be parsed in order to determine its length,
> > which would significantly increase the technical complexity of building a superscalar instruction decoder.
> >
> > Just to illustrate, here is a fragment from the document "Another
> > Approach to Instruction Set Architecture — VAX":
> >
> > How long is the following instruction?
> >
> > addl3 r1, 737(r2), (r3)[r4]
> >
> > The name addl3 means a 32-bit add instruction with three operands. Assume the length of the VAX opcode is 1
> > byte. The first operand specifier — r1 — indicates register
> > addressing and is 1 byte long. The second operand
> > specifier — 737(r2) — indicates displacement addressing
> > and has two parts: the first part is a byte that
> > specifies the word-displacement addressing mode and base register
> > (r2); the second part is the 2-byte long displacement
> > (737). The third operand specifier — (r3)[r4] — also has
> > two parts: the first byte specifies register deferred
> > addressing mode ((r3)), and the second byte specifies the
> > Index register and the use of indexed addressing ([r4]).
> > Thus, the total length of the instruction is 1 + (1) + (1+2) + (1+1) = 7 bytes.
> >
> > Ronald
>
> Think about all the instructions that would be a SINGLE uop in the PPro: how complicated
> is THEIR length calculation on the VAX? Remember that the PPro front end didn't handle more
> than 1 complex instruction per cycle, and less than 1 really complex instruction per cycle.
>
> So a simple, simple, complex combo wouldn't be harder on the VAX than on x86. And then there is
> the "stall the front end to handle a REALLY complex instruction from microcode" case
> that the Pentium Pro had, and there is no reason why the VAX couldn't make the same choice.
>
> I think the following could be what a parallel VAX decoder would have looked like in the era after
> the VAX was cancelled; it's similar to what x86 does, but modified for the VAX's orthogonality.
> Most of the work could be moved to instruction cache fill.
>
> -------------------
> Let's assume every byte is an addressing-mode specifier and decode it as such.
> Let's assume every byte (pair) is an instruction opcode and decode it as such.
> ---------------------
> Go through the dependency chain from the first instruction to the last to pick which uops to put forward, counting operands
> and stepping forward by the number of bytes given by each decoded addressing-mode specifier. Register specifiers are
> collapsed into the opcode directly; memory operands write a temporary register and become a separate uop.
> ---------------------------------------------
> Handle the case in which the front end produces too many uops for the scheduler to take in a cycle: forward
> the uops that fetch operands from memory, and keep the instruction's opcode waiting for its remaining
> operands if the instruction spills over into the next fetch.
> Fall back to microcode for the really complex cases.
> --------------------------------------------------
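First, to make the quoted addl3 example concrete, here is a rough C sketch of the serial walk a VAX length decoder has to make. The helper name and the encoded byte values are my own reconstruction from the VAX addressing-mode rules, so treat them as illustrative only; the point is just that each specifier's length comes from its own mode byte, so specifier N+1 cannot even be located before specifier N has been parsed, whereas on x86 the leading opcode/ModRM bytes pin down the total length up front.

#include <stdint.h>
#include <stdio.h>

/* Length in bytes of one VAX operand specifier, given its first byte.
   Only the modes used in the addl3 example are covered; immediate mode
   (autoincrement on PC) is left out because its length also depends on
   the operand data type implied by the opcode. */
static int specifier_length(const uint8_t *p)
{
    switch (p[0] >> 4) {
    case 0x5: return 1;                           /* register: Rn            */
    case 0x6: return 1;                           /* register deferred: (Rn) */
    case 0x4: return 1 + specifier_length(p + 1); /* index: [Rx] + base spec */
    case 0xA: return 2;                           /* byte displacement       */
    case 0xC: return 3;                           /* word displacement       */
    case 0xE: return 5;                           /* longword displacement   */
    default:  return -1;                          /* not handled in sketch   */
    }
}

int main(void)
{
    /* addl3 r1, 737(r2), (r3)[r4], reconstructed encoding:
       C1        opcode (addl3)
       51        r1        register mode 5, reg 1
       C2 E1 02  737(r2)   word-displacement mode C, reg 2, disp 0x02E1
       44 63     (r3)[r4]  index mode 4 on r4, base: register deferred (r3) */
    const uint8_t insn[] = { 0xC1, 0x51, 0xC2, 0xE1, 0x02, 0x44, 0x63 };

    int len = 1;                    /* opcode byte */
    for (int i = 0; i < 3; i++) {   /* addl3 has three operand specifiers */
        /* Serial dependency: specifier i+1 cannot be located until the
           length of specifier i is known. */
        len += specifier_length(&insn[len]);
    }
    printf("instruction length = %d bytes\n", len);   /* prints 7 */
    return 0;
}

Running it prints 7, matching the 1 + (1) + (1+2) + (1+1) breakdown in the quoted fragment.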
So if I understand you correctly, you want to decode each part of the instruction in separate decoders in parallel, e.g. one decoder for the opcode, a second decoder for operand specifier 1, a third decoder for operand specifier 2.
Any thoughts on how many decoders would be needed to approach the performance of the PPro? Also, do you want decoders that can only decode opcodes or only operand specifiers, or decoders that can do both? Note that the number of operand specifiers per VAX instruction can vary from 0 to 6, according to http://www.cs.ccsu.edu/~kjell/cs254/ch08/ch8_11.html
By the way, I do think it makes sense to do it the way you suggested. To show how I read your scheme, I put a rough sketch of the pre-decode and pick stages below.
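This is only my reading of it, in plain C: all the structure names, the tiny mode/opcode tables and the 16-byte line size are mine, and the addl3 bytes are the same reconstruction as above. At cache fill, every byte is decoded both as a possible specifier and as a possible opcode; at fetch time a short serial pick stage only hops over the precomputed lengths instead of re-decoding anything.

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 16

/* Per-byte pre-decode hints, filled in at I-cache line fill.  Every byte is
   decoded BOTH as if it started an operand specifier and as if it were an
   opcode; only the pick stage below decides which guess was real. */
struct hint {
    uint8_t spec_len;    /* specifier length if this byte starts one          */
    uint8_t spec_is_mem; /* 1 if that specifier would need a memory uop       */
    uint8_t op_nspec;    /* operand-specifier count if this byte is an opcode */
    uint8_t op_ucode;    /* 1 if that opcode should be handed to microcode    */
};

/* Toy specifier decoder, same subset of modes as the earlier sketch. */
static uint8_t spec_len(const uint8_t *p)
{
    switch (p[0] >> 4) {
    case 0x5: case 0x6: return 1;
    case 0x4: return 1 + spec_len(p + 1);   /* index mode + base specifier */
    case 0xA: return 2;
    case 0xC: return 3;
    case 0xE: return 5;
    default:  return 1;                     /* placeholder for the sketch  */
    }
}

/* Cache-fill step: in hardware all LINE_BYTES of these little decodes
   would run in parallel, since each one looks at only a byte or two. */
static void fill_hints(const uint8_t *line, struct hint *h)
{
    for (int i = 0; i < LINE_BYTES; i++) {
        h[i].spec_len    = spec_len(&line[i]);
        h[i].spec_is_mem = ((line[i] >> 4) != 0x5);   /* not register mode    */
        h[i].op_nspec    = (line[i] == 0xC1) ? 3 : 0; /* only addl3 known here */
        h[i].op_ucode    = 0;
    }
}

/* Fetch step: a short serial chain that hops over precomputed lengths.
   Register specifiers fold into the main uop; memory specifiers become
   separate uops that write a temporary register. */
static int pick_uops(const struct hint *h, int start, int max_uops)
{
    int pos = start, uops = 1;                /* 1 = the ALU uop itself */
    if (h[pos].op_ucode)
        return -1;                            /* stall, go to microcode */
    int nspec = h[pos].op_nspec;
    pos++;                                    /* skip the opcode byte   */
    for (int i = 0; i < nspec; i++) {
        if (h[pos].spec_is_mem)
            uops++;                           /* extra load/store uop   */
        pos += h[pos].spec_len;
    }
    return (uops <= max_uops) ? uops : -1;    /* -1: spill to next cycle */
}

int main(void)
{
    /* The addl3 example again, padded out to one 16-byte line. */
    uint8_t line[LINE_BYTES] = { 0xC1, 0x51, 0xC2, 0xE1, 0x02, 0x44, 0x63 };
    struct hint h[LINE_BYTES];

    fill_hints(line, h);
    printf("uops for the addl3 example: %d\n", pick_uops(h, 0, 4)); /* prints 3 */
    return 0;
}

Obviously this glosses over instructions that straddle a cache line, and over immediate mode, where the specifier length depends on the opcode's data type; those are exactly the cases you would push to the complex decoder or to microcode.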
Ronald