ISA support for constant count loops: ineffective compared to micro-threads

By: Etienne (etienne_lorrain.delete@this.yahoo.fr), January 22, 2020 4:35 pm
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on January 22, 2020 2:28 pm wrote:
> Heikki Kultala (heikki.kultala.delete@this.tuni.fi) on January 22, 2020 9:47 am wrote:
> > ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on January 22, 2020 8:18 am wrote:
> >
> > > > Note that with Out Of Order processor you may already treat few iteration in parallel right now...
> > >
> > > Only if you can statically (at compile-time) prove that the few iterations are data independent.
> >
> > No.
> >
> > OoOE works fully dynamically. The processor fetches multiple iterations of the loop (later
> > iterations of the micro-op cache) and instructions from ANY iteration can execute.
> >
> > There is no need to prove ANYTHING at compile time; The incoming code looks like completely serial
> > code but the processor extracts huge amount of parallelism from it fully dynamically.
> >
> > If the processor makes some wrong guess like branch misprediction or incorrectly predicted
> > memory aliasing, then it just reverts the "last known correct state".
>
> I think we mutually agree that software can for example choose the 1st instruction (X1)
> from the 1st 4 iterations (I1-I4) of the loop, which can be written as tuple (I1.X1, I2.X1,
> I3.X1, I4.X1), and thus can expect a 4-wide CPU to begin executing those 1st 4 loop iterations
> in parallel. This can continue with the 2nd instruction (X2), 3rd, and so on, until a branch
> instruction is encountered that will cause the 4 iterations to diverge.
>
> The tuple can only be constructed if the 4 incarnations of X1 can at compile-time
> be proved to be data independent (i.e: I2.X1 does not depend on the whole
> I1; I3.X1 does not depend on the whole I1 and the whole I2, etc).
>
> An issue with the approach in the above paragraph is that the number
> of registers available to the compiler is divided by about 4.

What you are describing is compile time parallelism extraction, it has already been tested and has proved highly unsuccessful - the proof is you did not heard of the processors and compilers already...

> By definition (by necessity) the last instruction in any kind of loop is a conditional
> jump instruction. Current OoO [x86] CPUs are incapable of concurrently executing multiple
> conditional jump instructions on a single core - the single-threaded nature of the instruction
> stream is too deeply rooted in the basic design of any OoO [x86] CPU core.

Current OoO CPUs have 80+ instructions in flight, and can execute up to 4 IPC, i.e. 4 Instructions Per Cycle. Unconditional jumps are not a problem, conditional jumps are predicted (as-soon-as possible).

> Current processors cannot fetch multiple iterations of the loop (from L1 or from µop cache) because
> the CPU can only see at most 4 instructions per clock cycle (5 in case of a fused instruction, 6 in
> case the source is µop cache), and can resolve at most 1 conditional jump instruction per cycle.
>
> Obviously, parallel execution of 16 loop iterations in parallel would require concurrent
> execution of up to 16 conditional jump instructions per cycle - which sharply contradicts
> the fact that current x86 OoO CPUs can execute at most 1 such instruction per cycle.
>
> The idea I proposed enables a CPU to start 16 loop iterations in parallel, with the
> rather obvious requirement that the CPU implementation should be able to resolve up
> to 16 conditional instructions in parallel, and with the rather obvious benefit that
> the number of registers available to the compiler isn't divided by about 16.

A streaming processor, like on AMD/NVIDIA video card, has hundred of registers organized in banks and executes the same instruction a lot of times (in each banks) before going to the next instruction, in "waves" that cannot be described in three lines.

> It is different from multi-core because there aren't 16 copies of L1I/L1D
> caches - the L1I/L1D caches "just" need to be multi-ported.
>
> In practice, even with 16 iterations started, we would be lucky if the real-world speedup factor
> is 2 (so just a 100% speedup, not 1600%) - which is however still much larger than the speedup which
> can be expected from "ISA support for constant count loops" that started this discussion.

If you know you have exactly 8 loops to do with 8 instructions each, on a processor with 80+ instructions in flight, you can fill 64 instructions in flight immediately and even then next instructions - as soon as they are decoded...

>
> -atom

< Previous Post in ThreadNext Post in Thread >
TopicPosted ByDate
LLVM comments on mem*Maynard Handley2020/01/14 01:51 PM
  LLVM comments on mem*Anon32020/01/15 06:28 AM
  Interesting comment about rep instructions & code sizeGabriele Svelto2020/01/15 07:12 AM
    Interesting comment about rep instructions & code sizenone2020/01/15 08:59 AM
      Interesting comment about rep instructions & code sizeGabriele Svelto2020/01/16 03:56 AM
        Interesting comment about rep instructions & code sizeLinus Torvalds2020/01/16 10:12 AM
          ISA support for constant count loopsPaul A. Clayton2020/01/16 11:28 AM
            ISA support for constant count loopsGabriele Svelto2020/01/16 02:15 PM
              PowerPC "front-end registers"Paul A. Clayton2020/01/16 03:34 PM
              ISA support for constant count loopsTravis Downs2020/01/16 05:21 PM
                ISA support for constant count loopsLinus Torvalds2020/01/16 08:41 PM
                  ISA support for constant count loopsTravis2020/01/16 09:48 PM
                    ISA support for constant count loopsBrett2020/01/17 01:28 AM
              Branch to CTRMaya2020/01/18 08:15 AM
                Branch to CTRGabriele Svelto2020/01/18 01:14 PM
            ISA support for constant count loopsanon2020/01/17 08:28 AM
              ISA support for constant count loopsTravis Downs2020/01/17 08:34 AM
            ISA support for constant count loops: ineffective compared to micro-threads2020/01/20 08:02 AM
              ISA support for constant count loops: ineffective compared to micro-threadssomeone2020/01/20 12:23 PM
                ISA support for constant count loops: ineffective compared to micro-threads2020/01/22 09:23 AM
              ISA support for too slow computersEtienne2020/01/21 02:42 AM
                ISA support for constant count loops: ineffective compared to micro-threads2020/01/22 09:18 AM
                  ISA support for constant count loops: ineffective compared to micro-threads2020/01/22 10:04 AM
                  ISA support for constant count loops: ineffective compared to micro-threadsHeikki Kultala2020/01/22 10:47 AM
                    ISA support for constant count loops: ineffective compared to micro-threadsdmcq2020/01/22 01:31 PM
                    ISA support for constant count loops: ineffective compared to micro-threads2020/01/22 03:28 PM
                      ISA support for constant count loops: ineffective compared to micro-threadsEtienne2020/01/22 04:35 PM
          Interesting comment about rep instructions & code sizeGabriele Svelto2020/01/16 02:00 PM
    Interesting comment about rep instructions & code sizeTravis Downs2020/01/15 03:40 PM
      Interesting comment about rep instructions & code sizeChester2020/01/15 05:16 PM
        Interesting comment about rep instructions & code sizeTravis Downs2020/01/15 05:50 PM
          Interesting comment about rep instructions & code sizeChester2020/01/15 07:24 PM
            Interesting comment about rep instructions & code sizeTravis Downs2020/01/16 02:26 PM
              Interesting comment about rep instructions & code sizeChester2020/01/17 01:16 PM
                Interesting comment about rep instructions & code sizeTravis Downs2020/01/17 03:41 PM
        Interesting comment about rep instructions & code sizeGabriele Svelto2020/01/16 03:53 AM
Reply to this Topic
Name:
Email:
Topic:
Body: No Text
How do you spell purple?