Where do CPUs go next?

By: Maynard Handley (name99.delete@this.name99.org), May 12, 2019 12:38 pm
Room: Moderated Discussions
Given how dead this thread is, let me throw out a new idea. It builds on a few previous ideas that have been more or less uniformly hated by my legions of unfans, so feel free to hate this one as well...
This is something of a long play; I don't think it makes sense until the CPU end-game, but since we are approaching that, maybe we'll see it in ten years or so.

OK, consider the idea of single-ISA het cores. Right now that's, in practical terms, a synonym for big.LITTLE, but more is possible.
The most obvious use of the same idea beyond big.LITTLE is two (or three or four) cores that present to the OS as a single core, with a HW controller that uses some sort of info (maybe OS hints regarding priority, maybe performance counters) to move code transparently between the different cores. This APPEARS to be what's going on with the A10, though it's uncertain to what extent HW rather than SW controls the moves between cores, and whether the single-core presentation is what the OS sees or only what non-privileged code sees. Either way, the two A10 cores still form a big.LITTLE type pairing.

An alternative viewpoint (which appears to be where Apple has gone with the A11 and A12, just like the rest of ARM [and now Intel with Lakefield]) is to say that moving code between such very different cores, primarily for the purpose of lower energy, is best handled by the OS, so once again visible big.LITTLE. So game over, right?

No, not at all. A different strand of thought (primarily at North Carolina State) asks the question: suppose that you're offered a process and a power budget but NOT an area budget, and told to go wild designing the most performant core you can. Given a particular corpus of code (which defines "most performant") there's more or less a single best "average" core that generates the best SPEC2006 number (or whatever).
BUT -- suppose you are allowed to design TWO cores, which present to the OS as one core, and incorporate a HW controller that (based on performance counters) decides when to (transparently) move code from one core to the other? OK, you define some sensible ground rules -- realistically how long does it take to transfer state, how often are you going to make a transfer decision, ... and start the optimization process. What comes out of it?
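To make those ground rules concrete, here is a minimal sketch of the per-epoch decision such a controller would face. Everything in it is my own assumption for illustration: the function name, the idea that the controller has an estimated instruction rate for each core, and the numbers. How those rates would actually be derived from performance counters is left open.

```python
def should_migrate(rate_here, rate_there, epoch_ns, transfer_ns):
    """Return True if moving for the next epoch is predicted to retire more work.

    rate_here / rate_there: estimated instructions per ns on the current core
    and on the other core (hypothetical inputs; producing them from
    performance counters is the hard part and is not modeled here).
    epoch_ns: time until the next transfer decision.
    transfer_ns: time needed to move architectural state to the other core.
    """
    work_if_stay = rate_here * epoch_ns
    # Assume the thread makes no progress while its state is in flight.
    work_if_move = rate_there * max(epoch_ns - transfer_ns, 0)
    return work_if_move > work_if_stay


# Illustrative numbers only: 100 us epoch, 5 us to move state.
print(should_migrate(2.0, 2.4, 100_000, 5_000))   # large predicted gain
print(should_migrate(2.0, 2.05, 100_000, 5_000))  # gain too small to pay for the move
```

The point of the sketch is just that transfer time and epoch length set a floor on how much better the other core has to be before a move pays off.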

Well (to simplify slightly) if you allow for two different cores, then one will be a mid-range brainiac (reasonably wide, reasonably deep ROB), the other something of a speed demon (still OoO, but two-wide, smaller caches and such structures, clocked rather higher). And this sort of setup gives you about a 10% speed boost with oracle scheduling, maybe 8% with realistically possible scheduling.
If you push this further to say 4 differently optimized cores, you can get to about 15% possible with oracle, about 12% with a realistic scheduler.

Now the cost for all this, in the models NCSU uses, is double, triple, or quadruple the area. Which is not THAT terrible but also still not great. Can we do better?

Which makes me ask the question: why only ever use one of these n (let's say 2 for now) cores?
So what I envision is, let's say, a pair of CPUs that presents to the OS as a pair of virtual CPUs, A and B. These are backed by two real CPUs, 1 and 2. A and B are identical as far as the OS (and compiler) are concerned, but 1 and 2 are different, e.g. one is brainiac, one is speed demon. The OS does its OS thing, scheduling code on A and B, but behind the scenes there's a performance controller that's figuring out which of 1 vs 2 is the better match for each of the two code sequences, re-considering every epoch.

Crucially, note, both 1 and 2 are GOOD CPUs. They're both exceptional at a particular type of code, but they're not terrible at their non-preferred type of code. It's not a catastrophe if brainiac-optimal code is forced to run on the speed demon CPU because both A and B are currently executing brainiac-optimal code...
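A rough sketch of what the controller decides each epoch for the two-CPU case, under my own assumptions: the `perf` table of predicted relative throughput is hypothetical (a real controller would have to estimate it), and the numbers are made up purely to illustrate the "both want the brainiac" case.

```python
def assign_pair(perf):
    """Map virtual CPUs A and B onto physical CPUs 1 and 2.

    perf[v][p] is the predicted throughput of the code currently on virtual
    CPU v if it ran on physical CPU p (hypothetical estimates).
    With two cores there are only two options: straight or swapped.
    """
    straight = perf["A"][1] + perf["B"][2]
    swapped = perf["A"][2] + perf["B"][1]
    return {"A": 1, "B": 2} if straight >= swapped else {"A": 2, "B": 1}


# Core 1 = brainiac, core 2 = speed demon; numbers are illustrative only.
mixed = {"A": {1: 1.10, 2: 1.00}, "B": {1: 1.00, 2: 1.08}}
both_brainiac = {"A": {1: 1.10, 2: 1.00}, "B": {1: 1.05, 2: 1.00}}

print(assign_pair(mixed))          # each thread gets its preferred core
print(assign_pair(both_brainiac))  # B lands on the speed demon and loses only a little
```

In the second case one thread is forced onto its non-preferred core, but since that core is still a good CPU the loss is a few percent rather than a big.LITTLE-style cliff.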

So what can we say about this?
Obviously it requires more design+validation effort. You're now designing not one but two CPUs. Of course to the extent that everything is parameterized in your design and you perform massive sweeps of parameter space looking for the optimal combination, this is more work for your computers, but not that much more for the designers.
It also works best for environments that naturally present "lots" of cores and execute many types of code simultaneously. So phones and desktops; servers/data warehouses seem mostly to do the same thing over and over. But say a phone with 4 different big cores, usually running two or three threads doing two or three different types of work, might get fairly close to that 12% case even though all cores can now be in use simultaneously (as opposed to the modeled case, where only one of the four cores is ever in use at once); likewise a desktop with perhaps two clusters, each of 4 different big cores.
And presumably it works as well with the LITTLE as the big cores, again two or four variants each optimized for performance on a particular type of code at a particular energy point.
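For the phone case (4 different big cores, two or three threads), the same decision becomes a small assignment problem. A minimal sketch, assuming a hypothetical table of predicted throughput per thread per core; the thread names, core names, and numbers are all invented for illustration. With only 4 cores, brute force over the possible placements is trivially cheap.

```python
from itertools import permutations


def best_assignment(perf, cores):
    """Place each thread on a distinct core to maximize total predicted throughput.

    perf[t][c] is the predicted throughput of thread t on core c
    (hypothetical estimates; a real controller would derive them per epoch).
    """
    threads = list(perf)
    # Try every way of giving the threads distinct cores and keep the best total.
    best = max(
        permutations(cores, len(threads)),
        key=lambda combo: sum(perf[t][c] for t, c in zip(threads, combo)),
    )
    return dict(zip(threads, best))


cores = ["c0", "c1", "c2", "c3"]
perf = {
    "ui":     {"c0": 1.12, "c1": 1.00, "c2": 1.03, "c3": 0.98},
    "decode": {"c0": 1.00, "c1": 1.10, "c2": 1.02, "c3": 1.01},
    "net":    {"c0": 1.05, "c1": 0.99, "c2": 1.00, "c3": 1.09},
}
print(best_assignment(perf, cores))
```

When the threads want different cores, as here, each gets its best match and one core sits idle; that is the situation in which the simultaneous-use scheme should come closest to the one-thread-at-a-time numbers.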

Finally there are some obvious dimensions of difference that go beyond the brainiac/speed demon split I posited. For example on an SVE type machine, maybe have one core with wide vectors (512 or even 1024) and give the other cores short vectors (128 or 256)?

All this for a payoff of ~10% may not seem worth the effort to some; which is why I relegate it to the CPU endgame, when every other trick has hit its limit. But I would be curious as to whether anyone is already thinking this way. QC works closely with NCSU so maybe? Apple is another obvious candidate, not least given their A10 experience.
IF Apple did this, my expectation is that the first attempt would be with the small cores, splitting each pair into two cores with slightly different optimization points. I say this because people demand less of the small cores, so it's easier to give them somewhat unexpected performance characteristics (and monitor how well the HW controller and the algorithm for moving code transparently between cores works in the real world) without generating a furor.
Given that this is Apple, how would we even know, if they don't decide this is one of the few points they want to highlight at a public event? I think we should try to be more open from this point on in looking closely at cores to see whether they're visibly different. Different cache sizes, maybe even different SVE lengths, might be visible to the naked eye?