Great post

Article: Intel's Haswell CPU Microarchitecture
By: David Kanter, September 1, 2014 7:12 pm
Room: Moderated Discussions
Maynard Handley on September 1, 2014 3:51 pm wrote:
> Sireesh on September 1, 2014 2:48 pm wrote:
> > David Kanter on November 13, 2012 3:43 pm wrote:
> > > Intel’s Haswell CPU is the first core optimized for 22nm and includes a huge number of innovations
> > > for developers and users. New instructions for transactional memory, bit-manipulation, full 256-bit
> > > integer SIMD and floating point multiply-accumulate are combined in a microarchitecture that essentially
> > > doubles computational throughput and cache bandwidth. Most importantly, the microarchitecture
> > > was designed for efficiency and extends Intel’s offerings down to 10W tablets, while maintaining
> > > leadership for notebooks, desktops, servers and workstations.
> > >
> > >
> > >
> > > As always, comments, questions and feedback are encouraged.
> > >
> > > DK
> >
> > I'm new to this field and this question may sound stupid to you. I'm a little confused by the terminology
> > here. According to what I've studied in our basic computer architecture course, instructions enter
> > the issue queue after renaming. The issue queue contains the tags of the source operands, plus valid and ready
> > bits. When all the source operands are available, the instruction is steered into the execution units through
> > the dispatch ports. But in this article, it is mentioned that the reorder buffer holds the operand
> > status information. What exactly is the job of the 60-entry instruction scheduler then? Why can't
> > instructions be sent directly from the reorder buffer to the execution units?
> >
> There are two conceptually different problems that the CPU is
> solving, and it uses two different queues to solve them.
> Problem one is speculation/recovery. The CPU engages in branch prediction (and perhaps other forms of speculation).
> When it guesses incorrectly, it needs to recover. The ROB (reorder buffer) solves this problem. It holds
> instructions which have started down the path of execution but for which the CPU is not yet CERTAIN that
> the instruction is non-speculative. So, for example, imagine a load which misses in cache, then a branch
> which depends on that load, then a whole lot of other instructions. Every instruction after the branch is
> speculative. They will all pile up in the ROB and, if none of them depends on the load, then they may all
> have finished execution. They are almost done, but we can't be CERTAIN that they're the correct instructions
> until we're certain the branch went in the direction we speculated --- so the instructions have to sit in
> the ROB and can't yet commit their values to the permanent state of the machine.
> Once we know the branch went in the right direction, then we can go through the various instructions
> in the ROB after the branch and commit their state (for example, write out their value to an architected
> register, or store pending writes to the real memory system instead of keeping them in a buffer.)
> If, on the other hand, we guessed the branch direction incorrectly, then we flush
> everything in the ROB after the branch; no permanent state (memory system, architected registers)
> has changed, and we fetch instructions from the correct branch direction.
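The commit/flush behavior described here can be sketched in a few lines. This is a toy model with invented names, not Intel's actual design: instructions retire strictly in program order, so a finished instruction still waits behind an unresolved older one.

```python
from collections import deque

class ROB:
    """Toy reorder buffer: an illustrative sketch, not Intel's design."""
    def __init__(self, size=192):
        self.size = size
        self.entries = deque()            # oldest instruction on the left

    def allocate(self, name):
        assert len(self.entries) < self.size, "ROB full: front end stalls"
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def commit(self):
        """Retire finished instructions strictly in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired

    def flush_younger_than(self, entry):
        """Mispredicted branch: squash everything after it in the buffer."""
        idx = list(self.entries).index(entry)
        while len(self.entries) > idx + 1:
            self.entries.pop()

rob = ROB()
load = rob.allocate("load r1, [mem]")   # misses in cache, resolves late
br   = rob.allocate("branch on r1")     # depends on the load
add  = rob.allocate("add r2, r3, r4")   # independent, finishes early
add["done"] = True
assert rob.commit() == []               # done, but stuck behind the branch
load["done"] = True
br["done"] = True                       # prediction confirmed correct
assert rob.commit() == ["load r1, [mem]", "branch on r1", "add r2, r3, r4"]
```

The `flush_younger_than` path is what runs instead of the final `commit` when the branch turns out to have been guessed wrong.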
> A larger ROB is nice primarily because it means you can keep running instructions for
> longer after you miss a load in cache, and that in turn means you're more likely to generate
> more memory level parallelism and so, effectively, prefetch future loads.
> There is nothing especially difficult about building a large ROB --- current machines tend
> to have one around 200 entries deep. So why not make it a thousand, or two thousand,
> instructions long and be able to generate all the memory level parallelism you like?
> The problem that's ACTUALLY difficult is holding all this temporary state --- we need a lot of physical
> registers and store queue slots to hold the values that we can't yet commit, and it's having lots of
> registers or lots of queue slots that's difficult. Basically the ROB length is determined by how many
> of those we have. Making it larger than is determined by these other factors would be quite feasible
> but pointless because it would generally never fill up --- instruction flow would stop before the
> ROB is full because some other resource has run out, like lack of physical registers.
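The memory-level-parallelism argument is easy to put numbers on. The latency and miss-rate figures below are invented for illustration; the point is only that overlap grows with ROB size until some other resource runs out first.

```python
# Back-of-envelope model of memory-level parallelism vs. ROB size.
# All numbers are illustrative assumptions, not measurements of any CPU.
MISS_LATENCY = 300      # cycles for a load that misses all the way to DRAM
INSNS_PER_MISS = 50     # assume one missing load per ~50 instructions

def overlapped_misses(rob_size):
    """While the oldest miss is outstanding, the core can only run ahead as
    far as the ROB allows, which bounds how many later misses it discovers."""
    return max(1, rob_size // INSNS_PER_MISS)

def effective_cycles_per_miss(rob_size):
    """Misses that overlap share one memory round trip."""
    return MISS_LATENCY / overlapped_misses(rob_size)

for size in (60, 200, 1000):
    print(f"ROB {size:4}: {overlapped_misses(size):2} misses in flight, "
          f"~{effective_cycles_per_miss(size):3.0f} cycles per miss")
```

Under these made-up numbers a 200-entry ROB keeps four misses in flight where a 60-entry one manages only one, which is the "effectively prefetch future loads" effect.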
> Problem two is executing instructions out of order. We want to have a pool of instructions that are
> potentially executable, from which every cycle we execute those that have their values available.
> The larger this pool is, the more often we can keep doing useful work even if we're waiting on data
> (maybe a load from L1 or L2, maybe we're waiting on a 5-cycle FP result, maybe we have a multi-cluster
> system and are waiting for a value to propagate from the other cluster). This pool of instructions
> is the issue queue. On current high end machines it tends to be about 60 instructions.
> A larger pool would be nice, but growing this pool larger is tough because you need to do a whole
> lot of work in one cycle for optimal performance --- essentially you want to, in one cycle, scan
> for the earliest values that have all their operands available, and that have an execution unit free,
> and respect dependencies, and handle forwarding results from previous executions and so on.
> You can split this work into two cycles, but that REALLY hurts performance because it hurts back-to-back execution
> of simple arithmetic. So you're better off accepting a smaller issue queue and a faster cycle time.
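Here is a minimal sketch of the per-cycle wakeup/select work described above. The data layout, names, and two-port setup are my own simplifications: the real scheduler does all of this in parallel hardware in one cycle, not in a loop.

```python
# Toy one-cycle "wakeup + select" for an issue queue. Illustrative sketch:
# names, port numbering, and data layout are invented, not Intel's.
def select(queue, free_ports):
    """Issue the oldest ready uops that can claim a free, matching port."""
    issued = []
    for uop in queue:                       # queue is oldest-first
        if uop["waiting_on"]:               # some operand isn't ready yet
            continue
        usable = uop["ports"] & free_ports  # needs a free, matching port
        if usable:
            port = min(usable)
            free_ports.discard(port)
            issued.append((uop["name"], port))
    names = {name for name, _ in issued}
    queue[:] = [u for u in queue if u["name"] not in names]
    return issued

def wakeup(queue, produced_tag):
    """Broadcast a just-computed result so waiting consumers become ready."""
    for uop in queue:
        uop["waiting_on"].discard(produced_tag)

queue = [
    {"name": "add1", "waiting_on": set(),    "ports": {0, 1}},
    {"name": "add2", "waiting_on": {"add1"}, "ports": {0, 1}},  # dependent
    {"name": "mul",  "waiting_on": set(),    "ports": {1}},
]
first = select(queue, {0, 1})    # add1 and mul issue; add2 must wait
wakeup(queue, "add1")            # add1's result tag is broadcast
second = select(queue, {0, 1})   # now add2 is ready
assert first == [("add1", 0), ("mul", 1)]
assert second == [("add2", 0)]
```

The back-to-back constraint is visible here: `wakeup` for `add1` must land before the very next `select`, or the dependent `add2` loses a cycle.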
> Essentially where the larger issue queue helps is code that's something like a long loop of
> dependent instructions. If, say, the loop generates 50 successively dependent instructions,
> then we can keep pumping new (independent) instructions into the ten slots at the end, and
> executing some of those independent instructions along with one of the dependent instructions
> each cycle. If the loop generates say 200 successive dependent instructions, then we're screwed
> --- until we get to the last 60 instructions, the first 140 will execute one at a time.
> This is obviously an argument for a larger issue queue, but the problem is that code tends to
> be very bursty in these dependency chains --- if the bulk of these chains is, say, 500 instructions
> long, then it's not going to matter much whether our queue is 30 instructions long or 60. So
> basically you make the queue as big as you can given cycle and power limitations, and you accept
> that it helps most of the time, and will be overwhelmed by the occasional loop.
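The arithmetic in the loop example above can be captured in a one-line model. The function below is an illustration of the argument, not a simulator: it counts only the cycles during which the chain is the sole thing the queue can issue.

```python
# Toy model of a chain of single-cycle dependent ops in a scheduler window.
# Purely illustrative arithmetic, matching the worked example in the text.
def exposed_serial_cycles(chain_len, window):
    """Cycles during which the chain runs one op at a time with nothing else
    to issue: only the tail of the chain that fits in the window can overlap
    with the independent instructions that follow it."""
    return max(0, chain_len - window)

assert exposed_serial_cycles(50, 60) == 0     # whole chain fits: overlap throughout
assert exposed_serial_cycles(200, 60) == 140  # first 140 run one at a time
assert exposed_serial_cycles(500, 30) == 470  # long chain, small queue: bad
assert exposed_serial_cycles(500, 60) == 440  # doubling the queue barely helps
```

The last two lines are the burstiness point: against a 500-long chain, growing the queue from 30 to 60 recovers only 30 of roughly 470 exposed cycles.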
> To summarize --- essentially these are two distinct "data structures"
> because they're specialized for different tasks.
> One is a queue --- instructions enter and leave in FIFO order --- and its primary constraint
> is to be as large as it can be (given what I said about physical registers and the store
> queue) while taking as little area and power as possible. A secondary constraint is the
> details of how you handle branch misprediction and so how you recover. There are different
> options available which differ in their performance, their power, and their area.
> The other is a godawfully complicated thing that's constantly shuffling the instructions while
> also scanning and updating them. It's going to burn a hell of a lot of power, it's going to limit
> your cycle time, and you just pay whatever the area costs are to get the damn thing to work.

Maynard, that was a great response - thanks!
