Thanks :)

Article: Intel's Haswell CPU Microarchitecture
By: Alberto, September 2, 2014 1:42 am
Room: Moderated Discussions
Maynard Handley on September 1, 2014 3:51 pm wrote:
> Sireesh on September 1, 2014 2:48 pm wrote:
> > David Kanter on November 13, 2012 3:43 pm wrote:
> > > Intel’s Haswell CPU is the first core optimized for 22nm and includes a huge number of innovations
> > > for developers and users. New instructions for transactional memory, bit-manipulation, full 256-bit
> > > integer SIMD and floating point multiply-accumulate are combined in a microarchitecture that essentially
> > > doubles computational throughput and cache bandwidth. Most importantly, the microarchitecture
> > > was designed for efficiency and extends Intel’s offerings down to 10W tablets, while maintaining
> > > leadership for notebooks, desktops, servers and workstations.
> > >
> > >
> > >
> > > As always, comments, questions and feedback are encouraged.
> > >
> > > DK
> >
> > I'm new to this field and this question may sound stupid to you. I'm a little confused by the terminology
> > here. According to what I studied in our basic computer architecture course, instructions enter the issue
> > queue after renaming. The issue queue contains the tags of the source operands, plus valid and ready
> > bits. When all the source operands are available, the instruction is steered into the execution units through
> > the dispatch ports. But in this article, it is mentioned that the reorder buffer has the operand
> > status information. What exactly is the job of the 60 entry instruction scheduler then? Why can't the
> > instructions be sent directly from the reorder buffer to the execution units?
> >
> There are two conceptually different problems that the CPU is
> solving, and it uses two different queues to solve them.
> Problem one is speculation/recovery. The CPU engages in branch prediction (and perhaps other forms of speculation).
> When it guesses incorrectly, it needs to recover. The ROB (reorder buffer) solves this problem. It holds
> instructions which have started down the path of execution but for which the CPU is not yet CERTAIN that
> the instruction is non-speculative. So, for example, imagine a load which misses in cache, then a branch
> which depends on that load, then a whole lot of other instructions. Every instruction after the branch is
> speculative. They will all pile up in the ROB and, if none of them depends on the load, then they may all
> have finished execution. They are almost done, but we can't be CERTAIN that they're the correct instruction
> until we're certain the branch went in the direction we speculated --- so the instructions have to sit in
> the ROB and can't yet commit their values to the permanent state of the machine.
> Once we know the branch went in the right direction, then we can go through the various instructions
> in the ROB after the branch and commit their state (for example, write out their value to an architected
> register, or store pending writes to the real memory system instead of keeping them in a buffer.)
> If, on the other hand, we guessed the branch direction incorrectly, then we flush
> everything in the ROB; no permanent state (memory system, architected registers)
> has changed, and we reload instructions from the correct branch direction.
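The commit/flush behavior described above can be sketched in a few lines of Python. This is a toy model, not how any real core is built, and all class and method names here are made up for illustration:

```python
# Toy model of a reorder buffer: instructions retire strictly in order once
# they are known to be non-speculative; a mispredicted branch discards all
# younger entries before any of them touch architected state.

class ROBEntry:
    def __init__(self, name, is_branch=False):
        self.name = name
        self.is_branch = is_branch
        self.done = False          # finished executing, but still speculative

class ReorderBuffer:
    def __init__(self, capacity=200):   # roughly the size of current machines
        self.capacity = capacity
        self.entries = []               # oldest first (FIFO order)
        self.committed = []             # values made architecturally permanent

    def insert(self, entry):
        assert len(self.entries) < self.capacity
        self.entries.append(entry)

    def commit_ready(self):
        # Commit strictly in order: stop at the first unfinished instruction,
        # even if many younger instructions have already finished executing.
        while self.entries and self.entries[0].done:
            self.committed.append(self.entries.pop(0).name)

    def flush_after(self, branch_name):
        # Misprediction: discard the branch's successors. Nothing needs to be
        # undone, because none of them ever wrote architected state.
        idx = next(i for i, e in enumerate(self.entries)
                   if e.name == branch_name)
        self.entries = self.entries[:idx + 1]
```

For instance, a pending load followed by a branch and three finished instructions commits nothing (the load is still outstanding), and flushing at the branch leaves only the load and the branch in flight.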
> A larger ROB is nice primarily because it means you can keep running instructions for
> longer after you miss a load in cache, and that in turn means you're more likely to generate
> more memory level parallelism and so, effectively, prefetch future loads.
> There is nothing especially difficult about building a large ROB --- current machines tend
> to have a value around 200 instructions long. So why not make it a thousand, or two thousand,
> instructions long and be able to generate all the memory level parallelism you like?
> The problem that's ACTUALLY difficult is holding all this temporary state --- we need a lot of physical
> registers and store queue slots to hold the values that we can't yet commit, and it's having lots of
> registers or lots of queue slots that's difficult. Basically the ROB length is determined by how many
> of those we have. Making it larger than these other factors dictate would be quite feasible
> but pointless, because it would generally never fill up --- instruction flow would stop before the
> ROB is full because some other resource has run out, like physical registers.
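The point about the ROB being bounded by other resources can be reduced to a one-line toy calculation (the register-file size below is made up, purely for illustration):

```python
# Toy illustration: dispatch stalls at whichever resource runs out first, so
# ROB entries beyond that point would simply go unused.

def entries_before_stall(rob_size, phys_regs, regs_per_inst=1):
    # Each in-flight instruction that writes a register holds one physical
    # register until it commits; the smaller limit wins.
    return min(rob_size, phys_regs // regs_per_inst)
```

With a hypothetical 168-entry physical register file, even a 2000-entry ROB would never hold more than 168 register-writing instructions in flight.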
> Problem two is executing instructions out of order. We want to have a pool of instructions that are
> potentially executable, from which every cycle we execute those that have their values available.
> The larger this pool is, the more often we can keep doing useful work even if we're waiting on data
> (maybe a load from L1 or L2, maybe we're waiting on a 5 cycle FP result, maybe we have a multi-cluster
> system and are waiting for a value to propagate from the other cluster). This pool of instructions
> is the issue queue. On current high end machines it tends to be about 60 instructions.
> A larger pool would be nice, but growing this pool larger is tough because you need to do a whole
> lot of work in one cycle for optimal performance --- essentially you want to, in one cycle, scan
> for the earliest values that have all their operands available, and that have an execution unit free,
> and respect dependencies, and handle forwarding results from previous executions and so on.
> You can split this work into two cycles, but that REALLY hurts performance because it hurts back to back execution
> of simple arithmetic. So you're better off accepting a smaller issue queue and a faster cycle time.
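The single-cycle "scan for ready instructions, issue, and broadcast results" loop described above can be sketched as follows. Again this is a simplified model with invented names, collapsing wakeup, select, and forwarding into one function:

```python
# Toy wakeup/select loop: each cycle, pick the oldest entries whose source
# operands are all ready and for which an execution port is free, then
# broadcast the results so dependents can issue back-to-back next cycle.

def schedule(queue, ready_regs, num_ports=4):
    """queue: list of (dest, [sources]) tuples, oldest first.
    ready_regs: set of register names whose values are available.
    Returns the list of destinations issued this cycle."""
    issued = []
    for dest, sources in list(queue):
        if len(issued) == num_ports:        # all execution ports taken
            break
        if all(s in ready_regs for s in sources):
            issued.append(dest)
            queue.remove((dest, sources))
    # Result broadcast ("forwarding"): newly produced values wake their
    # dependents, so a dependent can issue in the very next cycle.
    ready_regs.update(issued)
    return issued
```

In a small example, an instruction depending on a result produced this cycle cannot issue this cycle, but the broadcast makes it ready for the next one, which is exactly the back-to-back behavior that splitting the loop over two cycles would break.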
> Essentially where the larger issue queue helps is code that's something like a long loop of
> dependent instructions. If, say, the loop generates 50 successively dependent instructions,
> then we can keep pumping in new (independent) instructions into the ten slots at the end, and
> executing some of those independent instructions along with one of the dependent instructions
> each cycle. If the loop generates say 200 successive dependent instructions, then we're screwed
> --- until we get to the last 60 instructions, the first 140 will execute one at a time.
> This is obviously an argument for a larger issue queue, but the problem is that code tends to
> be very bursty in these dependency chains --- if the bulk of these chains is say 500 instructions
> long, then it's not going to matter much if our queue is 30 instruction long or 60 long. So
> basically you make the queue as big as you can given cycle and power limitations, and you accept
> that it helps most of the time, and will be overwhelmed by the occasional loop.
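The arithmetic behind the 50-versus-200 example above is simple enough to write down directly (a back-of-the-envelope model, not a claim about any specific core):

```python
# Window entries not occupied by a dependent chain are free to hold
# independent instructions that can issue in parallel with the chain.

def overlapped_slots(chain_len, window):
    return max(0, window - chain_len)
```

So a 50-instruction chain in a 60-entry window leaves 10 slots for independent work each cycle, while a 200-instruction chain leaves none until its tail finally fits in the window, and the difference between a 30-entry and a 60-entry window vanishes once chains are much longer than either.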
> So to summarize --- essentially these are two distinct "data structures"
> because they're specialized for different tasks.
> One is a queue --- instructions enter and leave in FIFO order --- and its primary constraint
> is to be as large as it can be (given what I said about physical registers and the store
> queue) while taking as little area and power as possible. A secondary constraint is the
> details of how you handle branch misprediction and so how you recover. There are different
> options available which differ in their performance, their power, and their area.
> The other is a godawfully complicated thing that's constantly shuffling the instructions while
> also scanning and updating them. It's going to burn a hell of a lot of power, it's going to limit
> your cycle time, and you just pay whatever the area costs are to get the damn thing to work.

Many thanks Maynard, great post. It is very constructive and gives a lot of information even to posters like me who "work all day" with CPUs for mechanical engineering but don't have a strong background in computer science or EE in general.

I have missed some of the historic posters in this forum, with their amazing and understandable answers to questions and topics, their lack of self-celebration, and their good sense of humor. RWT has been a little elitist lately, and this is not a good sign, because the "users" and their doubts, in these strange days, are more important than pretty useless discussions between "experts" who care nothing about the readability of their responses to a wider audience (who buys). There are other places on the web for discussions between colleagues.
