Ice Lake updates to optimization manual

By: Travis Downs (travis.downs.delete@this.gmail.com), October 5, 2019 6:38 pm
Room: Moderated Discussions
Intel has updated the optimization manual to reflect Ice Lake.

There's not a ton of new stuff there, and much of it was already known, but there are a few nuggets here and there. Here's a summary of what I found.

They mention again the scalar "MulHi" unit that showed up on some Sunny Cove slides and got some mentions as a possible "additional unit" in ICL. It is described as follows:


4. “MulHi” produces the upper 64 bits of the result of an iMul operation that multiplies two 64-bit registers and places the result into two 64-bits registers.


This unit has always existed at least as far back as SNB (although it was on p1 there), so I'm not sure why it is getting mentioned now. As far as I know this doesn't represent any additional capabilities (the guide also doesn't say it is new as it does for many other units).
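
For reference, the operation in question is just the high half of a full 64x64 multiply; a minimal C sketch (mine, not from the manual) that compilers turn into a single MUL or MULX:

#include <stdint.h>

// 64x64 -> 128-bit multiply; the upper 64 bits are what the "MulHi" unit produces.
uint64_t mulhi64(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;
    return (uint64_t)(p >> 64);
}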

As already revealed, there is a new 256-bit shuffle unit on port 1, described as follows:

The “Shuffle” on port 1 is new, and supports only in-lane shuffles that operate within the same 128-bit sub-vector.


The unit is less general than that: it can't even do all in-lane shuffles like vpermil* and unpack shuffles (more details). I suspect this unit is using the otherwise unused hardware in the 512-bit shuffle unit on p5. When there are no 512-bit ops in the RS, that hardware goes unused (since only 256-bit shuffle ops can go to p5). When 512-bit ops are present, p1 is not available to vector ops (since p01 are fused into one 512-bit unit), so there is no conflict with say a 512-bit shuffle from p5 and a 256-bit shuffle on p1 both trying to use the unit at once. This kind of hardware sharing was already present on p01 with 256-bit FMAs.
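
To make the in-lane vs. lane-crossing distinction concrete, here's a quick sketch (mine, not from the guide) of the two categories; which specific in-lane shuffles actually run on p1 is exactly what the linked measurements are about:

#include <immintrin.h>

// In-lane: each 128-bit half is shuffled independently (vpshufd ymm),
// so in principle a candidate for the new p1 unit.
__m256i in_lane(__m256i v) {
    return _mm256_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));
}

// Lane-crossing: elements move between the two 128-bit halves (vpermd),
// so this can only go to the full shuffle unit on p5.
__m256i lane_crossing(__m256i v) {
    const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    return _mm256_permutevar8x32_epi32(v, idx);
}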

The guide says a feature of ICL is "Reduced effective load latency", but all the relevant latencies have stayed the same or increased. I assume by "effective" they mean the blended average latency over typical accesses, which could be lower due to larger caches, smarter prefetching, etc.

The IDQ has been increased to 70 uops from 64, and it is still fully dedicated at 70 entries per thread. That's mostly a nothingburger, but since this chip has the LSD enabled again, it matters if you care about which loops fit in the LSD. Here's the manual text:

The IDQ can hold 70 uops per logical processor vs. 64 uops per logical processor in previous generations when two sibling logical processors in the same core are active (2×70 vs. 2×64 per core). If only one logical processor is active in the core, the IDQ can hold 70 uops vs. 64 uops.


There is a new bypass delay table: "Table 2-3. Bypass Delay Between Producer and Consumer Micro-ops" - but it is identical to the Skylake table, so nothing has changed there.

We have a new cache bandwidth table:

[Image: ICL Cache Bandwidth table]

The 4-cycle L1 latency (which applied only in fairly specific circumstances) is dead: all L1-hit loads are 5 cycles in ICL. Note that the latency didn't increase uniformly by one cycle. That is, the latency didn't go from 4-5 to 5-6, but rather from 4-5 to 5: all loads that L1-hit are 5 cycles, so it's not as bad as a full cycle bump. At least we have some mental burden reduction to partly compensate for the increased latency – you can stop thinking about "complex addressing loads" on Intel.
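
The kind of kernel where this matters is a dependent pointer chase, where load-to-use latency is the whole story. A minimal sketch (mine; assumes p points into a chain of pointers, e.g. each element pointing to the next or to itself):

#include <stdint.h>

// Each load's address depends on the previous load, so steady-state time per
// iteration ~= L1 load-to-use latency. With the simple [reg] addressing mode
// this was the 4-cycle case on SKL; on ICL it should now measure 5.
uintptr_t chase(uintptr_t *p, long iters) {
    uintptr_t x = (uintptr_t)p;
    for (long i = 0; i < iters; i++)
        x = *(uintptr_t *)x;   // mov rax, [rax]
    return x;
}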

The first surprising thing is that the reported store configuration of 1x64B plus 1x16B doesn't seem right: instead you can do either 1x64B or 2x32B per cycle. More on this below.

The sustained bandwidth for the L1 is listed as "same as peak". If true, this is an improvement over SKX which is listed as 192 bytes/cycle peak but only 133 bytes/cycle sustained. Testing will have to confirm whether there is anything to that.

The value of 48 for L2 sustained bandwidth is interesting. This is a nominal downgrade over the figure of 53 for SKX. My testing shows that you can get ~64 bytes/cycle on SKX (i.e., same as peak) just fine, if L2 prefetching is turned off - so the 53 figure apparently comes from (unpredictable) prefetcher interference. Only testing will tell if the sustained figure of 48 is also due to prefetcher interference, or if something more fundamental is happening.

The L3 bandwidth of 32/21 peak/sustained is considerably better than the SKX figure of 16/15. In practice, I've measured much worse SKX bandwidth, usually closer to 6 bytes/cycle, but of course this is highly dependent on the test. Does anyone know if ICL is still using non-inclusive L3?
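
For what it's worth, the sort of kernel those sustained numbers correspond to is just a streaming read loop sized to the cache level you're probing; bytes touched divided by cycles elapsed gives the figure. A sketch (mine; assumes buf is 32-byte aligned and bytes is a multiple of 64):

#include <immintrin.h>
#include <stddef.h>

// Two 32B loads per iteration touch one 64B line; accumulate so the loads
// aren't optimized away. Size 'bytes' to fit in L1, L2 or L3 as needed.
__m256i read_bw_kernel(const char *buf, size_t bytes) {
    __m256i acc0 = _mm256_setzero_si256(), acc1 = _mm256_setzero_si256();
    for (size_t i = 0; i < bytes; i += 64) {
        acc0 = _mm256_add_epi64(acc0, _mm256_load_si256((const __m256i *)(buf + i)));
        acc1 = _mm256_add_epi64(acc1, _mm256_load_si256((const __m256i *)(buf + i + 32)));
    }
    return _mm256_add_epi64(acc0, acc1);
}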

The big reveal is in the "Paired Stores" section, which describes how the second store capability works. It is not fully general, unfortunately: only one store per cycle to the L1, but merging can sometimes combine two stores prior to L1 commit. The text:


Paired Stores
Ice Lake Client microarchitecture includes two store pipelines in the core, with the following features:
• Two dedicated AGU for LDs on ports 2 and 3.
• Two dedicated AGU for STAs on ports 7 and 8.
• Two fully featured STA pipelines.
• Two 256-bit wide STD pipelines (AVX-512 store data takes two cycles to write).
• Second senior store pipeline to the DCU via store merging.
Ice Lake Client microarchitecture can write two senior stores to the cache in a single cycle if these two
stores can be paired together. That is:
• The stores must be to the same cache line.
• Both stores are of the same memory type, WB or USWC.
• None of the stores cross cache line or page boundary.

In order to maximize performance from the second store port try to:
• Align store operations whenever possible.
• Place consecutive stores in the same cache line (not necessarily as adjacent instructions).


So you aren't going to get 2/cycle throughput for arbitrary stores with an L1-sized working set: you have to be able to organize consecutive stores into the same line.
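
To make the guideline concrete, here's a sketch (mine, not Intel's) of a store pattern that can pair vs. one that can't; assumes the destinations are 32-byte aligned:

#include <immintrin.h>
#include <stddef.h>

// Pairable: the two 32B stores in each iteration hit the same 64B line,
// so they are candidates to commit to the DCU in the same cycle.
void fill_paired(char *dst, size_t bytes, __m256i v) {
    for (size_t i = 0; i < bytes; i += 64) {
        _mm256_store_si256((__m256i *)(dst + i), v);
        _mm256_store_si256((__m256i *)(dst + i + 32), v);
    }
}

// Not pairable: consecutive stores land in different cache lines, so you're
// back to one senior store per cycle at L1 commit.
void fill_split(char *a, char *b, size_t bytes, __m256i v) {
    for (size_t i = 0; i < bytes; i += 32) {
        _mm256_store_si256((__m256i *)(a + i), v);
        _mm256_store_si256((__m256i *)(b + i), v);
    }
}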

The part about 256-bit store pipelines is interesting. It takes 2 cycles to perform a 64-byte store. Does this mean that the store buffer data slots are only 32 bytes wide, and so 64-byte stores take two entries? This would seem to complicate store forwarding for 64-byte stores/loads. Or perhaps the entries remain 64 bytes, but the store ports can only accept and write 32 bytes per cycle. This would presumably reduce the complexity of the bypass networks, since the two STD ports would only need to be wired up to accept 256 bits or something like that.

Two store ports will still help for "scattered" stores if the stream of stores is short enough that they can all be buffered: the stores will execute earlier, possibly unclogging some OoO resources and also helping resolve store-to-load conflicts earlier (i.e., avoiding the case where a load stalls due to address or data-unknown store).

One interesting question is how scatter works. The scatter throughput has approximately doubled (per instlat), so scatters can definitely take advantage of the pairing: the question is whether the stores to the same cache line have to be consecutive in the vector, or can appear anywhere in the vector. Intel leaves the door open for this optimization by specifying that non-overlapping stores within a single scatter may appear in any order. So, in principle, the CPU could order stores in a way which increases pairing. I would bet pretty strongly against it in ICL though.
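
For reference, the case in question looks like this (my example); whether the hardware actually reorders same-line elements to get pairing is the open question:

#include <immintrin.h>

// AVX-512 scatter: writes 16 dwords to base[idx[i]] (scale 4). Non-overlapping
// elements are allowed to complete in any order, which is what would permit
// pairing elements that fall in the same cache line.
void scatter16(int *base, __m512i idx, __m512i vals) {
    _mm512_i32scatter_epi32(base, idx, vals, 4);
}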

There is a whole big section on "Ice Lake Client Microarchitecture Power Management" which describes how cores within a package can have different P-states. This is a first for client processors, although Xeons have had this forever, haven't they? Anyway, it's there if you want to read it.

A note was added about a relatively obscure penalty when you have two conditional branches ending in the same eight-byte aligned block whose targets share the same lower six bits. This no longer applies in ICL, but I wasn't even aware of it until now. Text:



Avoid putting multiple conditional branches in the same 8-byte aligned code block (i.e, have their last
bytes' addresses within the same 8-byte aligned code) if the lower 6 bits of their target IPs are the
same. This restriction has been removed in Ice Lake Client and later microarchitectures.


The whole section 3.4.2.4 on the LSD has been re-written. It seems to use figures from ICL now, vs. the old section that had SNB/IVB and HSW figures. The new text confirms something that has been suspected for some time: loops are effectively "unrolled" in the LSD so there are relatively fewer slots wasted in the last allocation group (because allocation can't span the tail and head of the IDQ in the same cycle):



Assume a loop that qualifies for LSD has 23 μops in the loop body. The hardware unrolls the loop such
that it still fits into the μop-queue, in this case twice. The loop in the μop-queue thus takes 46 μops.
The loop is sent to allocation 5 μops per cycle. After 45 out of the 46 μops are sent, in the next cycle only
a single μop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern
repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number
of wasted slots during LSD.


Also, for the first time we have an explicit indication of how long the LSD takes to kick in: "~20 iterations" (from the bullet points).

The section about write-combining buffers includes the following text:


Beginning with Nehalem microarchitecture, there are 10 buffers available for write-combining.
Beginning with Ice Lake Client microarchitecture, there are 12 buffers available for write-combining.


We know that write-combining buffers and line fill buffers are the same thing: on modern Intel a WC buffer is just an LFB that is in "WC mode". I have found that the number of LFBs increased from 10 to 12 from HSW to SKL, and probably even further in CNL. If we assume that the number of WC buffers is exactly the number of LFBs, then this text and my tests are inconsistent. However, maybe only a fraction of the LFBs are available as WC buffers (e.g., because they need to be snoopable, but maybe general-purpose LFBs don't?). In any case, when I get access to some ICL hardware I will report on the apparent available MLP, which is the main thing affected by the LFB count.
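
As a reminder of where WC buffers show up in practice: non-temporal stores are the easy way to occupy them, since each partially filled 64B line sits in a WC buffer until it's flushed. A sketch (mine; assumes dst is 32-byte aligned and bytes is a multiple of 64):

#include <immintrin.h>
#include <stddef.h>

// Each 64B line being assembled by the NT stores occupies a WC buffer until it
// is written out, so the buffer count bounds how many lines can be in flight.
void nt_fill(char *dst, size_t bytes, __m256i v) {
    for (size_t i = 0; i < bytes; i += 64) {
        _mm256_stream_si256((__m256i *)(dst + i), v);
        _mm256_stream_si256((__m256i *)(dst + i + 32), v);
    }
    _mm_sfence();  // order the WC stores before anything that follows
}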

In 3.7.6.1 we get the first indication of the performance and behavior of "Fast Short REP MOVSB":

Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short
operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long.
Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H,
ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.


Later on, they give some specific figures for both short and long rep movsb:


With memcpy() on Ice Lake microarchitecture, using in-lined REP MOVSB to implement memcpy is as
fast as a 256-bit AVX implementation for copy lengths that are variable and unknown at compile time.
For lengths that are known at compile time, REP MOVSB is almost as good as 256-bit AVX for short
strings up to 128 bytes (9 cycles vs 3-7 cycles), and better for strings of 2K bytes and longer. For these
cases we recommend using inline REP MOVSB. That said, software should still branch away for zero byte
copies.


Now, I'll believe it when I see it, but if true it means that inlining rep movsb might become a thing again. Of course, stuff like this is how Intel can strike little blows at AMD: since software is often compiled with Intel tuning options, a choice to inline rep movsb would advantage Intel chips over AMD which presumably have worse performance for this pattern.
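
For concreteness, "inlining rep movsb" means something like this (my sketch, GNU inline asm); the zero-length branch follows the manual's own recommendation above:

#include <stddef.h>

static inline void *movsb_memcpy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    if (n == 0)          // manual: still branch away for zero-byte copies
        return ret;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    return ret;
}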

The 3-7 cycles figure for inline AVX-256 known-length moves seems a bit dubious to me - the best case on existing hardware is 32 bytes / cycle (so 1-4 cycles for 32-128 bytes), if aligned, and ICL should be even better with its 2x32B stores. I guess maybe one can get up to 7 cycles if the stores are unaligned and the compiler doesn't know the alignment to compensate, but the lower bound should probably be 1. Anyways, I'm excited to test this one.
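
For comparison, the known-length AVX-256 copy they're presumably measuring against is something like this branchless 128-byte case (my sketch): four 32B loads plus four 32B stores, which is where the 32 bytes/cycle best case comes from.

#include <immintrin.h>

// Fixed 128-byte copy: 4 x 32B loads + 4 x 32B stores, no loop, no branches.
void copy128(void *dst, const void *src) {
    __m256i a = _mm256_loadu_si256((const __m256i *)src + 0);
    __m256i b = _mm256_loadu_si256((const __m256i *)src + 1);
    __m256i c = _mm256_loadu_si256((const __m256i *)src + 2);
    __m256i d = _mm256_loadu_si256((const __m256i *)src + 3);
    _mm256_storeu_si256((__m256i *)dst + 0, a);
    _mm256_storeu_si256((__m256i *)dst + 1, b);
    _mm256_storeu_si256((__m256i *)dst + 2, c);
    _mm256_storeu_si256((__m256i *)dst + 3, d);
}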

Finally, ICL has flipped back to the old SNB-BDW behavior for AVX transition penalties.

Briefly, SNB through BDW would save the upper bits of a 256-bit reg when a legacy SSE instruction was executed, and then restore them when an AVX (more precisely, [E]VEX-encoded) instruction was executed. This is because SSE instructions preserve the upper bits, but you don't want every SSE instruction to act like a merge (dependent on the previous value of the destination). Every time you flipped between executing SSE and AVX instructions there was a penalty of tens of cycles to save or restore the upper bits, but other than that there were no per-instruction penalties during ongoing execution.

SKL used a radically different strategy. In this strategy there are no transition penalties during regular execution, but now, when the upper bits are dirty, *every* SSE instruction essentially does a blend with the previous value of the register. This is most obvious for things that would normally be dependency-breaking wrt the destination, like a mov or xor: they now depend on the previous value. The good news is this probably freed up ~16 physical vector registers which would otherwise be needed to store the dirty uppers.

ICL has apparently reverted back to the SNB-BDW strategy per table 14-3.
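
Either way, the practical mitigation is unchanged: make sure the uppers are clean before running legacy SSE code. A sketch of the standard pattern (compilers typically insert this automatically at function boundaries):

#include <immintrin.h>

void avx_work_then_return(float *dst, const float *src) {
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, _mm256_mul_ps(v, v));
    _mm256_zeroupper();  // clean uppers: avoids the save/restore on SNB-BDW/ICL
                         // and the per-SSE-instruction blends on SKL
}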









