ICL memory renaming?

By: Travis Downs (travis.downs.delete@this.gmail.com), August 4, 2019 4:51 pm
Room: Moderated Discussions
anonlitmus (anon.delete@this.litmus.org) on August 4, 2019 3:20 pm wrote:

> Well, sure, you are correct. My point here is just that if the claim is that there is some memory renaming
> going on and it is implemented through the rename table like move elimination then I would argue that it appears
> equally hard to do it for 8 and for 16-bit registers just by virtue of x86 partial register semantics.

Yes, it may be equally hard for 8-bit and 16-bit, but it is still possible, and they did it only for 8-bit and not for 16-bit. There are other optimizations like this that they've done for 8-bit but not for 16-bit, despite similar "merge" semantics.

See for example this whole thing which documents how the 8-bit registers work on current and past chips. In some cases they jumped through a lot of hoops to make certain patterns of 8-bit code fast: nothing was stopping them from doing it for 16-bit code *too*, but it probably wasn't worth it.

> So I
> don't really see how the 8-bit version would benefit from it if the 16-bit version can't, that's all.

Imagine that they only have the budget for a certain number of "slots" to symbolically cache past stores: perhaps these slots are segregated by size (after all, you need to match sizes, or perhaps allow some more general cases like a smaller load from a wider containing store). Then you might not want to waste any hardware on 16-bit slots.
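To make the "slots segregated by size" idea concrete, here is a toy sketch. Everything in it is hypothetical: the class name `StoreSlotCache`, the string form of the symbolic addresses, the physical register labels and the oldest-first eviction are all assumptions for illustration, not a description of what the hardware actually does.

```python
# Toy model only: the real mechanism is unknown, all names are made up.
from collections import OrderedDict


class StoreSlotCache:
    """Small per-size cache of recent stores, keyed by symbolic address."""

    def __init__(self, slots_per_size):
        # e.g. {1: 2, 4: 2, 8: 2}; leaving out size 2 models "no 16-bit slots"
        self.capacity = dict(slots_per_size)
        self.slots = {size: OrderedDict() for size in self.capacity}

    def record_store(self, sym_addr, size, src_phys_reg):
        """Remember which physical register was stored; False if no slots exist."""
        table = self.slots.get(size)
        if table is None:
            return False  # no hardware budget for this store size
        table.pop(sym_addr, None)  # a newer store to the same address wins
        table[sym_addr] = src_phys_reg
        if len(table) > self.capacity[size]:
            table.popitem(last=False)  # evict the oldest tracked store
        return True

    def lookup_load(self, sym_addr, size):
        """Return the store's source register for a same-size load, or None."""
        table = self.slots.get(size)
        if table is None:
            return None
        return table.get(sym_addr)
```

With slots for sizes 1, 4 and 8 only, a 16-bit store is simply never tracked, so a later 16-bit load falls back to the normal store forwarding path. The sketch ignores invalidation when a base register changes, which any real scheme would need.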

> Say I store al to [X], then load [X] into bl. Can I shortcircuit it through the rename map? Not really because
> the physical register that is mapped to al may actually be mapped to ax and also have ah, and I do not want bh
> to have the same value as ah. Correct me if that is not legit x86 semantics, I might have misunderstood.

Well, first it depends on how you load. movzx (except 16-bit movzx!) is handled just as efficiently as mov for 8-bit regs. So movzx eax, byte [m8] is a very common way to load when you want the upper bits cleared (or when you just don't care - better to zero the uppers since it breaks the dependency). movzx is probably considerably more common than mov for byte loads in compiler-generated code (some examples - oddly clang prefers a plain mov for the first example, probably a missed optimization).
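The difference between the two kinds of byte load can be sketched at the value level. This is a minimal model of the architectural result only, with hypothetical function names; it says nothing about uops or latency.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit unsigned values


def load_byte_mov(old_reg, mem_byte):
    """mov bl, [m8]: replace bits 7:0 and keep bits 63:8 of the old register."""
    return (old_reg & (MASK64 ^ 0xFF)) | (mem_byte & 0xFF)


def load_byte_movzx(old_reg, mem_byte):
    """movzx ebx, byte [m8]: the byte zero-extended; the old value is unused."""
    return mem_byte & 0xFF
```

The plain mov result depends on `old_reg`, which is the dependency on the upper bits; the movzx result does not, which is why zeroing the uppers breaks the dependency.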

Second, even if you do a plain load which has to leave the upper bits intact, the register might internally act as if it was 8 bits since it has been separately renamed as an optimization. See the SO link above for all the gory details: on modern Intel this doesn't seem to happen for the low 8-bit regs, but it does for the high ones. The low 8-bit regs use fast merge, which usually means they are no slower in latency and take no more uops than the versions that can zero - but they still have a dependency on the upper bits.

So then, assume you are using a plain load, and you know this renaming stuff isn't taking effect (e.g., because you are using the low 8). Even then I think you can do the rename shortcut, sure. The merging semantics don't affect the memory renaming part: you can still identify a forwarding load through whatever black magic address matching is happening, but you are correct that you can't just rename the load destination to the store source. Instead you have to merge the low 8 bits, but this happens in a single cycle, with a single p0156 uop, so it is still really cheap. It does mean that you have a dependency on the old value of the full reg, sure, so a loop like the Instal one would be limited to 1 cycle per iteration, and in fact that's what we see: 1.0c for the 8-bit version versus ~0.5c for the 32-bit and 64-bit versions.
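As a sanity check on the merge argument, the sketch below compares the store al / load bl sequence done through memory against a hypothetical rename-time shortcut that skips memory and merges directly from the store's source register. The function names and the single-merge model are assumptions; the point is only that the architectural result is identical and that bh is never replaced by ah.

```python
MASK64 = (1 << 64) - 1  # registers are modelled as 64-bit unsigned values


def via_memory(rax, rbx):
    """Reference path: mov [X], al ; mov bl, [X] going through memory."""
    memory = {"X": rax & 0xFF}  # the store writes exactly one byte
    return (rbx & (MASK64 ^ 0xFF)) | memory["X"]  # the load merges it into bl


def via_rename_shortcut(rax, rbx):
    """Hypothetical shortcut: no memory access, one merge of two registers."""
    store_source = rax  # value of the physical register that holds al
    return (rbx & (MASK64 ^ 0xFF)) | (store_source & 0xFF)


def high_byte(reg):
    """Bits 15:8 of a register, i.e. ah or bh."""
    return (reg >> 8) & 0xFF
```

Both paths take the old rbx as an input, which is the loop-carried dependency that would limit the 8-bit case to one merge per iteration.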

So yeah I think you can do mem renaming with 8-bit regs just fine but the details differ.

> I suppose they might have a way to identify the cases where the size of the logical producer
> is the same as the logical consumer (that might already be required to detect partial writes
> and do the merges anyway). But then, 16-bit just sounds free if you are going to do 8-bit.

Well, the store forwarding network definitely has complex mechanisms to identify that type of stuff, including efficiently forwarding fully-contained narrower loads and other cases that used to be handled poorly - but that stuff is kind of slow and is inconsistent with a 0.5c latency for a store/load loop. So this must be handled earlier, perhaps in the front end or rename. You might have a much lower budget there for handling a lot of cases, so there will probably be only very specific cases where this works.
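For reference, the "fully-contained narrower load" case that the store forwarding network handles can be sketched as a simple range check plus a byte extract. This is a minimal model assuming little-endian byte order and concrete addresses; the function names are hypothetical and it does not model the timing of real hardware.

```python
def load_contained_in_store(store_addr, store_size, load_addr, load_size):
    """True if every byte of the load lies inside the bytes written by the store."""
    return (store_addr <= load_addr
            and load_addr + load_size <= store_addr + store_size)


def forward_bytes(store_addr, store_value, store_size, load_addr, load_size):
    """Return the load's value taken from the store data, or None if not contained."""
    if not load_contained_in_store(store_addr, store_size, load_addr, load_size):
        return None  # would have to wait for the store to reach the cache
    shift = 8 * (load_addr - store_addr)  # little-endian byte offset
    return (store_value >> shift) & ((1 << (8 * load_size)) - 1)
```

A rename-stage shortcut with a small budget would presumably support far fewer cases than this, for example only an exact match on address expression and size.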
