ICL memory renaming?

By: Montaray Jack (none.delete@this.none.org), August 3, 2019 11:53 pm
Room: Moderated Discussions
ll (ll.delete.delete@this.this.gmail.com) on August 3, 2019 8:37 pm wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on August 2, 2019 6:43 pm wrote:
> > Here's an interesting block from the recent ICL InstLatX64 dump:
> >
> > 34 X86   :MOV r8,[m8]+MOV [m8],r8     L: 0.78ns=  1.0c T: 0.60ns= 0.77c
> > 35 X86   :MOV r16,[m16]+MOV [m16],r16 L: 9.24ns= 11.8c T: 0.35ns= 0.45c
> > 36 X86   :MOV r32,[m32]+MOV [m32],r32 L: 0.40ns=  0.5c T: 1.06ns= 1.35c
> > 37 AMD64 :MOV r64,[m64]+MOV [m64],r64 L: 0.44ns=  0.6c T: 0.61ns= 0.78c
> >
> > Assuming I understand the notation correctly, these are load-store pairs to the same location, where
> > the stored value comes from the prior load. Something like:
> >
> > mov eax, [rdx]
> > mov [rdx], eax
> > repeat...
> >
> > On dumps from existing Intel CPUs, or Zen and Zen2, these are always in the range of 10-20 cycles. However,
> > the results above show that except for the 16-bit variant, the times are 1 cycle or less. For the 32-bit
> > cases, the latency is less than the throughput. It seems there is some magic that recognizes the identical
> > memory location and allows this to resolve perhaps right at rename, similarly to an eliminated move.
> >
> > I haven't seen any hints of this before. Has anyone heard anything?
>
> The latency for Skylake is strange.
> For mov r64,[m64] + mov [m64],r64, the uops should be:
> load r64, [m64]
> sta
> std
> The sta and std can issue in the same cycle. The load has a 4-cycle latency; assume the sta and
> std also take four cycles. After the sta and std, the load of the next iteration can get its data through
> store-to-load forwarding, so one iteration should take at most 8 cycles, but the measured result is 19 cycles.
> Inst 34 X86 : MOV r8,[m8]+MOV [m8],r8 L: 2.79ns= 6.2c T: 0.23ns= 0.50c
> Inst 35 X86 : MOV r16,[m16]+MOV [m16],r16 L: 9.10ns= 20.1c T: 1.02ns= 2.25c
> Inst 36 X86 : MOV r32,[m32]+MOV [m32],r32 L: 8.64ns= 19.1c T: 1.81ns= 4.00c
> Inst 37 AMD64 : MOV r64,[m64]+MOV [m64],r64 L: 8.64ns= 19.1c T: 0.45ns= 1.00c
>
> The Ice Lake result is interesting; maybe it uses some technique like memory renaming.
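
To put numbers on ll's estimate, here is a rough sketch of the arithmetic. It assumes, as ll does, a 4-cycle load plus a 4-cycle store/forward step (both are assumptions from the quoted post, not measurements), and compares that 8-cycle bound with the latencies in the two dumps quoted above. I'm taking the second dump to be the Skylake one ll refers to; the function and variable names are made up for illustration.

```python
# Assumed per-step latencies taken from ll's post (not measured values).
LOAD_CYCLES = 4
STORE_FORWARD_CYCLES = 4  # sta/std issue together, then forward to the next load


def expected_loop_latency(load=LOAD_CYCLES, store_forward=STORE_FORWARD_CYCLES):
    """Cycles per iteration if the chain is load -> store -> forwarded load."""
    return load + store_forward


# Latencies in cycles, keyed by operand width in bits, copied from the
# InstLatX64 lines quoted in this thread.
measured = {
    "Skylake (as quoted by ll)": {8: 6.2, 16: 20.1, 32: 19.1, 64: 19.1},
    "Ice Lake": {8: 1.0, 16: 11.8, 32: 0.5, 64: 0.6},
}


def compare(measured_cycles, bound):
    """Return {width: measured - bound}; negative means faster than the model."""
    return {width: round(cycles - bound, 1) for width, cycles in measured_cycles.items()}


if __name__ == "__main__":
    bound = expected_loop_latency()
    for cpu, rows in measured.items():
        print(cpu, compare(rows, bound))
```

On these numbers the Skylake 16/32/64-bit cases sit about 11 to 12 cycles above the simple model, while the Ice Lake 8/32/64-bit cases sit far below it, which a plain load plus store-forward chain can't explain.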

There's a related comment about move renaming in a header from the GCC commit for Zen 2. Any ideas why the 16-bit case is slow?
Header: x86-tune-costs.h
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/config/i386/x86-tune-costs.h;h=8b963c07051d5d2f1aa11936b7a215a7a8265c89;hb=f15d6856b5bc517e425a5bd390bc76cbf9b26a89


/* reg-reg moves are done by renaming and thus they are even cheaper than
   1 cycle. Because reg-reg move cost is 2 and following tables correspond
   to doubles of latencies, we do not model this correctly. It does not
   seem to make practical difference to bump prices up even more. */
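
As I read that comment, the tables use a scale of twice the latency in cycles, with the reg-reg move fixed at 2. A toy sketch of why anything cheaper than one cycle doesn't fit that baseline (this is my own illustration of the scale, not GCC code, and the names are hypothetical):

```python
# Toy model of the cost scale described in the quoted comment; not GCC code.
REG_REG_MOVE_COST = 2  # baseline named in the comment


def table_cost(latency_cycles):
    """Cost on the comment's scale: double the latency in cycles."""
    return 2 * latency_cycles


def matches_baseline(latency_cycles):
    """True if this latency is what the baseline cost of 2 implies."""
    return table_cost(latency_cycles) == REG_REG_MOVE_COST


if __name__ == "__main__":
    # 1 cycle fits the baseline; a renamed move (under 1 cycle) does not.
    for latency in (1, 0.5, 0):
        print(latency, table_cost(latency), matches_baseline(latency))
```

A move eliminated at rename would come out below 2 on that scale, so keeping it at 2 treats it as a full one-cycle move, which is the mismatch the comment admits to.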