You can do two 4-cycle loads per cycle

By: Travis Downs (travis.downs.delete@this.gmail.com), September 18, 2018 12:29 pm
Room: Moderated Discussions
anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 11:53 am wrote:
> Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 10:58 am wrote:
> > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
> >
> > > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
> >
> > Yes, it can - at least on SKL and IVB (the two archs I tested on).
> ...
> If throughput isn't the problem and it only happens when the loads immediately follow each
> other then it might be something different. Maybe it's skipping the TLB lookup altogether.

I think it still needs the TLB lookup, and in fact the TLB lookup is still more or less on the critical path: the addresses here are arbitrary, and the physical tag from the TLB is needed to select the right way from the L1D set, whose access happens in parallel. I don't think any kind of way prediction is happening here: the pointer-chasing case doesn't lend itself to it, and the 4-cycle latency stays consistent even with "randomly" distributed addresses.
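
For concreteness, a randomized pointer chase of the kind I mean might look like the following sketch (buffer size, node size and the shuffling scheme are illustrative assumptions, not an exact harness):

    /* Sketch only: build an L1-resident buffer of cache-line-sized nodes,
       link them in a random cyclic order, then chase the chain.  Each load's
       address depends on the previous load's result, so the time per
       iteration measures load-to-use latency, not throughput.  The offset is
       zero (p = *p), so the base register by itself already names the page. */
    #include <stdlib.h>

    enum { NODE_SIZE = 64, NODES = 256 };   /* 256 * 64 B = 16 KiB, fits in L1D */

    void **build_random_chain(void)
    {
        char *buf = aligned_alloc(NODE_SIZE, NODES * NODE_SIZE);
        size_t order[NODES];
        for (size_t i = 0; i < NODES; i++)
            order[i] = i;

        /* Fisher-Yates shuffle so consecutive loads hit "random" lines/pages */
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        /* node order[i] points at node order[i+1], closing the cycle */
        for (size_t i = 0; i < NODES; i++)
            *(void **)(buf + order[i] * NODE_SIZE) =
                buf + order[(i + 1) % NODES] * NODE_SIZE;

        return (void **)(buf + order[0] * NODE_SIZE);
    }

    void *chase(void *p, long iters)
    {
        for (long i = 0; i < iters; i++)
            p = *(void **)p;                /* essentially mov rax, [rax] */
        return p;
    }

Timing chase() and dividing by the iteration count gives the per-load latency; the point above is that this stays at 4 cycles even though consecutive nodes land on different lines and pages.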

What I think happens is that it just sends the base register value (e.g., rax in the examples) to the TLB as soon as it is ready, without actually doing the address calculation. That value is exactly the one just loaded from L1D, so it is probably sitting on the bypass network. Skipping the address calculation is fine for the TLB lookup as long as the full address ends up in the same page as the base, because the TLB lookup cares only about the page, not the offset within the page.
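
Put differently, probing the TLB with just the base only returns the right translation when the base and the full effective address share a page. A minimal sketch of that condition, assuming 4 KiB pages:

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch: the early TLB probe that uses only the base register is correct
       exactly when the base and the real effective address (base + offset)
       fall in the same page.  4 KiB pages assumed. */
    static bool early_tlb_probe_ok(uint64_t base, uint64_t offset)
    {
        const unsigned PAGE_SHIFT = 12;
        return (base >> PAGE_SHIFT) == ((base + offset) >> PAGE_SHIFT);
    }

With a zero offset (plain [rax]) this always holds; with a small positive offset it holds unless the access happens to cross into the next page.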

This lets you start the TLB lookup earlier than normal since you can do it in parallel with the address calculation.

So the normal, 5-cycle path is: address calculation first, then the TLB lookup with the full address in parallel with the L1D set access, and then, in series again, way selection/tag check based on both the TLB result and the L1D access, like so:

                     /--> TLB lookup --------\
address calculation -                         --> way selection/tag check
                     \--> L1D set access ----/
The fast path, detected already at the RAT, I think, moves the TLB lookup to be in parallel with the address calculation, but this means it can't use the full address: it can only use the plain result of the prior load, without any offset, like so:

  /--> TLB lookup (no offset) -------------------------\
--                                                      --> way selection/tag check
  \--> fast address calculation ---> L1D set access ---/
This seems like it could easily shave a cycle off the TLB-lookup path, since the address calculation presumably takes a cycle. Furthermore, since the addressing mode is "simple" (no index register), it seems entirely possible that a cycle is also shaved off the address-calculation/L1D path, so even if both paths were on the critical path, a cycle is shaved off of each.
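
One way to poke at the addressing-mode part of that would be to run the same chase with a base-only mode and with an index register forced in. A sketch, assuming GCC/Clang on x86-64 (the inline asm and function names are illustrative):

    #include <stdint.h>

    /* Simple mode: the loop body is mov rax, [rax], which should be eligible
       for the fast path described above. */
    void *chase_simple(void *p, long iters)
    {
        for (long i = 0; i < iters; i++)
            p = *(void **)p;
        return p;
    }

    /* Indexed mode: force a load of the form mov reg, [reg + idx*1] with
       idx = 0.  Same chain, same addresses, but the "complex" addressing
       mode should fall off the fast path if the reasoning above holds. */
    void *chase_indexed(void *p, long iters)
    {
        uintptr_t zero = 0;
        for (long i = 0; i < iters; i++)
            __asm__ volatile("movq (%0,%1,1), %0"
                             : "+r"(p) : "r"(zero) : "memory");
        return p;
    }

If both effects are real, you'd expect roughly 4 cycles per load for the simple version and 5 for the indexed one on an L1-resident chain.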