You can do two 4-cycle loads per cycle

By: anon (spam.delete.delete@this.this.spam.com), September 18, 2018 1:27 pm
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 12:29 pm wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 11:53 am wrote:
> > Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 10:58 am wrote:
> > > anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
> > >
> > > > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
> > >
> > > Yes, it can - at least on SKL and IVB (the two archs I tested on).
> > ...
> > If throughput isn't the problem and it only happens when the loads immediately follow each
> > other then it might be something different. Maybe it's skipping the TLB lookup altogether.
>
> I think it still needs the TLB lookup, and in fact the TLB lookup is still more or less on
> the critical path since the addresses here are arbitrary and it needs the tag to select the
> right way from the L1D set, whose access happens in parallel. I don't think any type of way
> prediction is happening here since anyways the pointer chasing case doesn't lend itself to
> it and also the 4-cycle latency is consistent even with "randomly" distributed addresses.
>

Then how do you explain the restriction? What prevents the use of the fast path with registers that weren't the result of an earlier load?

> What I think happens is that it just sends the base register value (e.g., rax in the examples),
> which is exactly the value loaded earlier from L1D so probably sitting on the bypass network,
> to the TLB as soon as it is ready - without actually doing the address calculation: this is
> correct for the TLB lookup as long as the page ends up the same as the full address, because
> the TLB lookup cares only about the page, not the offset within the page.
>
> This lets you start the TLB lookup earlier than normal since
> you can do it in parallel with the address calculation.
>
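To pin down the condition being described: looking up the TLB with the bare base register gives the right answer only when the base and the full address (base plus displacement) fall in the same page. A minimal sketch of that check, assuming 4 KiB pages and 64-bit wraparound; the names `page_number` and `early_tlb_lookup_is_valid` are made up for illustration, and nothing here says how the hardware actually detects or handles a mismatch.

```python
PAGE_SHIFT = 12            # assumption: 4 KiB pages
MASK64 = (1 << 64) - 1     # assumption: 64-bit virtual addresses, wrapping on overflow


def page_number(addr: int) -> int:
    """Return the virtual page number, i.e. the address with the page offset dropped."""
    return (addr & MASK64) >> PAGE_SHIFT


def early_tlb_lookup_is_valid(base: int, disp: int) -> bool:
    """True if a TLB lookup on `base` alone hits the same page as `base + disp`.

    Hypothetical helper: it only models the page-equality condition from the
    quoted explanation, not any real hardware mechanism.
    """
    return page_number(base) == page_number(base + disp)


if __name__ == "__main__":
    print(early_tlb_lookup_is_valid(0x1000, 8))    # offset stays inside the page
    print(early_tlb_lookup_is_valid(0x1FF8, 16))   # offset crosses into the next page
```

A small displacement off a pointer that was just loaded will usually stay inside the page, which is why the shortcut would pay off for pointer chasing.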
> So the normal, 5-cycle path is: address calculation first, then TLB lookup with the full address in parallel
> with accessing the L1D set, and then selecting the way/checking tag again in serial based on both the
> TLB result and the L1D access, like so:
>
>                       /--> TLB lookup     --\
> address calculation --                        --> way selection/tag check
>                       \--> L1D set access --/
>
> The fast path, detected already at the RAT, I think,
> moves the TLB lookup to be in parallel with the address calculation, but this means it can't use
> the full address, it can just use the plain result of the load without any offset, like so:
>
>    /--> TLB lookup (no offset) ----------------------\
> --                                                     --> way selection/tag check
>    \--> fast address calculation --> L1D set access --/
>
> This seems like it could easily shave a cycle off of the latency for the TLB lookup,
> since the address calculation presumably takes a cycle. Furthermore, since the addressing mode is "simple"
> (not indexed), it seems entirely possible that a cycle is also shaved off of that path, so even if both
> calculations were on the critical path, a cycle is shaved off of each.
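For reference, the two pipelines quoted above reduce to a simple critical-path calculation. This is only a toy model: the per-stage cycle counts are invented so that the totals come out to the 5 and 4 cycles discussed in this thread, and the function names are hypothetical. They are not measured or documented stage latencies.

```python
def normal_path(agen: int, tlb: int, set_access: int, way_select: int) -> int:
    """Address calculation first, then TLB lookup and L1D set access in parallel,
    then way selection/tag check."""
    return agen + max(tlb, set_access) + way_select


def fast_path(fast_agen: int, tlb: int, set_access: int, way_select: int) -> int:
    """TLB lookup starts immediately on the bare base register, in parallel with
    the (possibly cheaper) address calculation followed by the L1D set access."""
    return max(tlb, fast_agen + set_access) + way_select


if __name__ == "__main__":
    # Illustrative stage costs in cycles (assumptions, not real numbers).
    TLB, SET_ACCESS, WAY_SELECT = 2, 2, 2
    print(normal_path(1, TLB, SET_ACCESS, WAY_SELECT))  # full address calculation
    print(fast_path(0, TLB, SET_ACCESS, WAY_SELECT))    # address calculation also a cycle cheaper
    print(fast_path(1, TLB, SET_ACCESS, WAY_SELECT))    # only the TLB lookup moved earlier
```

With these made-up costs, moving only the TLB lookup earlier gains nothing because the set-access branch becomes the critical one; the total drops only when the address calculation gets cheaper too, which is the "a cycle is shaved off of each" case.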

There is no need to explain that over and over again.