4-cycle L1 latency on Intel not as general as thought

By: anon (spam.delete.delete@this.this.spam.com), September 18, 2018 10:53 am
Room: Moderated Discussions
Travis Downs (travis.downs.delete@this.gmail.com) on September 18, 2018 9:39 am wrote:
> anon (spam.delete.delete@this.this.spam.com) on September 18, 2018 2:43 am wrote:
>
> > Can it do 2 fast path loads in the same cycle? If not it would make sense to prioritize pointer chases.
>
> Good question, I'll test it.
>
> > It could also be bypass delay.
>
> Yes, perhaps there is an extra cycle of delay when a result produced by the ALU is consumed by the ALU - but
> it doesn't generally show up: that is, except for this special 4-cycle case, the latencies are all the same
> whether the dependency chains pass through the ALU or not (in particular, a pointer chase with a complex addressing
> mode takes the expected number of cycles if you add an ALU op: 5 + 1 = 6 for a latency 1 operation).
>
> > Not sure what the likelihood of a store to a calculated address being reloaded
> > immediately is but I can't imagine it would be very high.
> > Similarly for anything but a pointer chase the extra
> > cycle for a load shouldn't matter when you've got other dependency chains that can execute in parallel.
>
> I didn't follow what you meant by "store to a calculated address". Note that there
> are no stores to memory here, but perhaps you mean a different type of store.
>

I meant that in all other scenarios, including stores, the latency should matter much less.
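
(For concreteness, the pattern where the load latency is fully exposed is the classic pointer chase, roughly like the little C sketch below - just my own illustration, the struct and names aren't from anyone's actual test code.)

#include <stddef.h>

struct node { struct node *next; long payload; };

/* Every load's address comes from the previous load, so the full
 * L1 load-to-use latency (4 or 5 cycles) is paid on every step.
 * In code that isn't shaped like this, dependent work can usually
 * overlap other chains and an extra cycle matters much less. */
long chase(struct node *p, size_t steps) {
    long sum = 0;
    for (size_t i = 0; i < steps; i++) {
        sum += p->payload;
        p = p->next;        /* address depends on the prior load */
    }
    return sum;
}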

> About the latency not mattering other than pointer chasing, I think it is kind of true, kind of not.
> We like to think of stuff in a binary way like "on the critical path, latency matters 100% and
> off the critical path, latency doesn't matter at all", but in real-world code especially outside
> of small high-trip-count loops the situation I guess could best be described as "complex".
>

Everything is complex once you go into details. We can only work with simplifications here.

> Adding latency to a load, even if not in a pointer chasing scenario or part of a long dep chain
> still makes dependent operations available for execution later, which sometimes probably reduces
> ILP slightly, so in general you probably have some partial dependence on load latency even outside
> of pointer chasing, which is heavily dependent on the specific code. Maybe you see an average runtime
> impact of 0.2 cycles for every 1 cycle you increase the latency or something like that.
>
> > Like Peter mentioned it also only attempts this for displacements less than 2048,
> > so they probably did the math that this would get them a speedup on average.
> > If you assumed the worst case of 2047 displacement and random addresses, half the accesses would fall
> > within the same page, so at twice the latency for a replay it would be neutral. With the average
> > displacement between 0 and 2047 usually being far lower than 2047, it should easily be a win.
>
> I think something is off in the math there - yes, it's 2x the latency for a replay, but it's
> not zero latency for a successful fast path: you shave off only a cycle. So compared to the
> base case of 5 cycles, you are looking at a 4 (Haswell) or 5 (Skylake) cycle penalty versus
> a 1 cycle benefit. So the crossover point for Skylake is at 6 successful fast paths for every
> replay (ignoring the extra cost of the replay in terms of executed uops).
>

Yeah I cut out too much in an effort to condense it.
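
(To spell the corrected math out, a tiny back-of-the-envelope sketch in C - the 4/5/9/10-cycle figures are the ones from this thread, and treating replays as independent per load is my own simplifying assumption.)

#include <stdio.h>

/* Expected latency if the fast path is always attempted with
 * per-load replay probability p (assumed independent per load):
 *   E = 4*(1-p) + replay*p
 * versus the 5-cycle baseline of never attempting it. */
static double breakeven(double replay) {
    /* Solve 4*(1-p) + replay*p = 5  =>  p = 1 / (replay - 4) */
    return 1.0 / (replay - 4.0);
}

int main(void) {
    printf("Haswell (9-cycle replay): break-even p = %.3f\n", breakeven(9.0));   /* 0.200 */
    printf("Skylake (10-cycle replay): break-even p = %.3f\n", breakeven(10.0)); /* 0.167 */
    return 0;
}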

> That makes the choice of 2048 interesting, compared to say 1024. For offsets between 1024
> and 2048, using the "uniformly random accesses" assumption you wouldn't expect the fast path
> to pay off, on average. I have no doubt the number was carefully chosen however, probably
> through simulation - so there might be some hidden effect that makes page crossing less common
> than you'd expect - i.e., the "uniformly random" assumption may be wrong somehow.
>

Correct. If I'm not mistaken, for uniformly random accesses the chance of page crossings would be ~37%. If the displacements or memory positions are even slightly biased towards lower numbers, you can easily get it below the 20%/~16.6% needed to make it worthwhile, especially if you halve the penalty in most of the cases where it does happen, like SKL does.
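
(As an illustration of that halving effect, a rough model - my own sketch of the 10,5,10,5 pattern Travis measured, not a claim about the actual hardware mechanism: if a replay forces the next load down the normal 5-cycle path, then even in the pathological always-crossing case the average is 7.5 cycles instead of 10.)

#include <stdio.h>

/* Toy model of the Skylake behavior described above: after a failed
 * fast path (10-cycle replay), the very next load skips speculation
 * and takes the normal 5-cycle path, then speculation resumes.
 * "crossing" says whether a given load would cross a page. */
static double avg_latency(int n, int (*crossing)(int)) {
    int blocked = 0;        /* did the previous load replay? */
    long cycles = 0;
    for (int i = 0; i < n; i++) {
        if (blocked)          { cycles += 5;  blocked = 0; }
        else if (crossing(i)) { cycles += 10; blocked = 1; }
        else                  { cycles += 4; }
    }
    return (double)cycles / n;
}

static int always(int i) { (void)i; return 1; }

int main(void) {
    /* Every load crosses a page: 10,5,10,5,... averages out to 7.5 */
    printf("worst-case average: %.2f cycles\n", avg_latency(1000, always));
    return 0;
}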

> >
> > > I checked and AMD Ryzen doesn't have either of these weird behaviors: simple loads
> > > are 4 cycles even if the address comes from an ALU operation, and they are still
> > > 4 cycles when the base register and full address fall in different pages.
> > >
> > > The penalty on Skylake is 10 cycles, but in a tight loop made
> > > to trigger the issue on every load you get only 7.5
> > > cycles average: it seems that when the speculation failure
> > > and replay happens, the next load will be a normal
> > > 5-cycle load, but the one after that will speculate again
> > > and try to execute in 4 cycles, so you end up with
> > > alternating latencies of 10,5,10,5 which average out to 7.5. So the worst case is less bad in Skylake.
> >
> > All in one dependency chain? Again, Peter wrote that Skylake won't attempt the fast path if
> > it failed for the previous load. There's probably statistics behind this choice as well.
>
> Yes, the tests were in one dependency chain, although I'm not sure exactly how it works in practice
> (e.g., is the result itself tagged with the "replayed" tag or does the EU track it or something else).
> This was new for Skylake over Haswell, so maybe they wanted to partly remove the glass jaw that occurs
> if for some reason you got an interaction with an allocator such that many offsets cross a page. The
> slow-path loads increased from 9 to 10 cycles on Skylake so maybe they wanted to avoid any apparent
> regression in the worst case. Or perhaps cause and effect are reversed: maybe the extra cycle in
> the replay case is actually used to implement the use-default-path-if-fast-path-failed behavior.
>
> About statistics, sure - probably to decide whether to implement this at all. About how many
> loads to force into the 5-cycle path if a replay occurs, I'm not convinced that the value of
> 1 was chosen through stats: it's probably just an easy way to implement this, saving one bit
> of state. If you wanted to be really sophisticated you could use a simple IP-based predictor
> that learns which loads tend to have page-crossing behavior - but we're really talking about
> a small fraction of a fraction of a cycle here: almost certainly not worth the complication.
>

Oh, I wasn't talking about anything complicated like that. After all, just checking whether the high bits of the displacement are 0 was chosen because it's easy to implement, not because 2047 was the exact cutoff point. So just statistics as in "is adding an extra cycle to block speculation once after a failure a net win", or, if the added cycle exists for other reasons, "is the cheapest method enough to fix the regression".
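
(What "checking if the high bits are 0" amounts to, roughly - my sketch of the idea, not Intel's actual implementation: a displacement in [0, 2047] is exactly one whose bits above bit 10 are all zero, which only needs a zero-detect on the high bits rather than a real comparator or adder.)

#include <stdint.h>

/* Sketch: attempt the 4-cycle fast path only when the displacement
 * has no bits set above bit 10, i.e. 0 <= disp < 2048. A negative
 * displacement fails the test because its sign bit is set. */
static int may_try_fast_path(int32_t disp) {
    return (disp & ~0x7FF) == 0;
}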