Haskell Compilation Improvement

By: Maynard Handley (name99.delete@this.redheron9.com), December 31, 2013 5:26 pm
Room: Moderated Discussions
Linus Torvalds (torvalds.delete@this.linux-foundation.org) on April 11, 2013 12:56 pm wrote:
> Eric Bron (eric.bron.delete@this.zvisuel.privatefortest.com) on April 11, 2013 11:10 am wrote:
> >
> > with modern cores like Ivy Bridge it's generally very frustrating to toy
> > with explicit prefetch since there is simply 0% speedup (no slowdown either)
> We've seen slowdowns in the kernel.
> Sometimes serious slowdowns.
> For example, some microarchitectures do TLB fills on prefetch (which you'd think makes sense since
> you often do page-crossing prefetches of pointers). But then they actually seem to have trouble
> with the NULL pointer and slow down because the TLB fill fails and it's not zero-cost at all due
> to some uarch stupidity, so you have to do a conditional jump to not prefetch the end of a list.
> And now you slow down because the branch predicts horribly badly for the common case of short lists,
> and your nice cache behavior where the prefetch didn't do anything actually suffers.
> And it's very annoying, because the prefetches probably made sense when they were added,
> and may still work fine on some machines. And on others, they are actively detrimental.
> Sure, you can play games with these things - like dynamically turn them into no-ops by having instruction
> rewriting (so that you don't have to have runtime conditionals etc). The people who advocate sw prefetching
> always have a ".. but but but you could.." excuse. They never seem to get the "and what are you giving me
> in return for all this wasted effort" argument. When most of the time it's zero upside down the line.
> In the end, I think almost every single time we added a prefetch instruction, it came back to bite us five years
> later and it got removed again. And most of the prefetches we still have are probably of negative actual worth,
> but they just haven't gotten removed, because nobody has bothered to do the performance analysis.
> And yes, as you say, some of them remain because they just don't hurt (the nice array based
> ones with tight loops have neither I$ issues nor the above kind of TLB load issues, but they
> also don't tend to have any wins, since hardware does it better anyway these days).
> Some of the prefetches are for things like "we know we are going to write to this, but the first access is a
> read, and if we do a write-prefetch we can avoid the shared state transition". So it's not actually for prefetching
> data per se, it's a hint to the cache state machine. And that may actually make sense (unlike actual *prefetching*
> it is not timing-sensitive), although I'd much rather see the OoO engine notice it on its own.
> Linus

My experience (long ago, on PPC) mostly matches Linus'. SW prefetch is one of those things that seemed like a good idea, but basically never got implemented properly on any actually existing CPU --- it always sucks in some fashion, or gets handled so differently in the next CPU that everything you did for the previous gen no longer works.
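
For anyone who hasn't written this sort of code, here is a minimal sketch of the two patterns under discussion: the guarded list-walk prefetch and the write-intent hint. It assumes GCC or Clang, whose `__builtin_prefetch(addr, rw, locality)` builtin is real; the `struct node` layout and both function names are made up for illustration and are not taken from the kernel.

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
    struct node *next;
    long value;
};

/* Sum a list, issuing a prefetch for the next node while working on the
 * current one. __builtin_prefetch is the GCC/Clang builtin:
 * (address, rw: 0 = read / 1 = write, locality: 0..3). */
long sum_list_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        /* The guard from the quoted post: skip the prefetch at the end of
         * the list so NULL is never prefetched. On short lists this extra
         * branch can itself predict badly. */
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0, 3);
        sum += n->value;
    }
    return sum;
}

/* The "hint to the cache state machine" case: the first access is a read
 * but a write follows, so request the line with write intent up front. */
void bump_counter(long *counter)
{
    __builtin_prefetch(counter, 1, 3);
    if (*counter >= 0)      /* first access is a read */
        *counter += 1;      /* followed by a write to the same line */
}
```

Whether either prefetch helps, does nothing, or hurts depends entirely on the microarchitecture, which is exactly the complaint. The builtin is only a hint, and the compiler may emit no instruction at all on targets that lack one.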

But that's just repeating what he's said. My real point is that the reason this is still even something of a live issue is perhaps that few people know the state of the art in HW prefetching, in either existing CPUs or in the literature. I'd compare this with branch prediction in, say, 1988, where there was a groping in the academic literature towards the right way to do things but the big breakthroughs in terms of high performance for achievable engineering effort had not yet been made.

gives a number of papers describing various ways one might implement data prefetching for the hard cases (i.e. walking complex data structures of various sorts). There are some impressive results that seem eminently doable, along with others that still seem impractical (requiring, e.g., 2MB+ of storage), but much of it seems to be ideas that haven't yet jelled into one or two implementation ideas that seem obviously right, or conceptual ideas that link the various frameworks together.
(It's worth noting that one of the big names in this field, Stephen Somogyi, is at AMD. You'd hope this might translate into a real product, a way for AMD to become relevant again in x86 --- or to turbocharge their ARM server offerings...)

Meanwhile we're hampered by lack of knowledge on the implementation side. We know that Intel and IBM run your obvious stride-based data prefetchers, and that they've grown in sophistication from simple one line ahead, to N-line positive stride, to N-line positive or negative stride, to crossing page boundaries, with multiple prefetchers active at once.
But there's a lot that "we" (AFAIK) don't know, for example the extent to which prefetchers interact intelligently with the memory controller (so they result in lower priority accesses), or whether POWER or x86 have moved on to anything smarter than stride-based.
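
To make "stride-based" concrete, here is a toy software model of a single stride-detecting stream. It is a sketch of the textbook idea only, not a description of any Intel or IBM design: the 64-byte line size, the confidence threshold, and all names (`stride_stream`, `stride_observe`) are assumptions chosen for illustration, and it ignores page boundaries and multiple concurrent streams.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES     64  /* assumed cache line size */
#define CONF_THRESHOLD 2   /* assumed: times a stride must repeat before issuing */

/* State for one tracked access stream (hypothetical layout). */
struct stride_stream {
    uint64_t last_line;   /* last cache line touched */
    int64_t  stride;      /* last observed stride in lines, may be negative */
    int      confidence;  /* how many times that stride has repeated */
    int      valid;       /* 0 until the first access has been seen */
};

/* Feed one demand address to the model. Once the stride has repeated
 * CONF_THRESHOLD times, write up to `degree` predicted line numbers
 * into out[] and return how many were written; otherwise return 0. */
size_t stride_observe(struct stride_stream *s, uint64_t addr,
                      unsigned degree, uint64_t *out)
{
    uint64_t line = addr / LINE_BYTES;
    size_t n = 0;

    if (!s->valid) {                /* first access: just remember it */
        s->last_line = line;
        s->stride = 0;
        s->confidence = 0;
        s->valid = 1;
        return 0;
    }

    int64_t delta = (int64_t)(line - s->last_line);
    if (delta == 0)
        return 0;                   /* same line again, nothing learned */

    if (delta == s->stride) {
        if (s->confidence < CONF_THRESHOLD)
            s->confidence++;
    } else {
        s->stride = delta;          /* new stride: start training over */
        s->confidence = 0;
    }
    s->last_line = line;

    if (s->confidence >= CONF_THRESHOLD) {
        /* N-line lookahead in the direction of the stride. */
        for (unsigned i = 1; i <= degree; i++)
            out[n++] = line + (uint64_t)(s->stride * (int64_t)i);
    }
    return n;
}
```

Feeding it addresses 0, 64, 128, 192 with a degree of 2 yields nothing for the first three accesses and then lines 4 and 5. A descending walk works the same way with a negative stride. A pointer chase through a heap-allocated list never builds confidence, which is why the hard cases need something smarter.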

And in ARM I think we're even more blind. I would guess that Apple can't get the IPC we see for Cyclone without data (and instruction) prefetching, but I've seen nothing about its possible details.

Meanwhile as a dark horse in the prefetching world, there is Yale Patt's favorite idea of abandoning the attempt at more and more aggressive OoO and switching to runahead computing, as in, e.g.
I personally would not be surprised to see this idea pop up implemented in ARM. Basically it's (IMHO) a way to do the same sort of amount of work as implementing SMT, but in a way that actually achieves something useful for most code. (While for ARM, the CPUs are so small that if you want more SMP, you might as well just replicate the cores.)

Apple seem most willing to be innovative in this space (and most willing to add a feature that adds to the cost of a CPU). Their macroscalar stuff (e.g. US patents 7,395,419 and 8,065,502) seems to be heading to reality, with trademarks now being filed, starting in 2012.
The patents are as incomprehensible as you'd expect patents to be, and seem to be something about HW to disambiguate and run multiple loops in parallel. But let's look past that and assume that what we actually have implemented on the CPU is multiple sets of state, just like SMT, along with some smarts that can gate writeback. Then we have essentially everything you need for runahead, with all the (supposed) benefits it brings you for prefetching.

Summary: I think it's true (and mostly agreed) that SW prefetching is dead. What's not widely known is the extent of interesting HW replacements, or the extent to which any of these are yet implemented.