By: Maynard Handley (name99.delete@this.name99.org), July 9, 2015 5:21 pm
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on July 9, 2015 4:15 pm wrote:
> Maynard Handley (name99.delete@this.name99.org) on July 9, 2015 10:04 am wrote:
> > David Kanter (dkanter.delete@this.realworldtech.com) on July 9, 2015 7:20 am wrote:
> > > Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 8:23 pm wrote:
> > > > Sylvain Collange (sylvain.collange.delete.delete@this.this.gmail.com) on July 8, 2015 10:32 am wrote:
> > > > > Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 9:46 am wrote:
> > > > > > BTW, seeing Andre Seznec's name there, does any commercial
> > > > > > processor yet implement a PPM or TAGE-like predictor yet?
> > > > >
> > > > > I am not aware of any official statement about a commercial TAGE implementation.
> > > > >
> > > > > But comparing Haswell's performance counters with the output of a TAGE simulator, we observe
> > > > > comparable branch misprediction rates on average. (http://hal.inria.fr/hal-01100647/)
> > > > >
> > > >
> > > > Inspired by this paper, I looked at the Geekbench3 Lua single core results (which
> > > > I assume are basically an interpreter, and thus as good a proxy as one can hope
> > > > for in figuring out this stuff). The results are very interesting.
> > > > A8 gets 1787/1.4 =1270 (score/frequency)
> > > > Sandy Bridge gets 4269/3.4=1255
> > > > Haswell gets 4325/3.3=1310
> > > > Nehalem gets 2284/3.2=713
> > > > (64-bit for everything except the Nehalem result where I could not find a 64-bit Windows result.
> > > > For some strange reason there are also no Broadwell 64-bit results yet; out of interest the
> > > > 32-bit result is 2693/2.3=1171, which perhaps we can take as indicating a 10% penalty for
> > > > 32-bit mode, giving us some feel for what a 64-bit Nehalem result might be.)
> > > >
> > > > It's merely a hint, not a proof, but it suggests that the intuition is correct (that is it gives the
> > > > expected big jump in performance for an interpreter from Nehalem to Sandy Bridge). It also suggests
> > > > that whatever Apple is using for their branch predictor it's pretty impressive.
> > >
> >
> > The branch predictor I am interested in (and which I am assuming is being demonstrated here)
> > is the indirect branch predictor. Naively giving that larger tables is not going to help much;
> > the problem is that while there is history to be exploited in an interpreter, it's much more
> > sophisticated history than naive "use the same target for this PC that was used last time".
>
> That's not how indirect predictors work. Typically they index based
> on the global history (i.e., path to arrive at the branch).
How do you know what is "typically" done in indirect predictors? Apple isn't telling us, and neither are Intel, ARM, Qualcomm, or most of the other usual suspects.
That's the entire point of my little exercise: to validate to MY satisfaction (if not, apparently, to anyone else's) that Apple (and Intel as of Sandy Bridge and later) are using a certain quality of indirect predictor, one which (IMHO) is driven by a TAGE-like set of partial-matching tables, rather than by simply making simple 2005-style tables 16x larger or whatever.
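
To make the distinction being argued about concrete, here is a toy sketch in Python. It is not a model of any shipping CPU, and it is not TAGE (there are no tagged tables of multiple history lengths, no finite table sizes, no aliasing). The function names, the opcode names, and the dispatch PC value are all made up for illustration. It only contrasts "same target as last time for this PC" with a table keyed on the recent path of indirect targets, using a single interpreter-style dispatch branch that walks a repeating bytecode sequence.

```python
from collections import deque


def last_target_accuracy(trace):
    """Predict each target as whatever this branch PC jumped to last time."""
    table = {}
    hits = 0
    for pc, target in trace:
        if table.get(pc) == target:
            hits += 1
        table[pc] = target  # always remember the most recent target
    return hits / len(trace)


def path_history_accuracy(trace, history_len):
    """Predict from a table keyed by (PC, last `history_len` indirect targets).

    The table is an unbounded dict, so this ignores capacity and aliasing.
    """
    table = {}
    history = deque(maxlen=history_len)
    hits = 0
    for pc, target in trace:
        key = (pc, tuple(history))
        if table.get(key) == target:
            hits += 1
        table[key] = target
        history.append(target)  # oldest entry falls off automatically
    return hits / len(trace)


def make_dispatch_trace(program, iterations, dispatch_pc=0x1000):
    """One shared dispatch branch jumping to each opcode handler in turn."""
    return [(dispatch_pc, handler)
            for _ in range(iterations)
            for handler in program]


# Hypothetical bytecode loop; no handler ever repeats back to back.
program = ["LOAD", "ADD", "LOAD", "MUL", "LOAD", "STORE"]
trace = make_dispatch_trace(program, iterations=1000)

print("last target per PC :", last_target_accuracy(trace))
print("1 target of history:", path_history_accuracy(trace, 1))
print("2 targets of history:", path_history_accuracy(trace, 2))
```

On this particular trace the last-target scheme never hits, one target of history gets roughly half (whatever follows LOAD is ambiguous), and two targets of history is essentially perfect after warm-up. That says nothing about what Apple or Intel actually built; it only shows why longer history, and not just bigger tables, is what an interpreter's dispatch branch rewards.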