By: Maynard Handley (name99.delete@this.name99.org), July 9, 2015 10:04 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on July 9, 2015 7:20 am wrote:
> Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 8:23 pm wrote:
> > Sylvain Collange (sylvain.collange.delete.delete@this.this.gmail.com) on July 8, 2015 10:32 am wrote:
> > > Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 9:46 am wrote:
> > > > BTW, seeing Andre Seznec's name there, does any commercial
> > > > processor yet implement a PPM or TAGE-like predictor?
> > >
> > > I am not aware of any official statement about a commercial TAGE implementation.
> > >
> > > But comparing Haswell's performance counters with the output of a TAGE simulator, we observe
> > > comparable branch misprediction rates on average. (http://hal.inria.fr/hal-01100647/)
> > >
> >
> > Inspired by this paper, I looked at the Geekbench3 Lua single core results (which
> > I assume are basically an interpreter, and thus as good a proxy as one can hope
> > for in figuring out this stuff). The results are very interesting.
> > A8 gets 1787/1.4=1276 (score/frequency)
> > Sandy Bridge gets 4269/3.4=1255
> > Haswell gets 4325/3.3=1310
> > Nehalem gets 2284/3.2=713
> > (64-bit for everything except the Nehalem result, where I could not find a 64-bit Windows result.
> > For some strange reason there are also no Broadwell 64-bit results yet; out of interest the
> > 32-bit result is 2693/2.3=1171, which perhaps we can take as indicating a roughly 10% penalty for
> > 32-bit mode, giving us some feel for what a 64-bit Nehalem result might be.)
> >
> > It's merely a hint, not a proof, but it suggests that the intuition is correct (that is, it gives the
> > expected big jump in performance for an interpreter from Nehalem to Sandy Bridge). It also suggests
> > that whatever Apple is using for their branch predictor it's pretty impressive.
>
> Yet again, you are committing the same error in order to show your favorite
> chip in the best possible light (i.e., normalizing by frequency).
David, do we have to fight about this every time?
I do the normalization by frequency because I am interested in the ALGORITHMS that are being used. I've said, on every occasion that the point comes up, that Intel's turbo performance is remarkable.
The relevant Atom x7 numbers are 1094/1.6=684. Give them a 10% boost (every benchmark I see is 32-bit Windows, I assume because testers are using the free version) and they're still vastly inferior to Apple. (The Atom is probably using the Nehalem algorithm.)
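To make the normalization concrete, here is a minimal C sketch of the calculation. The scores and nominal clocks are the Geekbench 3 Lua numbers above; the 10% 32-bit adjustment is my own rough estimate, not a measured figure:

#include <stdio.h>

/* Geekbench 3 Lua single-core score, nominal clock in GHz,
   and whether the run was 32-bit (estimated ~10% penalty). */
struct result { const char *chip; double score, ghz; int is32bit; };

int main(void) {
    struct result r[] = {
        { "A8",           1787, 1.4, 0 },
        { "Sandy Bridge", 4269, 3.4, 0 },
        { "Haswell",      4325, 3.3, 0 },
        { "Nehalem",      2284, 3.2, 1 },  /* 32-bit Windows run */
        { "Atom x7",      1094, 1.6, 1 },  /* 32-bit Windows run */
    };
    for (int i = 0; i < (int)(sizeof r / sizeof r[0]); i++) {
        double norm = r[i].score / r[i].ghz;              /* score per GHz */
        double adj  = r[i].is32bit ? norm * 1.10 : norm;  /* rough 64-bit estimate */
        printf("%-12s %6.0f  (64-bit estimate %6.0f)\n", r[i].chip, norm, adj);
    }
    return 0;
}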
Perhaps what Apple is doing is not quite as trivial, uninteresting, and deserving of contempt as you immediately think?
> You realize that Apple's branch predictor has 2.5-3X longer to make predictions than Intel's, right? Since
> most table accesses tend to grow like log(size), that means their prediction tables can be vastly larger.
The branch predictor I am interested in (and which I am assuming is being demonstrated here) is the indirect branch predictor. Naively giving it larger tables will not help much: an interpreter does have history that can be exploited, but that history is far richer than the naive "use the same target for this PC that was used last time".
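To see why, consider the shape of a bytecode interpreter's dispatch loop. This is an illustrative C sketch, not Lua's actual implementation:

#include <stdio.h>

enum opcode { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

/* Toy dispatch loop. The switch typically compiles to an indirect
   branch through a jump table. Its target depends on the *sequence*
   of bytecodes, not on the branch's own PC, so "same target as last
   time" predicts it poorly, while a history-based predictor
   (ITTAGE-style) can learn which recent-opcode patterns lead where. */
static void run(const unsigned char *code) {
    int stack[64], sp = 0;
    for (;;) {
        switch (*code++) {                  /* the hot indirect branch */
        case OP_PUSH:  stack[sp++] = *code++;          break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp-1]);    break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    const unsigned char prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);    /* prints 5 */
    return 0;
}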
To me it makes sense that Apple has fitted the best indirect branch predictor they could: from C++ to Swift to Objective-C, the languages they care about all make substantial use of indirect jumps.
There's a time for prognostication about who will do well in future, the consequences of business decisions, even cheering on your team.
There is ALSO a time for trying to understand the algorithms that are present in a particular chip. This is one of those latter times.
We still, for example, have no idea about the nature and quality of the prefetchers in either the A8 or Haswell/Broadwell. If I figure out a way to shed light on that, and if doing so involves normalizing by frequency, then that is how I will proceed, and I don't think it's something I have to apologize for. The goal is to understand the relevant chips.
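One plausible way to probe prefetchers (a hypothetical microbenchmark sketch, not a validated methodology): time a chain of dependent loads in sequential order, then in a random permuted order that defeats any stride prefetcher, over a buffer much larger than the last-level cache. The per-load gap between the two says something about how aggressive the prefetchers are:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M pointers, far larger than typical LLCs */

/* Walk a chain of indices; the data dependence serializes the loads. */
static size_t chase(const size_t *next, size_t iters) {
    size_t i = 0;
    while (iters--) i = next[i];
    return i;
}

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sequential chain: a stride prefetcher should hide most latency. */
    for (size_t i = 0; i < N; i++) next[i] = (i + 1) % N;
    double t0 = seconds();
    volatile size_t sink = chase(next, N);
    double seq = seconds() - t0;

    /* Shuffle into a random single-cycle permutation (Sattolo's
       algorithm), which defeats stride prefetching. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;            /* j < i keeps it one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    t0 = seconds();
    sink = chase(next, N);
    double rnd = seconds() - t0;

    printf("sequential: %.1f ns/load, random: %.1f ns/load\n",
           seq / N * 1e9, rnd / N * 1e9);
    (void)sink;
    free(next);
    return 0;
}

On a machine with aggressive prefetch the sequential walk should come out many times cheaper per load than the random one; comparing that ratio across the A8 and Haswell/Broadwell at matched buffer sizes is the kind of measurement I have in mind.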