By: David Kanter (dkanter.delete@this.realworldtech.com), July 9, 2015 4:15 pm
Room: Moderated Discussions
Maynard Handley (name99.delete@this.name99.org) on July 9, 2015 10:04 am wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on July 9, 2015 7:20 am wrote:
> > Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 8:23 pm wrote:
> > > Sylvain Collange (sylvain.collange.delete.delete@this.this.gmail.com) on July 8, 2015 10:32 am wrote:
> > > > Maynard Handley (name99.delete@this.name99.org) on July 8, 2015 9:46 am wrote:
> > > > > BTW, seeing Andre Seznec's name there, does any commercial
> > > > > processor implement a PPM or TAGE-like predictor yet?
> > > >
> > > > I am not aware of any official statement about a commercial TAGE implementation.
> > > >
> > > > But comparing Haswell's performance counters with the output of a TAGE simulator, we observe
> > > > comparable branch misprediction rates on average. (http://hal.inria.fr/hal-01100647/)
> > > >
> > >
> > > Inspired by this paper, I looked at the Geekbench3 Lua single core results (which
> > > I assume are basically an interpreter, and thus as good a proxy as one can hope
> > > for in figuring out this stuff). The results are very interesting.
> > > A8 gets 1787/1.4 = 1276 (score/frequency)
> > > Sandy Bridge gets 4269/3.4=1255
> > > Haswell gets 4325/3.3=1310
> > > Nehalem gets 2284/3.2=713
> > > (64-bit for everything except the Nehalem result, where I could not find a 64-bit Windows result.
> > > For some strange reason there are also no Broadwell 64-bit results yet; out of interest the
> > > 32-bit result is 2693/2.3 = 1170, which perhaps we can take as indicating a 10% penalty for
> > > 32-bit mode, giving us some feel for what a 64-bit Nehalem result might be.)
> > >
> > > It's merely a hint, not a proof, but it suggests that the intuition is correct (that is, it gives the
> > > expected big jump in performance for an interpreter from Nehalem to Sandy Bridge). It also suggests
> > > that whatever Apple is using for its branch predictor, it's pretty impressive.
> >
> > Yet again, you are committing the same error in order to show your favorite
> > chip in the best possible light (i.e., normalizing by frequency).
>
> David do we have to fight about this every time?
Because you ignore the physical aspects of technology in favor of the logical. They aren't separate; they are tightly intertwined.
> I do the normalization by frequency because I am interested in the ALGORITHMS that are being used.
You realize that ALGORITHMS are determined by constraints such as cycle time, storage space, etc., right? A good algorithm at 1 MHz is not necessarily a good algorithm at higher speeds.
To take an extreme example, a processor that runs at 1 MHz could probably ignore branch prediction entirely and just compute the branch outcome directly. That's actually much more power-efficient than any predictor.
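To put numbers on it, here's a back-of-envelope sketch (the 1 ns branch-resolution figure is an illustrative assumption, not a measured value):

    #include <stdio.h>

    int main(void) {
        double freqs_mhz[] = { 1.0, 3300.0 };  /* 1 MHz toy core vs. ~3.3 GHz Haswell */
        double resolve_ns = 1.0;               /* assumed time to compute a branch outcome */
        for (int i = 0; i < 2; i++) {
            double cycle_ns = 1000.0 / freqs_mhz[i];
            printf("%6.0f MHz: cycle = %8.3f ns -> %s\n", freqs_mhz[i], cycle_ns,
                   resolve_ns < cycle_ns ? "resolve the branch directly" : "must predict");
        }
        return 0;
    }

At 1 MHz a cycle is 1000 ns, so the outcome is known long before the next fetch; at 3.3 GHz the budget is about 0.3 ns, and prediction is unavoidable.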
> I've
> said, on every occasion that the point comes up, that Intel's turbo performance is remarkable.
That's not the point. The point is that you have to consider the physical constraints on a logical design. It's not about Apple or Intel. Just think about IBM's z/Architecture processors, which tend to run at 5 GHz...that is a very serious constraint on things like TLBs and branch predictors. Of course, IBM uses 15 layers of metal to make up for it, but that doesn't exactly 'fix' the problem.
> The relevant Atom x7 numbers are 1094/1.6 = 684. Give them a 10% boost (because every
> benchmark I see is 32-bit Windows, I assume because they're using the free version)
> and they're still vastly inferior to Apple. (Probably using the Nehalem algorithm.)
> Perhaps what Apple is doing is not quite as trivial, uninteresting,
> and deserving of contempt as you immediately think?
I'm not saying that; I'm saying that it's probably inappropriate to compare processors when their frequencies are so different.
I actually would love to know what Apple is doing engineering-wise; I bet it's quite interesting, since their design space is so different.
> > You realize that Apple's branch predictor has 2.5-3X longer to make predictions than Intel's, right? Since
> > table access time tends to grow like log(size), that means their prediction tables can be vastly larger.
>
> The branch predictor I am interested in (and which I am assuming is being demonstrated here)
> is the indirect branch predictor. Naively giving that larger tables is not going to help much;
> the problem is that while there is history to be exploited in an interpreter, it's much more
> sophisticated history than naive "use the same target for this PC that was used last time".
That's not how indirect predictors work. Typically they index the target table with global history (i.e., the path taken to reach the branch).
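Here's a minimal sketch of the idea, a history-indexed target table in the spirit of Chang, Hao, and Patt's target cache (the table size, the hash, and the update rule are illustrative assumptions, not a description of any shipping core):

    #include <stdint.h>
    #include <stddef.h>

    #define TBL_BITS 12
    #define TBL_SIZE (1u << TBL_BITS)

    static uint64_t ghist;              /* global path-history register */
    static uint64_t targets[TBL_SIZE];  /* last target seen for each slot */

    static size_t slot(uint64_t pc) {
        /* Folding history into the index means the same indirect branch
         * maps to different entries on different paths to it. */
        return (size_t)((pc ^ ghist) & (TBL_SIZE - 1));
    }

    uint64_t predict(uint64_t pc) {
        return targets[slot(pc)];
    }

    void train(uint64_t pc, uint64_t actual_target) {
        targets[slot(pc)] = actual_target;
        ghist = (ghist << 4) ^ (actual_target & 0xf);  /* fold target bits into history */
    }

For an interpreter's dispatch branch, that is exactly the history there is to exploit: the last few bytecodes executed are encoded in ghist, so each common bytecode sequence gets its own entry and its own predicted target.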
> To me it makes sense that Apple has fitted the best indirect branch predictor they could. From C++ to
> Swift to Objective-C, they are interested in languages that use a substantial number of indirect jumps.
Yes, that would make sense.
> There's a time for prognostication about who will do well in future,
> the consequences of business decisions, even cheering on your team.
> There is ALSO a time for trying to understand the algorithms that are
> present in a particular chip. This is one of those latter times.
> We still, for example, have no idea about the nature and quality of the prefetchers in either
> the A8 or Haswell/Broadwell; and if I figure out a way to shed light on that, and if that
> involves normalizing by frequency, that's how I will proceed, and I don't think that's something
> I have to apologize for. The goal is to understand the relevant chips.
Maybe start by learning about indirect branch predictors?
Prefetchers can be tested using various data access patterns.
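For example, here's a minimal sketch of one such probe; the buffer size and strides are illustrative, and a serious harness would pin the thread, warm up, and repeat runs:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64u * 1024 * 1024)  /* 64 MB, well past the last-level cache */

    /* Time a strided walk and report nanoseconds per access. */
    static double ns_per_access(volatile unsigned char *buf, size_t stride) {
        struct timespec t0, t1;
        unsigned sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i += stride)
            sum += buf[i];  /* volatile load, so it can't be optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sum;
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return s * 1e9 / (double)(N / stride);
    }

    int main(void) {
        unsigned char *buf = malloc(N);
        for (size_t i = 0; i < N; i++) buf[i] = (unsigned char)i;  /* touch every page */
        size_t strides[] = { 64, 128, 256, 1024, 4096 };
        for (int i = 0; i < 5; i++)
            printf("stride %5zu B: %6.1f ns/access\n",
                   strides[i], ns_per_access(buf, strides[i]));
        free(buf);
        return 0;
    }

Strides the prefetch engine covers show flat, low per-access times; once the stride exceeds what it tracks (or crosses 4 KB page boundaries, which many prefetchers won't follow), the time jumps toward full memory latency. Running the same patterns on an A8 and on Haswell would say a lot about their relative prefetcher coverage.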
David