By: anon (anon.delete@this.anon.com), August 11, 2014 4:10 am
Room: Moderated Discussions
Aaron Spink (aaronspink.delete@this.notearthlink.net) on August 11, 2014 12:48 am wrote:
> anon (anon.delete@this.anon.com) on August 10, 2014 9:50 pm wrote:
> > It is reasonable circumstantial evidence, when you look at a wide selection of devices.
> >
> No it is not at all reasonable circumstantial evidence. It ignores so many
> things, including delivered performance, that it is all but meaningless.
It doesn't ignore delivered performance.
>
>
> > Intel has clearly had a history of struggling with decode in a way that no non-x86 designs (ignoring exotic
> > or ancient ones I don't know much about, like mainframes or VAX) have.
> >
> Where is this clear history of struggling with decode?
From trace cache and uop cache.
> There is none. For instance, average
> IPC estimates and median IPC estimates for code suggest that much more than 2-3-wide decode is already
> well into diminishing returns.
Those are always taken out of context. Whole-program averages are almost worse than nothing when you're considering the design of particular components. Also, *many* things are well into diminishing returns, for some value of diminishing (i.e., balanced properly against increasing costs).
You always want to race to the next cache-missing load as fast as possible.
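To put a rough number on that, here is a toy back-of-the-envelope model (a minimal sketch in Python, with made-up burst lengths and miss latencies, purely illustrative): a stream of bursts of independent instructions, each burst ended by a cache-missing load, where the front end has to chew through the burst before the next miss can even be issued.

def cycles(decode_width, burst_len=64, miss_latency=100, bursts=1000):
    # Front end delivers decode_width instructions per cycle until it
    # reaches the cache-missing load that ends the burst, then the core
    # sits out the miss latency. Real OoO cores overlap more than this,
    # so the absolute numbers are pessimistic; the shape is the point.
    time_to_reach_load = burst_len / decode_width
    return (time_to_reach_load + miss_latency) * bursts

for width in (2, 3, 4):
    t = cycles(width)
    print("%d-wide: %.0f cycles, average IPC = %.2f" % (width, t, 64 * 1000 / t))

In that toy case, 4-wide comes out about 14% faster than 2-wide even though the measured whole-program IPC stays around 0.5. That is exactly the kind of win an average IPC figure hides: the average says decode width doesn't matter, the race to the next miss says it does.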
> There are some workloads that can benefit from wider decode, but
> those are a minuscule portion of the workloads that are seen, especially on mobile or desktop processors
> and even on server processors. About the only area that can readily make use of wider decode is
> HPC, and even those workloads generally have better avenues of performance improvement.
>
> > Pentium4 - invested vast efforts into an incredibly complex and ultimately
> > failed trace cache in order to get away with just a serial x86 decoder.
> >
> > 1st Atom core - 2-way decoder on SMT2 device! ARM contemporaries had 2-3x the decode bandwidth per thread.
> >
> Neither of these supports your argument. P4 was trying a new direction based on the data
> available on the benefit of a trace cache.
Based on the benefit of a trace cache **AND** the cost of decoding. The two are obviously taken together.
I refuse to believe that anybody at Intel ever thought the trace cache would be a walk in the park. They may have overestimated their ability to overcome the obvious problems, but they were clearly considering it in the context of improving the decode stages by other means. Ergo they were struggling with decode.
> They may have overestimated its benefit,
> but it was data driven and not ideologically driven like your argument is.
>
> Atom was a simple in-order core with SMT designed in as a way to increase efficiency
> during stall conditions. It is a pretty logical design point really.
>
> > Yonah - decoded loop buffer far earlier than others. Most x86 designs now include something like this.
> >
>
> Most designs, period, include loop buffers these days. Even RISC-based designs
> with limited decode and RISC-based designs with lots of decode. Loop buffers
> are in general an efficiency increase regardless of decode width.
Fair enough, I was corrected on A9.
>
>
>
> > uop cache - complex additional instruction cache layer to avoid
> > the fetch and decode stages. It has been in Intel's higher-performance
> > cores for many years, but there is no indication yet of non-x86
> > implementations using it. Intel has never had wider
> > than 3-way decode without some kind of decoded instruction
> > caching. The old chestnut that non-x86 high-end devices
> > require a nuclear power plant to run has, of course, not been true for many years now either.
> >
> The uop cache is a decoupling mechanism between fetch/decode and dispatch/execute.
That's disingenuous. It's much more than that.
> It's not the first one in a commercial
> processor either. Once again, it's a general performance/efficiency win, not at all restricted to x86.
What's your evidence for that?
>
>
> > It's not hard evidence, but it beats handwaving. I'm not a chip designer, so I have no credibility or
> > position to say whether x86 decoding is difficult based on experience, so I look at other evidence.
> >
>
> Except what you are using as evidence is basically handwaving.
>
Better than handwaving based on nothing.