By: Patrick Chase (patrickjchase.delete@this.gmail.com), July 2, 2013 8:25 am
Room: Moderated Discussions
anon (anon.delete@this.anon.com) on July 1, 2013 12:32 am wrote:
> EduardoS (no.delete@this.spam.com) on June 30, 2013 1:26 pm wrote:
> > anon (anon.delete@this.anon.com) on June 30, 2013 10:41 am wrote:
> > > This is your only justification for your assertion that multiple producers in some
> > > highly competitive markets are spending effort on useless product changes?
> >
> > If you ignore half of my post...
>
> None of your post provided any other real evidence or logic.
>
> >
> > > How easy do you think it is to double floating point performance?
> >
> > Depends, what's the starting point? Doubling SIMD width or pipelining the FPU is pretty
> > easy, if your FPU is small compared to the rest of the core, it is also cheap.
>
> Intel doubled SIMD width in SandyBridge and had to redesign the pipeline to be PRF-based.
Taken on its own, this is a valid technical point. In a classic reservation-station-based Tomasulo machine like Nehalem you have to implement the following:
1. An all-to-all (from all functional units to all reservation stations) common result bus, sized to accommodate the largest possible result from each functional unit. Reservation stations use this to "capture" any virtual operands they may be waiting for.
2. Storage for the largest possible result value in each reservation station entry
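To make the "capture" mechanism concrete, here's a toy Python sketch of it. All names, entry counts, and widths are invented for illustration; real hardware does this with CAMs and latches per entry, not objects:

```python
# Toy model of result-bus "capture" in an RS-based (Tomasulo-style) backend.
# Each waiting entry snoops the common result bus; on a tag match it latches
# the broadcast value into its own storage -- which is exactly the per-entry
# result storage (item 2) that must be sized for the widest possible result.

from dataclasses import dataclass

@dataclass
class RSEntry:
    op: str
    src_tags: list   # producer tag per source operand, or None once captured
    src_vals: list   # captured operand values (the per-entry storage cost)

def broadcast(entries, tag, value):
    """All entries snoop the result bus and capture a matching result."""
    for e in entries:
        for i, t in enumerate(e.src_tags):
            if t == tag:
                e.src_tags[i] = None
                e.src_vals[i] = value   # value lives in the RS entry itself

def ready(e):
    """An entry can issue once every source operand has been captured."""
    return all(t is None for t in e.src_tags)

# Two entries both wait on the same producer (tag 7); one broadcast wakes both.
rs = [RSEntry("add", [7, None], [None, 3]),
      RSEntry("mul", [7, None], [None, 5])]
broadcast(rs, tag=7, value=10)
assert all(ready(e) for e in rs)
print([e.src_vals for e in rs])   # -> [[10, 3], [10, 5]]
```

Note that in a PRF-based design the `src_vals` field disappears: entries hold only pointers into the physical register file, and values are read out at issue time.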
The savings from doing this aren't as large as you might think, though, because to maintain the same latency/performance in a PRF design you have to implement forwarding networks that partially replicate the result bus. You come out ahead only inasmuch as you can "prune" those forwarding networks. For example, I believe that SB only implements forwarding within each functional unit "stack"; inter-stack dependencies go through the RF.
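A back-of-envelope comparison shows why widening operands hurts the RS design specifically. The entry and register counts below are assumptions for illustration, not Nehalem's or Sandy Bridge's real figures:

```python
# Rough storage comparison: RS entries must hold the widest possible value
# per source operand, so doubling SIMD width doubles that storage. A PRF
# design stores only small pointers per entry, so its RS cost is unchanged
# (the values live once, in the physical register file).

def rs_value_storage_bits(entries, srcs_per_entry, max_result_bits):
    # RS-based: each entry holds a full-width value per source operand.
    return entries * srcs_per_entry * max_result_bits

def prf_pointer_storage_bits(entries, srcs_per_entry, prf_regs):
    # PRF-based: each entry holds only an index into the register file.
    ptr_bits = (prf_regs - 1).bit_length()
    return entries * srcs_per_entry * ptr_bits

for width in (128, 256):
    rs  = rs_value_storage_bits(entries=36, srcs_per_entry=2,
                                max_result_bits=width)
    prf = prf_pointer_storage_bits(entries=36, srcs_per_entry=2,
                                   prf_regs=144)
    print(f"{width}-bit results: RS value storage {rs} b, "
          f"PRF pointer storage {prf} b")
```

With these assumed numbers, going from 128-bit to 256-bit results doubles the RS value storage (9216 to 18432 bits) while the PRF pointer storage stays at 576 bits, which is the asymmetry the AVX-width argument rests on.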
It is therefore possible that the move to 256-bit AVX was the final straw that pushed Intel to go to a PRF-based design in SB, as they would have otherwise had to double the sizes of some lanes of the result bus and the RS result storage. With that said, here are two other factors to consider:
1. Intel also used PRFs in their previous all-new high end microarchitecture (Pentium 4, which only had SSE), so they were headed in that direction long before AVX. I suspect that Merom and Nehalem used RS-based backends mostly because of their P6 heritage. SB was their first new high-end uarch since they abandoned P4 and therefore their first real chance to switch to PRFs (again).
2. Even people who do smaller vectors (…)

> > And then your core will costs a lot more than a simple A7 and still have the same
> > horrible integer performance, that's exactly what Linus were arguing against.
>
> But you said that designers and consumers prefer to pay for things which are
> not relevant to their workloads. If high FLOPS is one of those things, moving
> to a GPU-like core would be a cheap way to win useless benchmarks.
How do you think a GPU would perform on, say, SPECfp or even something like whetstone?