By: Patrick Chase (patrickjchase.delete@this.gmail.com), August 21, 2013 12:27 pm
Room: Moderated Discussions
EduardoS (no.delete@this.spam.com) on August 21, 2013 12:02 pm wrote:
> Patrick Chase (patrickjchase.delete@this.gmail.com) on August 21, 2013 11:45 am wrote:
> > The A15 (and the P4 before it) is actually a terrific example of what happens
> > when you try to hit both high frequencies (16 gate delays per stage IIRC) and high
> > IPC. On paper it looks like a monster, but in the real world it's barely competitive
> > with simpler designs like A12 and Krait.
>
> I suspect P4 isn't all that modest and A15 isn't such a speed racer...
I was actually arguing that both A15 and P4 were attempts to create a fairly high-IPC core with aggressive clock rates, and that both of them arguably went too far. A15 is absolutely optimized to be a speed racer - 16 gate delays per stage is quite low, even if those are NAND delays rather than FO4 as some have suggested. The fact that current A15 instantiations haven't been *clocked* aggressively due to power constraints doesn't change the fact that the architecture has a deep pipeline as a result of a high target frequency.
Before anybody brings up modern x86 cores like SB/IB/Haswell - The optimal combination[s] of clock_rate/IPC are a function of transistor budget, process speed, and design "goodness"/refinement. Intel is able to hit high clocks with high real IPC for a few reasons:
1. As Linus never tires of pointing out, Intel's memory subsystems are historically very good, and that enables higher frequencies at any given IPC before the core starves.
2. SB/IB/Haswell all have L0 uop caches and the ability to restart immediately after a mispredict (they don't flush the ROB like many other microarchitectures). These combine to enable a low mispredict penalty in relation to the total pipeline depth.
3. Their L1 Dcaches are fast. Load->use latency is the same number of cycles as A15, despite clocking at up to twice the rate and having higher associativity. IMO Intel made a smart decision here by sticking with VIPT caches - The choice of a PIPT L1 in A15 continues to mystify me, unless they implemented color prediction (mechanically similar to way-prediction, but used to predict the correct color/alias in addition to the correct way) to hide the TLB lookup latency.
4. A whole host of other small refinements such as the Haswell "mov optimization" that we discussed earlier.
5. Intel's 22 nm process is fast enough that they are able to hit high clock rates even with a reported 24 FO4 per-stage combinational delay. This in turn keeps the pipeline depth within reason.
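To make the per-stage-delay arithmetic concrete, here is a rough sketch of how a stage budget maps to clock rate and pipeline depth. The function names, the 10 ps FO4 delay, the 240 FO4 of total logic, and the 4 FO4 of per-stage latch overhead are all made-up placeholders for illustration, not measured figures for any real process or core. It also treats both the 16 and 24 figures as FO4 delays, which may not hold for the A15 number.

```python
import math

def clock_ghz(fo4_per_stage, fo4_delay_ps):
    """Clock rate implied by a per-stage delay budget.

    Cycle time (ps) = FO4 delays per stage * delay of one FO4 inverter (ps).
    """
    cycle_ps = fo4_per_stage * fo4_delay_ps
    return 1000.0 / cycle_ps  # 1000 ps per ns, so 1000/ps gives GHz

def pipeline_stages(total_logic_fo4, fo4_per_stage, overhead_fo4=0.0):
    """Stages needed to fit a fixed amount of logic.

    overhead_fo4 is the part of each stage lost to latches/skew, so only
    (fo4_per_stage - overhead_fo4) of each stage does useful work.
    """
    useful = fo4_per_stage - overhead_fo4
    if useful <= 0:
        raise ValueError("per-stage overhead consumes the whole stage")
    return math.ceil(total_logic_fo4 / useful)

# Hypothetical inputs: same process (10 ps FO4), same logic (240 FO4),
# same latch overhead (4 FO4), two different per-stage budgets.
for budget in (24, 16):
    print(budget, "FO4/stage:",
          round(clock_ghz(budget, 10.0), 2), "GHz,",
          pipeline_stages(240, budget, 4), "stages")
```

Under these assumed numbers the tighter budget buys a higher clock only by making the pipeline much deeper, because the fixed latch overhead eats a larger share of each short stage.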
Just my $0.02...
-- Patrick