# Mill *is* a speculation

> > If they can get their 33 or 37, can't remember which, instructions
> > per cycle all is well. Branch mispredictions
> > won't affect them significantly more or less than anyone
> > else. But they actually need to get that IPC. I can't
> > remember the exact numbers but I think it was something like
> > 8 ALUs, 8 load units, 4 store units on a Mill Gold.
> > So if we give them the benefit of the doubt, that they can
> > move instructions around in a way that is comparable
> > to OoOE, then assume linear scaling, so 8 ALUs instead of
> > 4 doubles the IPC, then add a bit on top of that because
> > they don't share ports we get maybe 3 times the IPC at 1/3
> > the clockrate. So it's not enough to get about the
> > same IPC per ALU that OoOE would get, the other slots that
> > bring up the count to 30+ need to be useful enough
> > to get it way past that or it won't actually get anywhere close to the 2x advantage they need.
> I am expecting average IPC to be about one order of magnitude
> smaller than the width. Michael expects IPC of about 2

Would you also expect a 4-wide (or 8-wide if you count the back-end) OoOE core to get 0.4 IPC?

If your argument is "it can't work as well as OoOE because it's not OoOE" then there's no need to bring width into the argument at all. Your premise is that it can't work, therefore it can't work: the perfect circular argument.

What I'm saying is that if it works as well as OoOE, and we give it the benefit of the doubt with linear scaling and ALU ports never being blocked, you still only get around 3x. How much that actually is depends on the code. An OoO architecture with an 8-wide back-end and a 4-wide front-end can't get 8 IPC either, but going from 3.x IPC to 10+ IPC with a back-end about 3 times as wide seems on the upper end of realistic expectations. It requires everything to work out and near-linear scaling, but it's not absurd. But it's the upper end, for when things work out. 3x IPC sounds great, but if it runs at 1/3 the clockrate then we're right back where we started.
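To make that arithmetic explicit, here is a minimal sketch. It assumes performance is simply IPC times clock rate, which ignores memory stalls and everything else; the function name is made up for illustration, and the 3x and 1/3 figures are the rough estimates from this post, not measurements.

```python
def relative_throughput(ipc_ratio: float, clock_ratio: float) -> float:
    """Instructions per second relative to a baseline core.

    Assumes throughput = IPC * clock, nothing else (a deliberate simplification).
    """
    return ipc_ratio * clock_ratio


# Optimistic case from the post: about 3x the IPC at about 1/3 the clockrate.
optimistic = relative_throughput(ipc_ratio=3.0, clock_ratio=1.0 / 3.0)
print(f"optimistic case: {optimistic:.2f}x the baseline")

# If the IPC gain is only 2x at the same clock ratio, it ends up slower.
weaker = relative_throughput(ipc_ratio=2.0, clock_ratio=1.0 / 3.0)
print(f"2x IPC case: {weaker:.2f}x the baseline")
```

Under these assumptions the optimistic case comes out at parity, and anything short of the full 3x IPC gain falls below it.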

So instead of the circular argument I'm saying that if things work out well then it only ends up being as fast as OoOE, which is nowhere near revolutionary since we already have that.

Or let's look at some simple math. What's the average percentage of ALU operations? 40%? 50%? The usual numbers thrown around are 50/20/10/20 for ALU/load/store/branch. So where do you end up with 8 ALUs? 16-20 IPC, theoretical max I'd say. Similarly Skylake could do 8 with its 4 ALUs and 8 ports, but the front-end is going to limit it to 6 or less. If there's a 3x difference in clockrate the IPC advantage melts away. So on some well-behaved 4+ IPC code running out of the uop cache I don't see Skylake or Zen being much slower. The ~2 IPC on OoOE cases are where it's at. Can you get 6+ IPC out of those at 1/3 the frequency? Possibly. Can you get 10+ or 12+ IPC? Just working as well as OoOE is not nearly enough for that, it needs to be way better. And that's where the "magic compiler" doubt kicks in.
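The same numbers as a rough sketch. It assumes the ALUs are the only bottleneck and that every ALU is busy every cycle, which is the theoretical ceiling and not a prediction; both function names are hypothetical, and the instruction mix, ALU counts and 1/3 clock ratio are the ballpark figures from this post.

```python
def alu_bound_ipc(num_alus: int, alu_fraction: float) -> float:
    """IPC ceiling if the ALUs are saturated and ALU ops are alu_fraction of all ops."""
    return num_alus / alu_fraction


def ipc_needed(baseline_ipc: float, clock_ratio: float, speedup: float) -> float:
    """IPC required at clock_ratio times the baseline clock to reach a given speedup."""
    return baseline_ipc * speedup / clock_ratio


# Ceiling for 8 ALUs with ALU ops at 50% or 40% of the instruction mix.
for fraction in (0.5, 0.4):
    print(f"8 ALUs, {fraction:.0%} ALU ops: {alu_bound_ipc(8, fraction):.1f} IPC max")

# Same bound for 4 ALUs at 50% ALU ops (the Skylake figure in the post).
print(f"4 ALUs, 50% ALU ops: {alu_bound_ipc(4, 0.5):.1f} IPC max")

# Starting from ~2 IPC on the OoOE core, at 1/3 the frequency:
for speedup in (1.0, 2.0):
    needed = ipc_needed(baseline_ipc=2.0, clock_ratio=1.0 / 3.0, speedup=speedup)
    print(f"{speedup:.0f}x the OoOE performance needs about {needed:.1f} IPC")
```

With these inputs, matching the ~2 IPC case takes about 6 IPC and doubling it takes about 12, which is already a large share of the 16-20 ceiling.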
