By: ex-apple (ex.delete@this.apple.com), October 28, 2014 7:22 pm
Room: Moderated Discussions
Brett (ggtgp.delete@this.yahoo.com) on October 28, 2014 6:24 pm wrote:
> A typical wide OoO is going to have two clusters of three or four ALU's,
> they are talking about LOTS more clusters, after all ALU's are tiny.
Yes, their PR material talks about a wide future! But the Linley paper reveals a more prosaic prototype - only two clusters, each "similar to an A15" in complexity. So maybe 4-6 ALUs in total and 8ish operations/cycle - similar to your wide OoO.
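For what it's worth, here's the back-of-envelope arithmetic behind those numbers as a tiny Python sketch. The per-cluster pipe counts are my own guesses for an "A15-class" cluster, not figures from the Linley paper.

```python
# Rough width estimate for the prototype, assuming each of the two clusters
# is roughly A15-class: 2-3 simple ALUs plus about one other issue slot
# (load/store/branch/FP). These per-cluster counts are assumptions, not
# numbers from the paper.
clusters = 2
alus_per_cluster = (2, 3)      # assumed range for an A15-class cluster
other_pipes_per_cluster = 1    # assumed extra issue slot per cluster

total_alus = (clusters * alus_per_cluster[0], clusters * alus_per_cluster[1])
peak_ops_per_cycle = clusters * (alus_per_cluster[1] + other_pipes_per_cluster)

print(f"ALUs: {total_alus[0]}-{total_alus[1]}")       # 4-6
print(f"peak ops/cycle: ~{peak_ops_per_cycle}")       # ~8
```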
They also saw diminishing returns - in simulation, a 4th cluster added only 10%-20% more performance. And in their prototype they reaped the expected IPC benefits of a low clock frequency - short pipelines, cheap mispredicts, and fewer cycles to cache and DRAM.
I didn't find evidence their IPC gains would scale to the high end. I think they'd like to get there, but they probably need more time and money - in other words, customers. For that, they're focusing on Android SoC makers, making the perf/watt argument.