By: juanrga (nomail.delete@this.juanrga.com), April 8, 2021 12:15 am
Room: Moderated Discussions
Chester (lamchester.delete@this.gmail.com) on April 7, 2021 9:19 am wrote:
> > > > Well the 8380 is only 10% faster single-threaded because of the large L3.
>
> Good caching is a huge part of CPU microarchitecture, and I give Intel credit for making a bigger,
> lower latency L3 than Ampere. So why didn't Altra make a large L3 and catch up in ST perf?
>
> > > Well Intel's process technology has stalled badly for a lot longer than two years so of course it's not
> > > great progress. What is more concerning for the ARM fanclub is that AMD does pretty damn well when the
> > > process technology is equivalent. Strange, there must be a nasty ARM tax slowing them down to the point
> > > where they're not able to take advantage of the huge alleged x86 tax slowing down AMD so much.
> >
> > I am not sure you understand the concept of x86 tax. It is not about "slowing down", it is about size.
> >
> > https://www.hpcwire.com/2020/03/17/marvell-talks-up-thunderx3-and-arm-server-roadmap/
> >
> >
> > Just to give you an idea, in the previous generation,
> > if you look at ThunderX2, compared to AMD or Skylake,
> > for the same process node technology [we get] roughly 20%
> > to 25% smaller die area. That translates into lower
> > power. When we move to 7nm with ThunderX3, our core compared to AMD Rome’s 7nm is roughly 30% smaller.
>
> Or maybe, Rome is less dense because the same core goes to desktop and clocks
> beyond 4 GHz. For some reason, high clocking designs tend to be less dense.
Rome uses the 7HPC node to achieve higher clocks; a Cortex-A72 achieves 4.2 GHz on the same node. The comment made by Marvell is about the size difference "for the same process node technology".
> Zen 2 also has 256-bit vector registers, more scheduler entries, twice the L1D bandwidth, and can sustain
> higher IPC (5 instrs or 6 micro-ops/clk, where ThunderX3 is limited at rename to 4 micro-ops/clk).
That is a peak, not the average IPC. TX2 had a higher average IPC than Zen, and whereas Zen 2 increased IPC by ~15%, TX3 increased it by ~25%. So TX3 would have a higher IPC than Zen 2.
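As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the two generational uplifts (~15% and ~25%) simply compound on top of the previous-generation ratio. The `baseline_ratio` of 1.0 (TX2 merely equal to Zen) is a conservative placeholder rather than a measured value, and the function name is made up for illustration.

```python
def relative_ipc_after_uplift(baseline_ratio, uplift_a, uplift_b):
    """Ratio IPC_a / IPC_b after each design gains its own generational uplift.

    baseline_ratio: previous-generation IPC_a / IPC_b (assumed, not measured)
    uplift_a, uplift_b: fractional IPC gains, e.g. 0.25 for ~25%
    """
    return baseline_ratio * (1 + uplift_a) / (1 + uplift_b)


# Conservative assumption: TX2 only matches Zen (ratio 1.0).
# Uplifts are the ~25% (TX3) and ~15% (Zen 2) figures quoted in this post.
ratio = relative_ipc_after_uplift(1.0, 0.25, 0.15)
print(f"TX3 / Zen 2 average IPC ratio: {ratio:.3f}")  # 1.087
```

Any baseline ratio above 1.0 only widens the result, so under these assumptions the conclusion does not depend on the exact TX2 vs Zen figure.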
> ThunderX3 does have a bigger icache and GPR RF, probably because that's needed to handle
> 4-way SMT. Branch predictor sizes aren't known (and that's a huge part of Zen 2). In any
> case it's not clear how much, if any, size advantage is attributable to ARM vs x86.
I gave a quote from Marvell with an estimate of the x86 tax: "roughly 20% to 25% smaller die area". The remaining 5-10% of the ~30% gap with Rome is due to differences in the µarches.
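The split can be written out as a small Python sketch. It assumes a simple additive decomposition of the ~30% core-size gap, and that the 20% to 25% figure quoted for the previous generation carries over to the 7nm comparison. Both are assumptions of this estimate, not something Marvell stated, and the helper name is hypothetical.

```python
def remaining_gap(total_pct, isa_pct):
    """Percentage points of the area gap left after subtracting the ISA share."""
    return total_pct - isa_pct


TOTAL_GAP = 30  # "roughly 30% smaller" TX3 core vs Rome core, from the quote

# Low and high ends of the quoted "roughly 20% to 25%" range.
for isa_share in (20, 25):
    uarch_share = remaining_gap(TOTAL_GAP, isa_share)
    print(f"ISA share {isa_share}% -> remaining {uarch_share}%")
```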
> Also ThunderX3 got cancelled, if that says anything about how great it is.
It was cancelled for general-purpose sockets, but repurposed for the "custom-only chip business". Its performance is great.
