By: Chester (lamchester.delete@this.gmail.com), September 1, 2021 10:03 am
Room: Moderated Discussions
> I would add that the shared decoders in Bulldozer were also a bottleneck without a uOp cache.
> This was fixed in Excavator which added a dedicated set of decoders per thread.
>
> With my armchair knowledge, I would argue that a shared decoder would have worked
> if it was backed by a uOp cache per thread and the decoder itself was wider.
>
That might help MT scaling with high-IPC code that fits in the L1/L2 caches, but I don't think it'd improve ST perf much, and ST perf was Bulldozer's biggest problem.
With scalar integer code, ST perf is held back by the small backend and long-latency caches. Increasing frontend bandwidth (a uop cache or wider decode) would be solving the wrong problem.
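To make that concrete, here's a minimal sketch (my own toy microbenchmark, nothing AMD-specific, sizes picked arbitrarily): a pointer chase where every iteration is one load that depends on the previous one. In this regime throughput tracks load latency, and decode width is irrelevant; no matter how wide the frontend is, the next load can't start until the previous one finishes.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NODES (1u << 20)   /* 64 MB of 64-byte nodes: blows past L1/L2 */
#define HOPS  10000000L

struct node { struct node *next; long pad[7]; }; /* one cache line each */

int main(void) {
    struct node *pool = malloc(sizeof *pool * NODES);
    size_t *idx = malloc(sizeof *idx * NODES);
    if (!pool || !idx) return 1;

    /* Build a random cyclic permutation so prefetchers can't hide latency. */
    for (size_t i = 0; i < NODES; i++) idx[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i + 1 < NODES; i++)
        pool[idx[i]].next = &pool[idx[i + 1]];
    pool[idx[NODES - 1]].next = &pool[idx[0]];

    /* The chase: one dependent load per iteration, so iterations per
       second == 1 / load latency. Frontend bandwidth never enters in. */
    struct node *p = &pool[idx[0]];
    clock_t t0 = clock();
    for (long i = 0; i < HOPS; i++) p = p->next;
    clock_t t1 = clock();

    printf("%p  %.1f ns/hop\n", (void *)p,
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / HOPS);
    free(idx); free(pool);
    return 0;
}

Double the decode width and ns/hop doesn't move; shorten the cache latency and it does. That's why I'd spend area on the L1D and the backend instead.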
A uop cache might also cut the branch mispredict penalty, since a restart can fetch already-decoded uops instead of going back through the decoders, but I don't think they had the die area for one anyway. If they did, they could solve more pressing problems by making the L1D bigger or beefing up the branch predictor.
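On the mispredict side, here's a quick sketch of what bad branches cost (again my own toy example, with arbitrary sizes): the same filter loop over random and then sorted data. The branch mispredicts roughly half the time in the random case and is nearly free in the sorted case; that delta is the penalty a stronger predictor avoids and a uop cache only partially shaves.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)

static long sum_if_big(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] >= 128) s += a[i];   /* ~50% mispredicts on random data */
    return s;
}

static int cmp(const void *x, const void *y) {
    return *(const int *)x - *(const int *)y; /* safe: values are 0..255 */
}

int main(void) {
    int *a = malloc(sizeof(int) * N);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_if_big(a, N);       /* unpredictable branch */
    clock_t t1 = clock();

    qsort(a, N, sizeof(int), cmp);    /* same data, now predictable */
    clock_t t2 = clock();
    long s2 = sum_if_big(a, N);
    clock_t t3 = clock();

    printf("random: %ld (%.3f s)  sorted: %ld (%.3f s)\n",
           s1, (double)(t1 - t0) / CLOCKS_PER_SEC,
           s2, (double)(t3 - t2) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}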