By: Anon (no.delete@this.spam.com), August 4, 2022 2:54 pm
Room: Moderated Discussions
NoSpammer (no.delete@this.spam.com) on August 4, 2022 3:18 am wrote:
> I guess, apart from stack housekeeping, the primary reason is many instructions have high or variable
> latency, which means that you will not be able to reuse the top of stack immediately. So you either
> need to shuffle or you continue with dependent instruction anyways and let OOO handle that. But still,
> for optimal execution it's more optimal to put instructions closer to the order of execution, to release
> resources earlier. Even if you use stacks the optimal number of addressable sources will be close to
> what the register requirements studies have found, so you will have to address about 16-32 somehow,
> if not directly then you will be shuffling like x87. So I see no advantage for stacks.
I don't understand your worries about optimal instruction placement in the context of deep OoO.
> Compare that to stack housekeeping and needing to resolve some of the semantic of the
> instructions before you are even able to evaluate dependencies properly. My bet would
> be that you would already need to predecode to get even on par at renaming stage.
Such a microarchitecture today would likely use a uop cache, or even a datagram cache, so the question is whether the decode costs more or less than a few extra memory fetches. I know memory fetches cost a lot, but I have no idea what the decode would cost for this hypothetical CPU.
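To make the dependency point concrete, here is a rough sketch of what the front end would have to do for a stack ISA before it can see dependencies at all. Everything in it is an assumption for illustration: the five-opcode toy ISA, the `STACK_EFFECT` table and the `rename_stack_code` helper are made up and do not describe any real CPU. It only shows that each instruction's sources depend on the stack effect of every instruction before it, which is the serial work a predecode stage or a uop cache would be hiding.

```python
# Hypothetical toy stack ISA: opcode -> (values popped, values pushed).
# This table is an illustrative assumption, not a real instruction set.
STACK_EFFECT = {
    "push": (0, 1),
    "add": (2, 1),
    "mul": (2, 1),
    "dup": (1, 2),
    "swap": (2, 2),
}


def rename_stack_code(program):
    """Turn stack code into (opcode, dest, sources) with explicit value tags.

    The symbolic stack must be walked in program order: an instruction's
    sources are unknown until the stack effect of all earlier instructions
    has been applied.
    """
    stack = []      # symbolic stack of value tags, top of stack is last
    next_tag = 0    # counter for fresh destination tags
    renamed = []
    for op in program:
        pops, _pushes = STACK_EFFECT[op]
        if len(stack) < pops:
            raise ValueError(f"stack underflow at {op!r}")
        # Pop the sources, then restore them to bottom-to-top order.
        sources = [stack.pop() for _ in range(pops)][::-1]
        if op == "dup":
            # Pure stack shuffle: no new value, only the mapping changes.
            stack += [sources[0], sources[0]]
            dest = None
        elif op == "swap":
            # Also a pure shuffle, similar in spirit to x87 fxch.
            stack += [sources[1], sources[0]]
            dest = None
        else:
            dest = f"t{next_tag}"
            next_tag += 1
            stack.append(dest)
        renamed.append((op, dest, sources))
    return renamed


if __name__ == "__main__":
    for entry in rename_stack_code(["push", "push", "add", "push", "mul"]):
        print(entry)
```

In this sketch the shuffles (`dup`, `swap`) never produce a new value, so they could be absorbed at rename, but they still have to be processed in order before the sources of the following instruction are known.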