By: Exophase (exophase.delete@this.gmail.com), August 25, 2016 11:32 am
Room: Moderated Discussions
Michael S (already5chosen.delete@this.yahoo.com) on August 25, 2016 10:36 am wrote:
> My mental model is based on the description of the Sandy Bridge microarchitecture in the Optimization
> Reference Manual. They don't state it with 100% certainty, but it appears that the decoder
> always processes fetched 16B chunks in their entirety, even when there is a predicted-taken
> conditional branch in the middle. Maybe even 32B chunks; I am not sure about that.
>
> One thing that is absolutely certain: "All micro-ops in a Way* represent instructions which are
> statically contiguous in the code and have their EIPs within the same aligned 32-byte region".
>
> * Way = 1/256th of the Decoded ICache; can hold 1 to 6 micro-ops.
>
>
There's also the description of instruction predecode:
"The predecode unit accepts the 16 bytes from the instruction cache and determines
the length of the instructions."
If this works like in previous P6-derived uarchs, and it probably does, the 16 bytes are taken from somewhere in the middle of the 32-byte chunk at the head of the prefetch buffer, so that the next instruction appears at the start of the 16 bytes.
If the four decoders could handle instructions from different 16-byte blocks, predecode would have to happen multiple times per cycle to avoid being a bottleneck. But then the 16-byte/cycle instruction fetch (which we know is the design) would become a bottleneck instead.
These are the problems trace caches are meant to solve (while creating other serious problems). But I think 4+ uops/cycle from the uop cache is enough to buffer out the shortfall from taken branches, except in really branchy code. Such branchy code probably doesn't tend to have a ton of ILP anyway.