By: Linus Torvalds (torvalds.delete@this.linux-foundation.org), October 7, 2015 4:49 am
Room: Moderated Discussions
David Kanter (dkanter.delete@this.realworldtech.com) on October 5, 2015 12:24 pm wrote:
>
> With Skylake, it's a little complex - I can see an argument for 6-wide (assuming
> hit in the uop cache) or 5-wide. Both of those are sustainable, although in the
> case of I$ hits, I'm not sure there is really enough fetch bandwidth.
I'd be very surprised if Skylake is "5-wide decode" in the sense that most people think of it and it seems to be talked about here.
In fact, Intel doesn't even say it's 5-wide. Intel says it's "5 uops max". It's possible that it never means "five instructions" - it might be that you only get five uops when one or more of the decoders end up decoding an instruction into multiple uops.
For example, I'd love to see better decoding of read-modify-write instructions, but on Intel big-cores those are traditionally two uops, and they only get decoded by the first decoder. Even though from a decoding standpoint, the read-modify-write instructions are not at all any harder to decode than the normal load-op instructions. It's the exact same modrm format.
So it might well be that Skylake just extends the second decoder that used to only generate a single uop to also be able to emit two uops, so that you can decode two of those memory op instructions in the same cycle. Really, from the standpoint of just parsing the instruction bytes in memory, there is no difference between "add memory to register" and "add register to memory". The only difference is in the uops they result in.
So the "one more uop than Haswell" by no means needs to mean "one more instruction". It could just as easily (in fact, I think more easily) be about just making some of the decoders a bit more flexible.
There are other limits to the x86 decoders that tend to be more painful than "4 instructions". Intel used to have something like a 16-byte total size decode limit. I have some memory of that being extended to 32 bytes from the pure "fetch from L1 I$" standpoint, but there's the whole instruction re-alignment and predecode issue, and there were limits on just how many bytes the ostensibly four instructions could span.
Depending on what the exact rules are, again it might be much more productive to increase those kinds of limits rather than try to go from four instructions to five. Since branch targets are often not aligned, it can be a big deal if you can fetch a full 64-byte cacheline in one chunk and re-align it, because you're then more likely to get the full theoretical three or four instructions decoded after a mispredicted branch.
(And you don't need to do a full unaligned byte shifter for instruction decode - even if you fetch 64 bytes at a time, maybe you'll only align something like 16-24 bytes of them into the decode buffer in order to capture that "likely next four instructions" data)
Side note: I find it interesting that Skylake apparently fixes the "prefetch NULL" problem.
In list handling, you often want to blindly prefetch the next pointer (trying to make it conditional on being valid would just increase the overhead of prefetching to the point where it hurts more than it helps, due to the inevitable branch misprediction at the end of the chain traversal), and almost every architecture I've seen gets this wrong, taking up TLB or memory pipeline resources for the NULL case. Which is absolutely horrible. We've found prefetching to basically never be a win in real life because the cost of the prefetch is too high.
That may be something we might want to look at in the kernel if Skylake fixed prefetch.
Linus