By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 11, 2020 6:16 am
Room: Moderated Discussions
Jeff S. (fakity.delete@this.fake.com) on October 8, 2020 12:16 pm wrote:
[snip]
> A pre-print from MICRO '20 later this month:
> Improving the Utilization of Micro-operation Caches in x86 Processors
>
> CLASP and Compaction (a) improve uop cache utilization/fetch ratio, dispatch bandwidth, average branch misprediction
> penalty and overall performance, and (b) reduce decoder power consumption. These optimizations combined improve
> performance by 5.3%, uop cache fetch ratio by 28.8% and dispatch bandwidth by 6.28%, while, reducing the decoder
> power consumption by 19.4% and branch misprediction latency by 5.23% in our workloads.
I have not finished reading that paper (motivation and mental acuity have not been sufficiently present at the same time), but the opening line appears counterfactual: "Most modern processors employ variable length, Complex Instruction Set Computing (CISC) instructions to reduce instruction fetch energy cost and bandwidth requirements." It is simply not true that most modern processors employ variable-length CISC instructions (unless, perhaps, one counts Thumb-2 as CISC, but later statements in the paper tend to exclude that).
On a lesser point, both x86 and zArchitecture use CISC for legacy reasons; code density was an original motivation (and for x86, single-byte instructions would have been helpful with early 8-bit memory interfaces), but if software compatibility (and ISA walls: patents, institutional knowledge, etc.) were not important, neither x86 nor zArchitecture would be continued. (Thumb-2 is not exactly an excellent encoding for its modern uses.)
(I think Renesas RX is the only modern commercial CISC. While CISC was chosen for code density, that was mainly for static code storage size, not fetch energy or bandwidth; for microcontrollers and some other embedded systems, static code size is very important.)
The claim that variable-length encoding is incompatible with low(ish)-overhead decode, such that µop caches are needed, seems to be contradicted by zArchitecture implementations lacking µop caches (as far as I recall). x86 is not just byte-granular with 15 different lengths but also has somewhat complex length determination; it is not a good example of variable-length (or variable-work) instruction encoding. (I think variable-length µop formats are likely to be beneficial in terms of storage cost and access energy. I suspect a distinct immediate storage area might not be worthwhile unless some other use of the storage provided higher or more balanced utilization; immediates sharing a µop cache line with base µops probably tends to balance utilization fairly well. The design space for µop caches seems large (and interesting).)
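To make that utilization point concrete, here is a toy C sketch; the 512-bit line, 56-bit base µop, and 128-bit dedicated immediate pool are numbers I made up for illustration, not parameters from the paper. It just checks whether a given mix of µops and immediates fits in a line when immediate storage is partitioned off versus when the whole line's bits are shared.

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative parameters only -- not taken from the paper. */
#define LINE_BITS     512   /* bits in one µop cache line            */
#define UOP_BITS       56   /* fixed-size base µop record            */
#define IMM_POOL_BITS 128   /* dedicated immediate pool (split case) */

/* Split design: separate fixed budgets for µops and immediates. */
static bool fits_split(int uops, int imm_bits)
{
    return uops * UOP_BITS <= LINE_BITS - IMM_POOL_BITS
        && imm_bits <= IMM_POOL_BITS;
}

/* Shared design: µops and immediates draw from one budget. */
static bool fits_shared(int uops, int imm_bits)
{
    return uops * UOP_BITS + imm_bits <= LINE_BITS;
}

int main(void)
{
    /* Immediate-heavy code: the split design's immediate pool overflows
       even though µop slots are left empty. */
    printf("imm-heavy: split=%d shared=%d\n", fits_split(4, 192), fits_shared(4, 192));
    /* µop-heavy code: the split design's immediate pool sits idle
       while µop slots run out. */
    printf("uop-heavy: split=%d shared=%d\n", fits_split(8, 0), fits_shared(8, 0));
    return 0;
}
```

With a single shared budget, immediate-heavy lines simply hold fewer µops and vice versa, which is the balancing effect I have in mind; real designs of course also pay tagging and alignment costs that this toy ignores.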
The storing of immediates in a heads-and-tails-like manner (Heidi Pan, "High Performance, Variable-Length Instruction Encodings", 2002 Master's thesis) was interesting; perhaps the 56-bit µop format excludes larger immediates. My initial thought was to wonder whether they compared against a grow-toward-the-middle shared µop cache line, but having immediates and base µops grow toward the middle would make that impractical (I think).
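For anyone unfamiliar with that kind of layout, here is a rough sketch of the packing mechanics as I understand them; the 64-byte line and 7-byte (56-bit) µop record are my own illustrative choices, not parameters from the paper or from Pan's thesis. Fixed-size µop records fill the line from the front while their immediates fill it from the back, and the line is full when the two regions would meet.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative heads-and-tails style µop cache line; sizes are assumptions. */
#define LINE_BYTES 64
#define UOP_BYTES   7   /* 56-bit base µop record */

typedef struct {
    uint8_t bytes[LINE_BYTES];
    int     uop_end;    /* µop records occupy [0, uop_end)                */
    int     imm_start;  /* immediate bytes occupy [imm_start, LINE_BYTES) */
} uop_line;

static void line_init(uop_line *l)
{
    memset(l->bytes, 0, sizeof l->bytes);
    l->uop_end   = 0;
    l->imm_start = LINE_BYTES;
}

/* Append one µop from the front and its immediate (possibly empty)
   from the back; fail when the two regions would overlap. */
static bool line_append(uop_line *l, const uint8_t uop[UOP_BYTES],
                        const uint8_t *imm, int imm_bytes)
{
    if (l->uop_end + UOP_BYTES > l->imm_start - imm_bytes)
        return false;   /* line full: allocate a new line instead */
    memcpy(&l->bytes[l->uop_end], uop, UOP_BYTES);
    l->uop_end += UOP_BYTES;
    if (imm_bytes > 0) {
        l->imm_start -= imm_bytes;
        memcpy(&l->bytes[l->imm_start], imm, imm_bytes);
    }
    return true;
}

int main(void)
{
    uop_line l;
    uint8_t uop[UOP_BYTES] = {0};
    uint8_t imm4[4] = {0};
    line_init(&l);
    while (line_append(&l, uop, imm4, sizeof imm4)) { /* pack until full */ }
    return (l.uop_end <= l.imm_start) ? 0 : 1;   /* the regions never overlap */
}
```

A real design would also need per-µop metadata to locate each immediate within the line, which this sketch leaves out.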
Thank you for sharing the paper. I do hope to finish it soon, but I want to be able to give it proper attention.