By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 16, 2020 6:20 am
Room: Moderated Discussions
anon (anon.delete@this.ymous.org) on October 15, 2020 11:56 am wrote:
[snip]
> Thank you for this detailed analysis. I haven't read the paper thoroughly yet, but I wanted
> to discuss one of your comments about reordering uops and rename optimizations.
>
> Regarding reordering, the problem here is that you cannot generally rename out-of-order because
> although this might not have any impact at first glance (What's the difference between "add rax, rcx;
> ld rbx, [rdx]; add r12, rax" and "ld rbx, [rdx]; add rax, rcx; add r12, rax"?), I think it gets messy
> if you want precise exceptions/interruptions. So, you probably can rename out-of-order but you need
> to map those out-of-order mappings back to a ROB-like structure that is allocated earlier than rename
> in the pipeline and it might be weird. However, that is an interesting thought because I know of some
> designs where the RAT is port-limited and so rename groups with at most x reads/writes have to be formed,
> which may not match what is coming out of Decode (compiler could help though).
Since exceptions are exceptional, one could do replay from the Icache (assuming an Icache that is data-inclusive of the µop cache, which seems typical). This is vaguely similar to one of the POWER implementations (POWER5?) using replay with single-operation bundles on exceptions so that ROB overhead could be reduced. If the Icache were not data-inclusive of the µop cache (as opposed to just tag-inclusive to support snooping), untangling the µop order would require extra information and be a bit complex.
> On the rename optimization thing ("rewriting"). I am not
> sure I followed the idea. Could you please elaborate?
With respect to renaming, if a source operand is the destination of a previous µop, one can replace that register name with the number of the µop that provides the value (assuming single-result µops). When renaming, those sources would not read the RAT; they would get their new names from the free list, i.e., from the entries just allocated to the producing µops. This is just caching the dependence information. Detecting dependencies before RAT access would increase latency (yet allow reading from the free list and not the RAT), while detecting them in parallel with RAT access would increase RAT port demand. Caching the information provides the benefit of the former without the latency cost (and could save some energy as well).
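To make the caching idea concrete, here is a minimal Python sketch of it. This is only a toy model under stated assumptions, not a description of any real design: the names `Uop`, `annotate_group`, and `rename_group` are hypothetical, a rename group is assumed to coincide with a µop cache line, every µop has at most one result, and physical registers are plain integers. The annotation step stands in for work done once at µop cache fill; the rename step then counts how many RAT reads remain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical toy model; none of these names come from a real design.

@dataclass
class Uop:
    dest: Optional[str]   # architectural destination register (single result)
    srcs: List[str]       # architectural source registers

# A cached source is ("reg", name), which still needs a RAT read, or
# ("uop", index), which names an earlier µop in the same group.
Src = Tuple[str, Union[str, int]]

def annotate_group(group: List[Uop]) -> List[List[Src]]:
    """Done once at µop cache fill: record intra-group producers."""
    last_writer: Dict[str, int] = {}
    cached: List[List[Src]] = []
    for i, u in enumerate(group):
        cached.append([("uop", last_writer[s]) if s in last_writer else ("reg", s)
                       for s in u.srcs])
        if u.dest is not None:
            last_writer[u.dest] = i
    return cached

def rename_group(group: List[Uop], cached: List[List[Src]],
                 rat: Dict[str, int], free_list: List[int]):
    """Rename one group; returns (renamed µops, number of RAT reads)."""
    # Each destination takes the next physical register from the free list.
    new_names = [free_list.pop(0) if u.dest is not None else None for u in group]
    rat_reads = 0
    renamed = []
    for i in range(len(group)):
        phys_srcs = []
        for kind, val in cached[i]:
            if kind == "uop":
                # Intra-group dependence: use the name just taken from the
                # free list for the producer; no RAT port is needed.
                phys_srcs.append(new_names[val])
            else:
                phys_srcs.append(rat[val])
                rat_reads += 1
        renamed.append((new_names[i], phys_srcs))
    # Update the RAT in program order so the youngest writer wins.
    for i, u in enumerate(group):
        if u.dest is not None:
            rat[u.dest] = new_names[i]
    return renamed, rat_reads

# The sequence from the quoted post: add rax, rcx; ld rbx, [rdx]; add r12, rax
group = [Uop("rax", ["rax", "rcx"]), Uop("rbx", ["rdx"]), Uop("r12", ["r12", "rax"])]
cached = annotate_group(group)
```

In this example the `rax` source of the last `add` is cached as a reference to µop 0, so the group needs four RAT reads rather than five. The saving obviously grows with the number of intra-group dependences.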
(One could also imagine other optimizations, some of which would depend on the design of the scheduler. For example, a scheduler might use indexed wake-up rather than broadcast comparison when one operation is known to provide the last-to-be-available source of another operation; if replay is cheap, a prediction of last-to-be-available might suffice.)
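The difference between the two wake-up styles can be sketched in a few lines of Python. Again this is a toy model with assumptions of my own: `Entry`, `broadcast_wakeup`, `indexed_wakeup`, and the `last_src_consumer` field are hypothetical names, tags are integers, and a "comparison" is counted once per pending source tag, which is only a rough stand-in for CAM activity.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical toy scheduler; not modeled on any specific implementation.

@dataclass
class Entry:
    dest_tag: int
    pending: Set[int]                        # source tags not yet produced
    last_src_consumer: Optional[int] = None  # cached hint: entry to wake directly

def broadcast_wakeup(entries: List[Entry], done_tag: int) -> Tuple[List[int], int]:
    """Every waiting entry compares the completing tag against its sources."""
    comparisons = 0
    woken = []
    for i, e in enumerate(entries):
        if not e.pending:
            continue
        comparisons += len(e.pending)
        if done_tag in e.pending:
            e.pending.discard(done_tag)
            if not e.pending:
                woken.append(i)
    return woken, comparisons

def indexed_wakeup(entries: List[Entry], producer_idx: int) -> Optional[Tuple[int, bool]]:
    """Wake only the recorded consumer, with no tag comparison.

    Returns (consumer index, needs_replay). needs_replay is True when the
    last-to-be-available prediction was wrong and a source is still missing.
    """
    producer = entries[producer_idx]
    c = producer.last_src_consumer
    if c is None:
        return None
    consumer = entries[c]
    consumer.pending.discard(producer.dest_tag)
    return c, bool(consumer.pending)
```

With a correct hint, the indexed path wakes the one consumer without touching any other entry; with a wrong hint, it reports that a replay is needed, which is why cheap replay matters for the predicted variant.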