By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), October 18, 2020 1:28 pm
Room: Moderated Discussions
anon (anon.delete@this.ymous.org) on October 16, 2020 12:36 pm wrote:
[snip]
> Thank you, I think I got it. I will point out that it could be possible to implement this directly at the I-Cache
> by rewriting a custom format on a line fill if the instruction format permits it.
Yes, and this would seem likely to be very messy for x86.
For an Icache one would also probably want to consider basic block boundaries. (One alternative might be to use fetch chunk alignment with simple reversal of the renaming simplification when jumping into the middle of a chunk.)
The boundary between predecoded Icache and µop cache may be a bit fuzzy. (I suspect the boundary between trace cache and µop cache may also be fuzzy.)
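Here is a rough Python sketch of rewriting into a custom format at line fill, assuming a hypothetical fixed-width 32-bit encoding; the field positions and branch opcodes are invented for illustration (x86's variable-length encoding is exactly what would make this messy):

def predecode(word):
    """Extract fields once at fill so fetch need not re-extract them."""
    opcode = (word >> 26) & 0x3F
    dst    = (word >> 21) & 0x1F
    src1   = (word >> 16) & 0x1F
    src2   = (word >> 11) & 0x1F
    return {"op": opcode, "dst": dst, "srcs": (src1, src2),
            "ends_block": opcode in (0x30, 0x31)}  # assumed branch opcodes

def fill_line(words):
    """On a line fill, store the rewritten (predecoded) format.
    Basic-block boundaries matter because intra-line dependence hints
    are only valid for fall-through execution; a jump into the middle
    of the line must reverse the renaming simplification."""
    line = [predecode(w) for w in words]
    line[0]["starts_block"] = True
    for i in range(1, len(line)):
        line[i]["starts_block"] = line[i - 1]["ends_block"]
    return line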
> If not, then doing it in the
> uop cache does seem like the place, even though in practice it becomes hard to tell if it will impact cycle time
> or not. Essentially, another level of choice is added at Rename: who uses the free list and who uses the RAT
> (for sources). It might not be a big deal but it means that each "lane" from Decode has to enable a different
> circuit whereas in the baseline each "lane" would go straight to the RAT, period (for sources).
The free list readers would be necessary anyway (and since they are dependent operations they can be delayed).
With fetch based on a larger chunk that does not cross instruction boundaries (with trailing immediates not counted as part of an instruction), it seems one could rearrange instruction components to merge like portions if doing so simplified routing.
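A rough Python sketch of that extra level of choice at Rename (all structures simplified and the bundle format assumed): a source whose producer is earlier in the same dispatch bundle takes the producer's freshly allocated physical register instead of reading the RAT. In hardware all RAT reads for a bundle happen in parallel, so younger operations must get intra-bundle producers through a bypass; the RAT itself is updated after the bundle.

from collections import deque

NUM_ARCH_REGS = 32
free_list = deque(range(NUM_ARCH_REGS, 64))  # free physical registers
rat = list(range(NUM_ARCH_REGS))             # arch reg -> phys reg

def rename_bundle(bundle):
    """bundle: list of (dst_arch, [src_arch, ...]) tuples."""
    renamed = []
    in_bundle = {}  # arch reg -> phys reg written earlier in this bundle
    for dst, srcs in bundle:
        # The added choice: in-bundle bypass versus RAT read.
        # Baseline would simply be: phys_srcs = [rat[s] for s in srcs]
        phys_srcs = [in_bundle.get(s, rat[s]) for s in srcs]
        phys_dst = free_list.popleft()  # free list read (needed anyway)
        renamed.append((phys_dst, phys_srcs))
        in_bundle[dst] = phys_dst       # visible to younger ops in bundle
    for dst in in_bundle:               # commit new mappings to the RAT
        rat[dst] = in_bundle[dst]
    return renamed

# r3 = f(r1, r2); r4 = f(r3, r1): the second op's r3 source comes from
# the in-bundle bypass rather than from a RAT read.
print(rename_bundle([(3, [1, 2]), (4, [3, 1])]))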
> Indexed wake up on the scheduler has been argued to be much more power efficient (e.g. TRIPS), but in this
> case, it would be added on top of the baseline broadcast wake up, and so it sounds like a hard sell to
> the circuit designer, especially for something that would only work if the source and the last destination
> are in the same dispatch bundle. Maybe an indexed write port used for dispatch could be stolen...
Result consumers tend to be close to result producers (for single-cycle operations this has obvious benefits; compilation specifically targeting an out-of-order implementation might also increase the proximity of consumer and producer), so (for some workloads) proximity might be a common case.
This might be a design aspect that works well with CAM-sharing designs intended to exploit variable numbers of not-ready operands; an index-triggered operation would be a zero-CAM-operand operation, leaving two (or three) CAMs to be associated with a different operation. (Paired resource sharing might be practical, but one can imagine funkier sharing where each operand tag slot that can be allocated to one operation can be used by a different operation while still using physically adjacent connections.)
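A rough Python sketch (data structures assumed) contrasting broadcast wakeup, where every waiting entry compares every result tag, with index-triggered wakeup, where the producer pokes a recorded scheduler slot directly and that operand needs no CAM compare at all:

class Entry:
    def __init__(self, src_tags):
        self.not_ready = set(src_tags)  # tags still awaited (CAM side)
        self.indexed_waits = 0          # operands awaiting a direct poke

entries = {}  # scheduler slot -> Entry

def broadcast(result_tag):
    """Baseline: one CAM compare per waiting operand, in every entry."""
    for e in entries.values():
        e.not_ready.discard(result_tag)

def indexed_wakeup(slot):
    """The producer recorded its consumer's slot, which it can only
    know if producer and consumer were handled together (e.g., the
    same dispatch bundle); no tag comparison is needed."""
    entries[slot].indexed_waits -= 1

def ready(slot):
    e = entries[slot]
    return not e.not_ready and e.indexed_waits == 0

# Op in slot 7 waits on tag 42 via CAM and on one index-poking producer.
entries[7] = Entry([42])
entries[7].indexed_waits = 1
broadcast(42)
indexed_wakeup(7)
assert ready(7)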
I think there was also research on using indexing for the dependencies of operations that depend on a data cache miss (where there is time to examine dependencies and move tracking information around).
I do wonder if a limited dependency network (less general than all-to-all) exists that would map well to "typical" programs and be justifiable by its area and power savings. There should also be ways to exploit recurring schedules (similar to software pipelining of loops; this has been proposed academically) and correlated latencies (cache misses to the same cache block, or "cold" first-use misses; with I/O coming through the processor chip and potentially being cached, it is not clear anything could be classified as a first-use miss).
Reducing the amount of routing, comparison, and selection work may be a useful goal, but seeking to exploit higher level (e.g., task) parallelism might be more productive. (I do not think the metaphor of fighting the previous war quite applies, but priorities and tradeoffs do seem to shift.)
With arbitrary code, a specialized design will typically deliver lower performance in some cases. Since the x86 market places significant value on performance compatibility, specializations (or even just changing how well the 95% of existing code runs or how poorly the 5% runs) are probably less attractive.
[The above was a bit more digressive than usual even for me, but some of the lurkers might benefit a little.]