By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), March 26, 2021 8:21 am
Room: Moderated Discussions
Moritz (better.delete@this.not.tell) on March 20, 2021 5:21 am wrote:
> What if you could completely rethink the general processor concept?
As already noted, this is a huge topic; it deserves a response on the scale of Donald Knuth's The Art of Computer Programming.
The microthread theme of exploiting (at least partially explicit) coarse-grained parallelism (including speculative parallelism) is one obvious point of attraction. (I do not recall anyone mentioning the decoupling of aspects of processing. Where a simple pipeline uses a single-entry buffer, modern pipelines provide larger buffers at various stages. Fetch and scheduling are now substantially decoupled, and data prefetch provides some decoupling of data loading. While threading provides such decoupling, there are probably opportunities for decoupling where communication and synchronization/control-flow are still somewhat tightly integrated.)
Coordinating communication between general-purpose processing agents, accelerators, I/O agents, and storage/memory (and the interconnect) seems a significant design consideration. While memory-mapped I/O provides a useful abstraction (particularly for programming in a C-like language), better interfaces can likely be devised. Architecting interrupts as procedure calls from remote agents with arguments seems attractive; such would also provide an interface for inter-thread interrupts and communication. (Along similar lines, most uses of MWAIT would seem to benefit from also returning the value in the newly changed memory location; this would not be a substantial benefit, but requiring a separate explicit load operation seems wasteful. Note also that this mechanism ties in with a thread stalled on a cache miss, which is effectively an MWAIT waiting for a value to return from memory rather than from another thread that has yet to store it.)
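A minimal C sketch of the combined wait-and-load idea, under the assumption of a hypothetical primitive (mwait_load is not a real intrinsic; a spin loop stands in for the architected sleep):

    #include <stdint.h>

    /* Hypothetical combined wait-and-load primitive (not an existing ISA
     * intrinsic): blocks until *addr is observed to differ from 'old' and
     * returns the newly observed value, avoiding the separate explicit
     * load that typically follows MONITOR/MWAIT. */
    static inline uint64_t mwait_load(volatile uint64_t *addr, uint64_t old)
    {
        uint64_t v;
        /* The spin stands in for the architected wait; real hardware
         * would sleep the thread until the monitored line is written. */
        while ((v = *addr) == old)
            ; /* wait */
        return v;
    }

A consumer would write v = mwait_load(&flag, old) instead of an MWAIT followed by a separate load of flag.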
Spatial (functional units, clusters, core groups, et al.) and temporal (lifetime) value locality can also be exploited in storage location and access mechanism. Some degree of random access for at least some operands might be avoided, saving area and power (latches vs. register file entries). As with cache coherence, intra-core communication has a broadcast bias which is overkill for the common case: many results have one temporally proximate consumer. Something vaguely reminiscent of Transport Triggered Architecture might reduce such communication (integrating such transports into operand-capture-style dynamic scheduling seems plausible). Partitioning can reduce latency and energy (if local accesses are sufficiently common); while cluster-private caches have low utilization and/or replication/storage-waste issues, I am optimistic that software optimization could mitigate these.
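To make the transport-triggered flavor concrete, here is a hedged sketch in C, modeling function-unit ports as memory-mapped registers (the addresses and port names are invented for illustration; a real TTA architects these as transport destinations, not MMIO):

    #include <stdint.h>

    /* Hypothetical memory-mapped function-unit ports. */
    volatile uint64_t *const ADD_OPERAND = (volatile uint64_t *)0x1000; /* latched input */
    volatile uint64_t *const ADD_TRIGGER = (volatile uint64_t *)0x1008; /* write starts the add */
    volatile uint64_t *const ADD_RESULT  = (volatile uint64_t *)0x1010; /* read the sum */
    volatile uint64_t *const MUL_OPERAND = (volatile uint64_t *)0x1020;
    volatile uint64_t *const MUL_TRIGGER = (volatile uint64_t *)0x1028;

    /* (a + b) * c expressed as transports: each value moves point-to-point
     * to the one consumer that needs it, instead of being broadcast on a
     * shared result bus as in conventional operand-capture scheduling. */
    void mul_add(uint64_t a, uint64_t b, uint64_t c)
    {
        *ADD_OPERAND = a;            /* move a to the adder's operand latch    */
        *ADD_TRIGGER = b;            /* move b to the trigger port; add fires  */
        *MUL_OPERAND = *ADD_RESULT;  /* route the sum straight to the multiplier */
        *MUL_TRIGGER = c;            /* multiply fires */
    }

The point is that the sum never visits a register file or a broadcast network; it travels once, to its single consumer.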
(Diverse criticality of data also seems to be underexploited. Academic papers have proposed cache replacement policies that take into account prefetchability and criticality for branch misprediction correction and pointer chasing, but even at L1 not all loads are equally urgent. A load with a lax schedule could use phased tag-data access [way prediction reduces this advantage] or yield its place on a bank conflict.)
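A toy model of the phased-access trade-off, in C (the set structure and counts are simplified for illustration; only the energy accounting matters here):

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4

    struct cache_set {
        uint64_t tag[WAYS];
        bool     valid[WAYS];
        uint64_t data[WAYS];   /* one word per line, for illustration */
    };

    /* A critical load probes tag and data arrays for all ways in parallel
     * (fast, but reads WAYS data arrays); a non-critical load does phased
     * access, reading the data array only for the matching way (an extra
     * cycle, but one data-array read). */
    bool cache_read(struct cache_set *s, uint64_t tag, bool critical,
                    uint64_t *out, unsigned *data_reads)
    {
        if (critical) {
            *data_reads = WAYS;                 /* parallel tag+data probe */
            for (int w = 0; w < WAYS; w++)
                if (s->valid[w] && s->tag[w] == tag) { *out = s->data[w]; return true; }
            return false;
        }
        for (int w = 0; w < WAYS; w++)          /* phase 1: tags only */
            if (s->valid[w] && s->tag[w] == tag) {
                *data_reads = 1;                /* phase 2: one data read */
                *out = s->data[w];
                return true;
            }
        *data_reads = 0;
        return false;
    }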
There may be opportunities for scheduling and communication optimization in hoisting a set of loads in front of the computation. The cost of staging potentially unused data may not be high in some cases (and hoisting also provides a degree of access/execute decoupling).
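A source-level illustration of the transformation (a wide out-of-order core extracts much of this on its own, but the shape is the same; the strides are arbitrary, chosen so each access misses a different line):

    /* Naive form: on a narrow or in-order core the misses tend to be
     * discovered one at a time behind the accumulating adds. */
    double sum4_serial(const double *p)
    {
        double s = p[0];
        s += p[64];
        s += p[128];
        s += p[192];
        return s;
    }

    /* Hoisted form: all loads issue before any arithmetic, exposing the
     * four misses at once (memory-level parallelism), at the cost of
     * staging values that a branch might later make unnecessary. */
    double sum4_hoisted(const double *p)
    {
        double a = p[0], b = p[64], c = p[128], d = p[192];
        return (a + b) + (c + d);
    }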
(Cache access width and block size might be exploited for lower energy when nearby members of a structure are often accessed with temporal locality (cf. signature cache). This could also facilitate ECC with less read-modify-write overhead: if part of an ECC word is read, the whole word could be cached if a modification is likely in the near future. Compile-time metadata might be worth providing for such cases. There may also be cases where the L1 cache might act somewhat more like vector registers, caching gather operations for reuse or even just decoupling load [into-L1]/organization from operation.)
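A sketch of the partial-write problem being referred to, in C (the check-bit function is a stand-in, not real SEC-DED; the 8-byte ECC word size is an assumption):

    #include <stdint.h>

    struct ecc_word {
        uint64_t data;
        uint8_t  check;            /* check bits, abstracted */
    };

    static uint8_t ecc_compute(uint64_t d)   /* stand-in: XOR parity only */
    {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++)
            p ^= (uint8_t)(d >> (8 * i));
        return p;
    }

    /* A store narrower than the ECC word must read the whole word, merge
     * the new byte, recompute the check bits, and write everything back. */
    void store_byte(struct ecc_word *w, unsigned byte, uint8_t v)
    {
        uint64_t d = w->data;                      /* read   */
        d &= ~((uint64_t)0xff << (8 * byte));
        d |= (uint64_t)v << (8 * byte);            /* modify */
        w->data  = d;                              /* write  */
        w->check = ecc_compute(d);
        /* If an earlier partial read had already cached the whole ECC
         * word, as suggested above, the read phase would hit locally and
         * most of the RMW cost would disappear. */
    }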
Since many resources are shared (even without multithreaded cores, cache capacity and bandwidth are typically shared at some level, memory bandwidth is shared, and thermal headroom and power are shared), managing this sharing seems significant. There may be some use for market-oriented bidding for resources, but monopoly grants may also have a place. The value of a resource to a consumer is not fixed by simple supply and demand but depends on the relative estimated utility to that particular use. (The overhead of managing budgets, bids, and other accounting would constrain how extensively market economics could be applied. There are also significant differences between a computer system and human systems.)
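A toy version of such an arbiter, in C (the budget/utility fields and epoch accounting are invented for illustration; a monopoly grant would simply bypass the auction for one agent):

    #include <stddef.h>

    struct agent {
        unsigned budget;    /* accounting units remaining this epoch */
        unsigned utility;   /* estimated benefit of one more unit    */
    };

    /* Grant one unit of a shared resource (e.g., a cache way) to the
     * highest bidder, debiting its budget so no agent can monopolize by
     * default. Returns the winner's index, or -1 if no one can bid. */
    int grant_unit(struct agent *agents, size_t n)
    {
        int best = -1;
        unsigned best_bid = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned bid = agents[i].utility < agents[i].budget
                         ? agents[i].utility : agents[i].budget;
            if (bid > best_bid) {
                best_bid = bid;
                best = (int)i;
            }
        }
        if (best >= 0)
            agents[best].budget -= best_bid;   /* pay the bid */
        return best;
    }

Bidding estimated utility capped by a budget is one simple policy; the post's caveat about accounting overhead applies directly, since even this loop is per-grant work.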
Coordinating architecture and microarchitecture applies the principle of work caching (do not put off until decode what can be done at compile time). Caching all possible work is obviously foolish (e.g., having every instruction encoded in memory as the actual control signals without compression), but I believe a lot of work is unnecessarily redundant. E.g., the loading and storing of return addresses seems unnecessarily redundant given a return address stack predictor, though a RAS predictor overflows and may not perfectly handle misspeculation. (Similarly, the redundancy between PC-relative branch BTB entries and the code itself might be reduced with architectural help. This can be done microarchitecturally, but cooperation seems likely to be helpful.)
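For readers unfamiliar with the structure, a minimal return address stack predictor in C, showing exactly where the caveats bite (depth and circular-overwrite policy are illustrative choices):

    #include <stdint.h>

    #define RAS_DEPTH 16

    struct ras {
        uint64_t entry[RAS_DEPTH];
        unsigned top;               /* wraps: overflow loses oldest entries */
    };

    /* On a call: push the return address. Because the stack is a small
     * circular buffer, deep recursion silently overwrites old entries --
     * the overflow case mentioned above. */
    static void ras_push(struct ras *r, uint64_t ret_addr)
    {
        r->top = (r->top + 1) % RAS_DEPTH;
        r->entry[r->top] = ret_addr;
    }

    /* On a return: pop a prediction. A misspeculated call or return that
     * moved 'top' before being squashed corrupts later predictions unless
     * the pointer (or the entries) are checkpointed and repaired. */
    static uint64_t ras_pop(struct ras *r)
    {
        uint64_t p = r->entry[r->top];
        r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
        return p;   /* a prediction, verified against the actual return */
    }

The architectural load/store of the return address duplicates this information in the common case; it earns its keep only in the overflow and misrepair cases.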
This is a huge topic; this post has not even explored all of the first-level tangents from a few basic concepts.