By: rwessel (rwessel.delete@this.yahoo.com), March 21, 2021 5:02 am
Room: Moderated Discussions
Moritz (better.delete@this.not.tell) on March 20, 2021 5:21 am wrote:
> What if you could completely rethink the general processor concept?
> There are concepts that were without alternative in the days of little memory and few transistors:
> Sequential instructions by storage address and jumps based on that address
> Implicit dependency based on above principle
> Explicit naming of storage place rather than data item
> Explicit caching into registers
> Implicit addressing of registers
> Mixing of memory, float, integer instructions in one instruction stream
> that must be analyzed to remove the assumed sequentiality.
> The ISA used to represent the physical architecture, today that
> is no longer the case in high performance microprocessors.
> The data modifies the program flow at run-time, instead of explicitly generating the data stream
> that reaches the execution units. The CPU steps through the program issuing the data to EUs instead
> of the program explicitly generating multiple data streams with synchronization markers.
> ... and many other implications that are so "natural" to us that we can not see/name them. As usual
> we can not even question the ways, because we are so used to them. There are infinite bad ways of doing
> it, but some of those forced/obvious (legacy) design decisions of the past might no longer be that
> necessary/without alternative. Some ways that seem cumbersome and wasteful might on second thought
> turn out to be hard on the human, but open new ways to the compiler, RTE, OS, CPU removing as much
> complexity as they add, but increasing throughput or energy efficiency beyond the current limit.
To a large extent none of that matters. Instruction-oriented ISA (mis)features will, at worst, impact performance and power consumption by a few tens of percent, and most of the time considerably less than that. Even within an ISA it's not that hard to deprecate a particularly poor feature - just punt it to slow microcode, and people will stop using it. As fewer people use it, punt it to even slower microcode, with essentially no implementation overhead.
We will get some increases in performance as chip technology improves, but we'll largely get that anyway, and I rather doubt that many people are expecting large increases there. We've managed what, a factor of ten in single-core integer performance over the last 15 years? And half of that from compilers painfully clawing out a bit of autovectorization on semi-hacked benchmarks. While spending something like 100 times as many transistors per core?
Let's quote Gene Amdahl (1967):
For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.
Demonstration is made of the continued validity of the single processor approach and of the weaknesses of the multiple processor approach in terms of application to real problems and their attendant irregularities.
We've now reached the point that the "continued validity of the single processor approach" has been pretty solidly demonstrated to be false.
Which means that if there's any real hope for large increases in performance, it has to come from parallelism. We have mechanisms for some embarrassingly parallel (EP) problems now (GPGPUs, vectors), and for EP-ish problems with limited communication between threads (standard multiprocessors, scaling up to clusters for code requiring really limited communication).
But if you're fortunate enough to have an EP problem, a few tens of percent difference in single core performance mostly doesn't matter anyway.
Of course we've been failing, pretty miserably (outside some limited areas), to generally use parallelism for better than four decades now. For the data-parallel approaches (vectors, etc.), it's even worse.
One of the biggest problems with multi-threaded code is the very high overhead of IPC. It makes it largely impossible to split off small units of work, and really hard to split off medium-sized units of work. And if all you can actually split off is large units of work, we're back to handling only EP problems, except for things where we can justify heroics.
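To put a rough illustration on that, here's a toy C++ sketch (not a rigorous benchmark; the sizes and names are arbitrary) comparing doing a tiny unit of work inline against handing each unit to a fresh OS thread. On typical systems the threaded version is orders of magnitude slower than the work itself, which is exactly why small units of work can't be split off today.

// Toy illustration, not a rigorous benchmark: the cost of handing a tiny
// unit of work to another OS thread dwarfs the work itself.
#include <chrono>
#include <cstdio>
#include <thread>

static volatile long sink;      // keeps the "work" from being optimized away

static void tiny_work() {       // a small unit of work - a few dozen instructions
    long s = 0;
    for (int i = 0; i < 16; ++i) s += i;
    sink = s;
}

int main() {
    using Clock = std::chrono::steady_clock;
    const int iters = 10000;

    auto t0 = Clock::now();
    for (int i = 0; i < iters; ++i) tiny_work();   // done inline
    auto t1 = Clock::now();
    for (int i = 0; i < iters; ++i) {
        std::thread t(tiny_work);                  // one OS thread per unit
        t.join();
    }
    auto t2 = Clock::now();

    auto ns = [](auto d) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(d).count());
    };
    std::printf("inline:   %lld ns total\n", ns(t1 - t0));
    std::printf("threaded: %lld ns total\n", ns(t2 - t1));
}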
Fundamentally there's literally no room in any of that for the OS. Similar to how software TLB fills are a really bad idea, CPUs need to handle scheduling, intercontext calls, synchronization objects, interrupts and the like *without* the OS getting involved (the OS would obviously be involved in setup). Thus CPUs need to hold multiple contexts (think a "TLB" for contexts), and be able to switch between them rapidly (IOW, SMT). A signal or message to a blocked thread whose context is held by the current core should dispatch in a couple of cycles; across the system, in about the time a cache coherency event needs to traverse the network. At the very small level, something like microthreads, with dispatch and synchronization times on the order of a cycle. It should be reasonable to throw a fork-microthread/wait-microthread around a dozen instructions.
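As a concrete (and entirely hypothetical) sketch of what that might look like to software: uthread_fork/uthread_wait below are invented names, stubbed with plain function calls so the sketch compiles; on hardware along the lines described above they would be single instructions with roughly one-cycle dispatch and join, which is what would make splitting even a short loop across contexts worthwhile.

// Hypothetical microthread fork/join sketch.  uthread_fork/uthread_wait are
// NOT a real API; on the imagined hardware they would be single instructions
// that hand work to a spare context on the same core and join it ~1 cycle
// later.  The stubs below just run the work inline so this compiles.
#include <cstddef>

using uthread_t = int;

static void add_chunk(float* dst, const float* a, const float* b, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = a[i] + b[i];   // short, tight loop
}

// Hypothetical primitives (stubbed): fork dispatches the function to another
// hardware context, wait blocks until that microthread retires.
static uthread_t uthread_fork(void (*fn)(float*, const float*, const float*, size_t),
                              float* d, const float* a, const float* b, size_t n) {
    fn(d, a, b, n);     // stub: run inline
    return 0;
}
static void uthread_wait(uthread_t) {}

// With ~1-cycle fork/wait, even a tiny chunk of an inner loop is worth
// peeling off to another context.
void add_vectors(float* dst, const float* a, const float* b, size_t n) {
    size_t half = n / 2;
    uthread_t t = uthread_fork(add_chunk, dst, a, b, half);   // first half elsewhere
    add_chunk(dst + half, a + half, b + half, n - half);      // second half here
    uthread_wait(t);                                          // ~1-cycle join
}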
Many sorts of I/O need to be possible at the user level, which means something like IOMMUs smart enough that the OS can set them up for a process and then avoid getting involved in the actual I/Os (interrupt/signal delivery from the I/O device shouldn't involve the OS either). That doesn't preclude devices that aren't capable of working in that environment, but handling those would be no worse than what we have now.
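A rough sketch of that fast path, loosely in the spirit of existing user-space submission-queue designs; the register layout and descriptor format here are invented for illustration. After one-time OS/IOMMU setup, the process posts a descriptor and rings a doorbell, with no system call involved.

// Sketch of user-level I/O submission after one-time OS setup.  The
// descriptor format and "doorbell" are hypothetical; the atomic pointer
// stands in for a memory-mapped device register.  Queue-full handling and
// completions are omitted.
#include <atomic>
#include <cstdint>

struct IoDescriptor {               // invented format
    uint64_t user_buffer;           // user virtual address; the IOMMU translates it
    uint32_t length;
    uint32_t opcode;                // e.g. read/write
};

struct IoQueue {                    // mapped into the process by the OS at setup
    IoDescriptor* ring;             // submission ring in user memory
    uint32_t      entries;
    std::atomic<uint32_t>* doorbell;  // device register mapped into user space
    uint32_t      tail = 0;
};

// Fast path: no OS involvement.  The IOMMU, programmed at setup time,
// restricts the device to this process's buffers.
inline void submit(IoQueue& q, const IoDescriptor& d) {
    q.ring[q.tail % q.entries] = d;                        // post descriptor
    q.tail++;
    q.doorbell->store(q.tail, std::memory_order_release);  // ring doorbell
}

Completions would flow back the same way, via a user-mapped completion ring plus the user-level interrupt/signal delivery mentioned above.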
On the other hand it's not clear that this doesn't fall right back into the "magic compiler" trap, but parallelism is the only thing that might save us.