By: rwessel (rwessel.delete@this.yahoo.com), March 21, 2021 11:23 am
Room: Moderated Discussions
Moritz (better.delete@this.not.tell) on March 21, 2021 9:45 am wrote:
> I read/understood so far:
>
> # Architecture independent intermediate representation that
> frees the HW from removing legacy HW specific constructs
> # Fewer implicit constraints by using an HLL that is about WHAT but not HOW.
> # Explicit "I do not care about ..." annotations for when the language does not allow otherwise
> # The architecture must not force the program to specify operations that do not
> generate output-results. These might be control flow or resource use related
> # Microthreading by short sequential blocks of instructions with delimiter
> # Low latency and granularity offloading to configurable long and wide data path from the last level cache.
>
> I was talking about a general purpose architecture that is not just like some historic non-Intel
> architecture and that is not as parallel and restricted as a GPU. I was not talking about highly
> speculative execution to process inherently serial code either. I was not talking about an architecture
> compatible with existing low level source code. If there is no inner-thread parallelism, near
> or far, then one can still make the processor more energy efficient.
>
> How would an engineer with zero knowledge of the past, but
> four billion transistors at his/her disposal do it today?
> Such a person would not even know the concept of a single core single thread RISC von-Neumann-architecture.
> Why even feed a single type of instruction stream / thread into the CPU?
It works. And it's not like people haven't been exploring radically different architectures since roughly the dawn of computing. Obviously inertia is an issue, but if you could offer a CPU several times faster than anyone else's, people would notice.
>The compiler could decompose the program
> into a front-end+back-end part that interfaces with the 64 bit memory and execution resources and a part that
> configures the execution resources to process the data that it gets fed by the first part.
Dataflow is not a new idea. But outside the innards of large-scale OoO implementations, it hasn't worked for real problems. The Mill isn't quite dataflow, but it certainly leans that way more than a bit; maybe those guys will end up succeeding.
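To make the dataflow firing rule concrete: an instruction executes as soon as all of its operands are available, with no program-counter ordering at all. A toy sketch (Python; the node list and names are purely illustrative, not any real machine's encoding):

```python
from operator import add, mul

# Each node: (result name, operation, operand names). The list is deliberately
# out of dependence order; execution order comes from data availability alone.
nodes = [
    ("t2", mul, ("t1", "b")),
    ("t1", add, ("a", "b")),
    ("out", add, ("t2", "a")),
]

def run(env):
    env = dict(env)
    pending = list(nodes)
    while pending:
        # Fire every node whose operands have all arrived, then rescan.
        ready = [n for n in pending if all(src in env for src in n[2])]
        for name, op, srcs in ready:
            env[name] = op(*(env[s] for s in srcs))
            pending.remove((name, op, srcs))
    return env

result = run({"a": 2, "b": 3})  # t1 = 5, t2 = 15, out = 17
```

Which is, in miniature, what an OoO scheduler's wakeup/select logic does implicitly, just restricted to a window of a few hundred instructions rather than a whole program.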
>Many tasks/functions
> that are handled in hardware today, and have to guess what is going on, could be made explicit.
Making that stuff statically explicit is hard. Really, really, really hard. Just consider IPF, and the vast resources thrown at compilers as part of that effort. While compilers can find some parallelism, experience shows most of it is far too hard to find statically. And what high-end ISA doesn't have prefetch instructions? Which, most of the time, make things worse.
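Part of why static extraction fails: whether two loop iterations are independent can hinge on runtime data the compiler never sees. A hypothetical illustration (the loop and index arrays are invented for the example):

```python
# Whether the loop "a[c[i]] = a[d[i]] + 1" may run its iterations in parallel
# depends on the runtime contents of c and d - exactly the information a
# purely static compiler analysis does not have.

def sequential(a, c, d):
    a = list(a)
    for i in range(len(c)):
        a[c[i]] = a[d[i]] + 1
    return a

def parallel(a, c, d):
    # All iterations read a snapshot, then write - correct only if no
    # iteration reads an element that another iteration writes.
    snapshot = list(a)
    a = list(a)
    for i in range(len(c)):
        a[c[i]] = snapshot[d[i]] + 1
    return a

a = [0, 0, 0]
# No aliasing (writes {0,1}, reads {2}): both versions agree.
assert sequential(a, [0, 1], [2, 2]) == parallel(a, [0, 1], [2, 2])
# Aliasing: iteration 1 reads a[0], which iteration 0 just wrote.
assert sequential(a, [0, 1], [2, 0]) != parallel(a, [0, 1], [2, 0])
```

The same loop is safe or unsafe to parallelize depending on data that only exists at run time, which is why dynamic hardware keeps winning here.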
>The inner (EU) part
> of the CPU would not have to compute 64 bit addresses if it did not have to know/handle where the data is coming
> from and going to.
How does having what's essentially a second CPU to compute addresses really help? If nothing else, you've added the problems of communication between the main CPU and the addressing CPU.
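For reference, the split being proposed is close to classic decoupled access/execute: an access unit runs ahead computing addresses and loading data into a queue, and the execute unit consumes operands without ever seeing an address. A toy sketch of that queue coupling (Python, purely illustrative; the "memory" and addresses are made up), which also hints at where the communication problem bites - the queue running dry or filling up:

```python
from collections import deque

# Fake 64-bit address space holding a small array of squares.
memory = {0x1000 + 8 * i: i * i for i in range(4)}

def access_unit(base, count, queue):
    # Knows about addresses; the execute unit never does.
    for i in range(count):
        queue.append(memory[base + 8 * i])

def execute_unit(queue, count):
    total = 0
    for _ in range(count):
        total += queue.popleft()  # a real machine stalls here if the queue is empty
    return total

q = deque()
access_unit(0x1000, 4, q)
result = execute_unit(q, 4)  # 0 + 1 + 4 + 9 = 14
```

In this sequential sketch the queue is always full when the execute unit looks; in hardware the two units run concurrently, and keeping them in step is precisely the added communication problem.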
>On die caching could be handled explicitly in a 32 bit address space.
Scratchpads are a major PITA when you need a context switch. Caches actually handle that case quite well, and fairly automatically. Which is not to say that some fast on-die memory in a ccNUMA-ish configuration might not be useful - but that's fast main memory, still substantially slower than (most) caches.
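To spell out the context-switch pain: a scratchpad's contents are architectural state, so the OS must copy them out and back on every switch, while a cache's contents are just a redundant copy of memory and need no saving at all. A minimal sketch (Python; the sizes and names are invented):

```python
SCRATCH_WORDS = 1024

class Task:
    def __init__(self):
        # Per-task copy of scratchpad contents, saved across switches.
        self.saved_scratchpad = [0] * SCRATCH_WORDS

# The one physical on-die scratchpad.
scratchpad = [0] * SCRATCH_WORDS

def context_switch(old, new):
    # Cost: 2 * SCRATCH_WORDS word copies on every switch, whether or not
    # either task actually used the scratchpad. A cache needs none of this.
    old.saved_scratchpad = scratchpad[:]
    scratchpad[:] = new.saved_scratchpad

t1, t2 = Task(), Task()
scratchpad[0] = 42          # t1 writes its scratchpad
context_switch(t1, t2)      # t2 must see a clean scratchpad
assert scratchpad[0] == 0
context_switch(t2, t1)      # t1's data has to come back
assert scratchpad[0] == 42
```

And that's the easy version - once the scratchpad gets big enough to be worth having, those copies start to dominate switch cost, which is exactly why transparent caches won.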