By: Moritz (better.delete@this.not.tell), March 26, 2021 10:12 am
Room: Moderated Discussions
> if you could offer a CPU several times faster than anyone else's, people will notice.
I do not think that is possible. Maybe the same performance with less energy and fewer transistors, or somewhat faster in some cases, but never twice as fast in all cases. Some code is just sequential, and there is nothing one can do about it.
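For example (a minimal C sketch of my own, not from any real workload), a loop-carried recurrence like this is one long chain of dependent operations; no amount of issue width or extra EUs can break it:

    /* each iteration needs the previous result, so the loop's latency
     * is the serial latency of the whole multiply-add chain */
    double chain(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s = s * 1.000001 + a[i];
        return s;
    }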
> Dataflow is not a new idea. But other than the innards of large scale OoO implementations
That is what I am talking about.
> it doesn't work for real problems.
That is what the outer processor is for. Its job is to turn "real" code into dataflow at run-time, using information the compiler would not have had.
> Making that stuff statically explicit is hard. Really, really, really hard. Just consider IPF, and the vast resources thrown at compilers as part of that. While compilers can find some parallelism, most of it is, by experience, far too hard to find statically.
The compiler doesn't have to do it all in advance. It has to write a program that will finish the job at run-time. That is what the outer processor is for.
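To make that concrete, here is a toy sketch (entirely my own illustration; the three-operand instruction format and the register count are invented) of the kind of bookkeeping the outer processor would do at run-time: track the last writer of each register, so every instruction can be handed to the inner dataflow core together with its producers instead of as part of a sequential stream:

    #include <stdio.h>

    #define NREGS 16

    typedef struct { int dst, src1, src2; } Insn;

    int main(void) {
        /* a tiny "program": insn i writes prog[i].dst from two sources */
        Insn prog[] = { {3,1,2}, {4,3,1}, {5,2,2}, {6,4,5} };
        int n = sizeof prog / sizeof prog[0];
        int last_writer[NREGS];
        for (int r = 0; r < NREGS; r++) last_writer[r] = -1;

        for (int i = 0; i < n; i++) {
            int p1 = last_writer[prog[i].src1];  /* producer of src1, -1 if none */
            int p2 = last_writer[prog[i].src2];
            printf("insn %d waits on insns %d and %d\n", i, p1, p2);
            last_writer[prog[i].dst] = i;        /* this is also where renaming happens */
        }
        return 0;
    }

Here insns 0 and 2 are ready immediately, insn 1 waits on 0, and insn 3 waits on 1 and 2 - the dataflow graph falls out of one linear pass.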
> How does having what's essentially a second CPU to compute addresses really help?
It does more than just handle the long addresses: loads and stores, prefetching, scatter and gather, vectorization, loop unrolling/removal, *pointer removal*, context switches, increasing and decreasing the number of active threads, and run-time optimization.
It does everything it can to simplify the task and the data it feeds to the inner, dataflow-oriented processor, and to keep as many EUs/ports fed as possible with the available information. By preprocessing and dynamic/adaptive optimization at run-time, it makes finding ILP within an instruction window of a given size more likely.
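To pick out one item, *pointer removal* could look roughly like this (a hedged sketch; the Node type and both function names are mine): the outer processor chases the pointers ahead of time and hands the inner processor a dense buffer, so the inner loop contains no 64-bit pointers at all and vectorizes trivially:

    typedef struct Node { double val; struct Node *next; } Node;

    /* outer processor: walk the linked list, gather values densely */
    int gather(const Node *head, double *dense, int max) {
        int n = 0;
        for (const Node *p = head; p && n < max; p = p->next)
            dense[n++] = p->val;
        return n;
    }

    /* inner processor: a plain loop over dense data, no pointers left */
    double inner_sum(const double *dense, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += dense[i];
        return s;
    }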
> If nothing else, you've added the problems of communication between the main CPU and the addressing CPU.
True, but so what? I have (re)moved the problem of the inner processor having to look at hundreds of fine-grained instructions filled with 64-bit references to extract ILP; the outer processor does that instead, assisted by a compiler-generated, task-specific program for each thread.
> On die caching could be handled explicitly in a 32 bit address space.
> Scratchpads are a major PITA when you need a context switch. Caches actually handle that case quite well, and fairly automatically. Which is not to say that some fast on-die memory in a ccNUMA-ish configuration might not be useful - but that's fast main memory, still substantially slower than (most) cache speeds.
Right now the cache is addressed with values that are much longer than necessary, and the cache must map them and search those maps.
That complexity could be hidden from the executing instance if the outer processor were responsible for mapping between the inner and outer address spaces and providing the inner processor with short physical addresses into the processor's shared cache.
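A minimal sketch of what I mean (the sizes are arbitrary - an 8-bit handle and 256 resident lines - and none of this is a worked-out design): the outer processor owns the mapping from full 64-bit line addresses to short slot indexes, and the inner processor's loads and stores carry only a slot index plus an offset:

    #include <stdint.h>

    #define SLOTS 256                 /* 8-bit handle -> 256 resident lines */
    static uint64_t slot_addr[SLOTS]; /* full address backing each slot */
    static int      slot_used[SLOTS];

    /* outer processor: map a 64-bit line address to a short handle,
     * allocating a free slot if the line is not yet resident */
    int map_to_handle(uint64_t line_addr) {
        for (int i = 0; i < SLOTS; i++)
            if (slot_used[i] && slot_addr[i] == line_addr)
                return i;
        for (int i = 0; i < SLOTS; i++)
            if (!slot_used[i]) {
                slot_used[i] = 1;
                slot_addr[i] = line_addr;
                return i;
            }
        return -1;  /* table full: the outer processor must evict first */
    }

The inner processor then compares 8-bit handles instead of 64-bit tags; the wide comparisons and the eviction policy stay entirely on the outer side.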
I have not decided whether the inner execution processor should use the shared memory or should be streamed to. The idea is just that it doesn't have to deal with the DRAM at all.