Separate the OoO-ness from speculative-ness

(To Linus: For God's sake, could you stop using those vulgarisms! Or are you inserting them into your texts on purpose to make your arguments appear stronger? I don't think it's working.)

It seems to me you two don't quite know what you are talking about. My point is that it is *not* about in-order vs. out-of-order, but rather about which OoO features the CPU exposes in the instruction set and in the documentation.

A non-speculative OoO CPU will always beat an in-order CPU, assuming the CPUs are otherwise identical (same number of execution units, same cache size, same memory bandwidth, etc.). NOTE that the theoretical OoO CPU I am describing here is *not* doing any *speculation* (except maybe the kinds which are also present in the in-order CPU). Once you have speculation, there necessarily exist some cases in which the CPU's prediction engine will make a wrong decision.

Even mere pipelining entails a small dose of speculative execution!!!

I will repeat it one more time, because it seems to me that you are completely missing this: Even mere pipelining entails a small dose of speculation!!!

What Ian Ollmann is criticizing is *not* the out-of-order-ness of the CPU but rather its speculative-ness. When he wrote "...something with a simpler execution engine, typically an in order machine [...] is far easier to take to close to peak performance..." he was simply wrong. The truth is that an assembly language programmer will be able to take an OoO CPU closer to peak performance (measured in IPC) than an in-order CPU. The reason is quite simple, but you are both missing it: a non-speculative OoO CPU is designed to exploit certain *runtime-only* information when deciding which instruction(s) can be executed. A non-speculative in-order CPU, on the other hand, is incapable of taking that runtime-only information into account.

The point is that the asm programmer does *not* have the runtime-only info at his/her disposal - which is quite obvious since the programmer is not running the program, the CPU is. The OoO CPU has knowledge (=truths) concerning certain pieces of the code that the programmer does not have.
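To make the runtime-only point concrete, here is a toy sketch in Python. It is my own illustration, not a model of any real CPU, and every name in it (`Instr`, `finish_cycle`, `make_program`) is made up for this post. The assumptions are: one instruction issued per cycle, each register written only once, no branches, and load latency (cache hit vs. miss) as the only thing unknown when the code is written. Neither machine speculates; the OoO one merely picks the oldest instruction whose inputs are already available.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written (assumed to be written only once)
    srcs: Tuple[str, ...]   # registers read
    latency: int            # cycles until dest is available (runtime-only for loads)


def finish_cycle(program, out_of_order, inputs=("p", "q")):
    """Cycle at which the last result becomes available (toy single-issue model)."""
    ready_at = {reg: 0 for reg in inputs}   # register -> cycle its value is ready
    pending = list(program)                 # not yet issued, in program order
    cycle = 0
    finish = 0
    while pending:
        # In-order may only consider the oldest instruction; OoO scans them all.
        window = pending if out_of_order else pending[:1]
        for i, ins in enumerate(window):
            # A source that has not been produced yet counts as "not ready".
            if all(ready_at.get(s, float("inf")) <= cycle for s in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, ready_at[ins.dest])
                del pending[i]
                break                       # at most one issue per cycle
        cycle += 1
    return finish


def make_program(lat_a, lat_c):
    # a = load [p]; b = a + 1; c = load [q]; d = c + 1
    return [
        Instr("a", ("p",), lat_a),
        Instr("b", ("a",), 1),
        Instr("c", ("q",), lat_c),
        Instr("d", ("c",), 1),
    ]


# First load misses (10 cycles), second load hits (1 cycle):
print(finish_cycle(make_program(10, 1), out_of_order=False))  # 13
print(finish_cycle(make_program(10, 1), out_of_order=True))   # 11
```

The programmer could of course hoist the second load by hand, but which load will miss is not known when the code is written, whereas the OoO machine sees it at run time. In this model the OoO machine is never slower than the in-order one, for any combination of latencies.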

The bad and confusing thing is that, in the x86 world, speculation and OoO are so mixed up with each other that one has no idea when the CPU is only speculating and when it is only executing out of order. It's so sad - the informal language we use to describe e.g. a Nehalem makes little distinction between those two things. But they *are* two completely different/separate/orthogonal things.

L1/L2/Ln caches obviously fall within the speculative category. Of course, if the cache has multiple ports, allowing e.g. two parallel reads per clock, then this parallelism can be used by an OoO engine (if it happens to know for a fact that it can do two independent reads). But aside from such parallelism, which is fully optional, the core idea of a cache is speculative-ness - in other words, it is impossible to design a cache which would not entail a speculative element.

On the other hand, I am claiming here that it is possible to design an OoO CPU which does not entail any speculative element whatsoever. More precisely, such a CPU would appear to never make any misprediction - in cases in which it does not know what to do, it simply waits for the results; it never speculates on them. Of course, this pure OoO CPU would be noticeably slower when executing "typical" code (e.g. Firefox, gzip, etc.) than the same CPU with speculation.
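A rough sketch of the trade-off between waiting and speculating, again in Python and again with made-up names and numbers. Assumptions: a branch condition takes `resolve_latency` cycles to become known, a wrong guess costs that latency plus a `flush_penalty`, and the predictor is the simplest one imaginable ("same as the last outcome"). Real predictors and real penalties are far more involved; this only counts stall cycles.

```python
def stalls_when_waiting(outcomes, resolve_latency):
    """Non-speculative: every branch waits for its condition to resolve.

    Returns (stall cycles, mispredictions); the second is 0 by construction.
    """
    return len(outcomes) * resolve_latency, 0


def stalls_when_predicting(outcomes, resolve_latency, flush_penalty):
    """Speculative: guess that each branch goes the same way as the previous one.

    A correct guess costs nothing in this toy model; a wrong guess costs the
    resolve latency plus a flush penalty (both are assumed numbers).
    """
    prediction = True          # assumed initial guess: taken
    stalls = 0
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
            stalls += resolve_latency + flush_penalty
        prediction = taken
    return stalls, mispredictions


loop_branch = [True] * 9 + [False]   # typical loop: taken 9 times, then exit
alternating = [True, False] * 5      # pattern that defeats this predictor

print(stalls_when_waiting(loop_branch, 3))        # (30, 0)
print(stalls_when_predicting(loop_branch, 3, 2))  # (5, 1)
print(stalls_when_predicting(alternating, 3, 2))  # (45, 9)
```

On the loop-like pattern the speculating machine wins easily, which is the "typical" code case. On the alternating pattern it loses to the machine that simply waits, and only the waiting machine has a misprediction count of zero in every case.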

(See Linus, I didn't use any vulgarism to make my argument. So behave yourself.)

