150 GFLOP/s measured?

Article: PhysX87: Software Deficiency
By: gallier2 (gallier2.delete@this.gmx.de), July 23, 2010 4:58 am
Room: Moderated Discussions
a reader (a@b.c) on 7/22/10 wrote:
>anon (anon@anon.com) on 7/22/10 wrote:
>>Linus Torvalds (torvalds@linux-foundation.org) on 7/21/10 wrote:
>>>a reader (a@b.c) on 7/21/10 wrote:
>>>>odd things with store buffer replay? like what?
>>>>are you sure it's not just L1D too small?
>>>I think there were a few other cases of nasty replays too,
>>>and they really end up depending very subtly about just
>>>which cycle the different micro-ops got scheduled in. When
>>>code works well, the P4 runs like a greased bat out of
>>>hell, and then some very non-obvious things can make the
>>>almost identical code take replay traps all the time and
>>>just come to a crawling halt.
>>What made the P4 run so fast? Was it the high clock and ALU frequency?
>>I wonder if we'll see a return of the trace cache to improve the apparent efficiency
>>and/or wideness of the x86 fetch and decoder? Preferably with the normal L1I still
>>intact. I guess the loop detector is basically that, and probably will expand to handle more cases.
>p4 was designed for good micro-benchmark performance.
>GHz matter there.
>why do you want trace cache back? it worked in some
>academic cases. but it was probably a mistake to use
>it in production design.
>it's not easy to pinpoint why p4 performs so badly on real code. but Linus' explanation about the replays seems to be
>quite on the mark. it's possible that intel added some
>early rule-of-thumb checks on the replay path, which
>gave too many false positives, that would lead to excessive
>and unpredictable replays, hence the bad performance.
This article on X-bit labs explains the replay loops of the P4.
It explains why, when it breaks, it breaks down hard: instructions that have already been scheduled get in each other's way. On a normal pipelined architecture a stall only slows execution down, and each instruction uses the resources it needs exactly once. In the replay loop, an instruction may reuse the same resource over and over again, blocking that resource for other instructions, which then get trapped in the loop themselves. I guess the cost grows polynomially, i.e. the more instructions hit hazards, the more hazards are created for subsequent instructions.
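
To make the difference concrete, here is a toy Python sketch. It is my own simplification, not a model of the real P4 scheduler: the single execution port, the 4-cycle loop length (`loop_len`), the rule that replayed instructions get priority over fresh ones, and the function names are all assumptions made for illustration. It only counts how many port slots get consumed when instructions wait on data that becomes available at a given cycle.

```python
from collections import deque

def stall_model(ready_times):
    """Classic pipeline: an instruction waits until its operand is ready,
    then takes the execution port exactly once."""
    cycle = 0
    port_uses = 0
    for ready in ready_times:
        cycle = max(cycle, ready)  # stall; the port is not consumed while waiting
        cycle += 1                 # execute once
        port_uses += 1
    return cycle, port_uses

def replay_model(ready_times, loop_len=4):
    """Hypothetical replay loop: an instruction is issued speculatively and
    takes a port slot; if its operand is not ready yet, it comes back
    loop_len cycles later and takes a port slot again."""
    fresh = deque(ready_times)  # instructions not issued yet
    replay = deque()            # (cycle it returns to the port, operand ready time)
    cycle = 0
    port_uses = 0
    while fresh or replay:
        if replay and replay[0][0] <= cycle:
            _, ready = replay.popleft()   # replayed instruction has priority
        elif fresh:
            ready = fresh.popleft()       # otherwise issue a fresh instruction
        else:
            cycle = replay[0][0]          # port idles until the next replay returns
            continue
        port_uses += 1                    # every pass through the port costs a slot
        if ready > cycle:
            replay.append((cycle + loop_len, ready))  # operand missing: go around again
        cycle += 1
    return cycle, port_uses

if __name__ == "__main__":
    waiting = [20] * 8  # eight instructions whose data arrives at cycle 20
    print("stall :", stall_model(waiting))
    print("replay:", replay_model(waiting))
```

With eight instructions waiting on data that arrives at cycle 20, the stall model uses the port 8 times, while in this toy replay model the first four instructions circulate and occupy the port on every single cycle until the data shows up, and the remaining four cannot even issue before that. Real hardware is far more complicated, but it shows how the replayed instructions eat the slots that other instructions would need.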
