What sort of TPC-C numbers are you expecting for the 8/16/32P configurations? Higher than an HP Superdome, but less than IBM Power 5 servers?
Regarding TPC-C, you can check out slide #13 of the Hot Chips presentation on our scaling of OLTP-type applications. These are our performance projections. Unfortunately, I can’t give you actual TPC-C numbers since we don’t have the chips back yet from TSMC. However, you should be able to find TPC-C numbers for Opteron systems and make inferences from there.
Figure 1 – OLTP Scaling for HORUS with Opteron
If I read the presentation right, your 8-way is going to be worse than a glueless 8-way with a scaling factor of about 4?
I think you are referring to slide 13 of my Hot Chips presentation (see Figure 1). You will notice the relative scaling between eight single-core sockets and a two-quad HORUS system. Our performance simulations show that we are almost neck and neck. From the beginning, our performance target was for a two-quad 8-way to be no worse than, and ideally better than, hooking up 8 Opterons together. But we pull ahead significantly when we go to dual core.
Wait a second. Have you had a chance to look into the odd shape of that curve on slide 13? That is, scaling from 8P to 16P was better than from 4P to 8P?
Yes, that’s right. At a lower number of outstanding transactions, the added latency of HORUS has an impact. HORUS is built to support higher bandwidth, which is why we scale better as the number of outstanding transactions increases.
In a discussion with Rich Oehler (and I believe the email was forwarded to you), Mr. Oehler stated that loaded latency for two quads using HORUS is better than the latency of a glueless 8-way system. My question is whether this is solely due to the DIR/RDC of HORUS, or does HORUS also have a strategy to ensure a *stable*, loaded latency? That is to say, with many systems, as throughput approaches a maximum, latency explodes almost exponentially. Does HORUS have mechanisms to ensure a stable, maximum latency?
If you refer to slide 21 of my Hot Chips presentation, you will see that an eight-way system has two pairs of remote links for connections. So we believe that from a raw bandwidth point of view (without RDC and DIR) we have sufficient resources there. There is a maximum limit to how many outstanding transactions an Opteron can support at a time, but most commercial applications don’t come close to that limit. So we believe we have enough raw bandwidth to handle all outstanding transactions in an eight-way, even without RDC and DIR. With RDC and DIR, we obviously will do much better. Of course, there isn’t anything preventing someone from writing a simple loop that will saturate the coherent HT links, but currently we don’t see that kind of usage of coherent HT links in commercial applications.
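The questioner’s point about latency "exploding" near maximum throughput is the classic queueing effect, and it explains why headroom in raw link bandwidth keeps loaded latency stable. A minimal sketch of the shape of that curve, using a simple M/M/1 queueing model with purely hypothetical numbers (not from the interview or from Newisys data):

```python
# Illustrative M/M/1 sketch: mean latency W = S / (1 - rho), where S is the
# per-transaction service time and rho is link utilization. As rho -> 1,
# latency grows without bound; keeping utilization low keeps latency stable.
# SERVICE_NS is a hypothetical service time chosen only to show the curve.

def mm1_latency(service_time_ns: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue at the given utilization."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ns / (1.0 - utilization)

SERVICE_NS = 50.0  # hypothetical per-transaction service time on one link
for rho in (0.10, 0.50, 0.80, 0.90, 0.99):
    print(f"rho={rho:.2f}  latency={mm1_latency(SERVICE_NS, rho):7.1f} ns")
```

At 50% utilization latency only doubles, but at 99% it is 100x the unloaded service time, which is the "exponential" blow-up the question describes and why provisioning two pairs of remote links matters.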
According to your Hot Chips presentation, the RDC significantly improves performance (~5x). Are there exact figures for the latency of the RDC, order of magnitude figures?
Yes. We have very accurate modeling of latencies and resources in our performance model. We see very low latency when we hit in the RDC. AMD has also provided some support in the Opteron that enables transactions to be completed earlier when we hit in the RDC.
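The benefit of a remote data cache like the RDC can be framed as a simple weighted average of hit and miss latencies. A sketch with entirely hypothetical numbers (the interview gives no actual RDC latency figures):

```python
# Illustrative sketch (hypothetical latencies, not Newisys data): average
# remote-access latency as a function of the RDC hit rate. A hit is served
# locally by HORUS; a miss requires a full fetch from the remote quad.

def avg_remote_latency(hit_rate: float, hit_ns: float, miss_ns: float) -> float:
    """Weighted average: hit_rate * hit latency + (1 - hit_rate) * miss latency."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

# Hypothetical: 80 ns on an RDC hit vs. 400 ns for a full remote fetch.
print(avg_remote_latency(0.0, 80.0, 400.0))  # no RDC benefit
print(avg_remote_latency(0.8, 80.0, 400.0))  # with an 80% hit rate
```

Under these assumed numbers, an 80% hit rate cuts average remote latency from 400 ns to 144 ns, which is the kind of effect behind the ~5x figure cited from the presentation.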