Scale up at Last?
Many industry insiders have commented that the K8 is eerily reminiscent of the ill-fated Alpha EV7. The EV7 augmented a high performance core with on-die directories for cache coherency and four interprocessor communication links operating at 6.4GB/s each. The EV7 also incorporated two memory controllers supporting eight channels of RDRAM, a total memory bandwidth of 12.8GB/s per processor.
Like the EV7, the K8 enhanced on a prior generation design; adding a memory controller, and three Hypertransport links. With three 8GB/s links, the K8 is an excellent choice for 1-8P servers. In theory, the K8 can scale up to 8 sockets; however, in practice it is extremely difficult. First, the only glueless 8 socket systems require multiple system boards; the Tyan Thunder K8QW uses 2 boards, while the Iwill H8502 uses 5 boards. Secondly, the snoop broadcast protocol used in the K8 ends up saturating the Hypertransport links. Third, using 8 sockets requires slightly more complicated system topologies that increase the number of hops between sockets and hence average memory latency. As a result, performance projections for commercial server workloads (OLTP in particular) show very poor gains (10-40% depending on which estimate) for glueless 8 socket systems over 4 socket systems.
AMD’s success with the Opteron for 4 socket systems, where they have roughly half the market, has prompted the architects to extend the K8L’s scalability a step further. The K8L will add an additional lane of 16 bit Hypertransport 3.0 to each device, providing 4 in total. Each link can run at up to 5.2GT/s, and can be split into two separate 8 bit links. So a single device could be configured with eight 8 bit Hypertranpsort links, instead of the regular four 16 bit links. Figure 1 below shows a fully connected system using split links.
Figure 2 – 4 and 8 Socket K8L System
Alternative configurations are also conceivable, but are beyond the scope of this article. Given these disclosures, the K8L will be somewhat more suitable to 8 socket systems, since it solves the topology and latency issues, although there is no disclosed solution for the snooping problem.
While AMD did not discuss the matter, we had initially hypothesized that the limit was still 8 nodes per system. It turns out that we were premature, and AMD has increased the number of nodes, although we do not know by how much. Any limitation is probably in the neighborhood of 16-64, both due to AMD’s partners, and technical constraints. AMD’s major partners: IBM, Sun and HP, all have highly scalable systems that use other architectures (PPC, SPARC and IPF, respectively). The notion of white box vendors selling high processor count systems would not sit well with any of those three, since the margins are far higher on their larger systems. Moreover, such a move could interfere with Newisys’ scalable Opteron servers. Lastly, scaling to above 16 sockets would require a significant investment of technical resources that would not improve, and could even detract from single device performance.
Ultimately, the 8 socket scaling for the K8L should substantially improve over the prior generation. Whether anyone will be willing to attempt glueless 16 sockets or more is certainly unclear, but it seems safe to say that 8 socket K8L systems will be quite compelling.
Discuss (18 comments)