By: David Kanter (dkanter.delete@this.realworldtech.com), January 27, 2017 6:46 am
Room: Moderated Discussions
wumpus (lost.delete@this.in-a.cave.net) on January 26, 2017 3:25 pm wrote:
> David Kanter (dkanter.delete@this.realworldtech.com) on January 26, 2017 7:33 am wrote:
> > Gabriele Svelto (gabriele.svelto.delete@this.gmail.com) on January 26, 2017 3:15 am wrote:
> > > David Kanter (dkanter.delete@this.realworldtech.com) on January 25, 2017 6:10 pm wrote:
> > > > SATA controllers aren't exactly magical or high value. PCIe is often
> > > > quite tricky, especially new versions. 10GbE is pretty easy today.
> > >
> > > Plain 10GbE might be easy, but there's quite a bit of difference between a fully featured 10GbE
> > > NIC - with proper DCB, virtualization support, offloads, traffic steering and possibly a form
> > > of RDMA support such as iWARP or RoCE - and a baseline implementation. Additionally if you're
> > > integrating the controller on your SoC you probably want it to be able to tap into the processor
> > > caches directly with all the associated requirements on the interconnect.
> > >
> > > In short, I'm sure that IP for a 10GbE implementation is readily available, I'm
> > > not sure if it's on par with what's required for a proper server-side deployment.
> >
> > That's very much true, so thank you for pointing that out. To be a bit more explicit,
> > those are all features that are not useful in phones or client devices. So yet again,
> > the "we get high volumes from phones" fails to carryover into the data center.
> >
> > Also, a lot of that requires different cache controllers that are more intelligent than normal.
> >
> > David
>
> Isn't Intel the biggest server CPU supplier by both revenue and volume? It seems pretty weird
> to claim they aren't "server cores".
I think you are responding to an earlier post. Let me clarify what I mean.
Haswell is used in both client and server SoCs, so it is a compromise between the two. Given the client volume (200-300M units/year), Intel cannot add features that consume a lot of area or power for servers unless they also benefit clients.
OTOH, IBM's POWER and z/Architecture lines are designed solely for servers, so I think it is instructive to consider some of the uarch differences. Oracle and Fujitsu are in a similar situation, but we can leave them aside for now.
The TLB is a good example. It is larger in servers to deal with larger data footprints, but big TLBs are also power-hungry and costly in area.
POWER8 has more than 1K L2 TLB entries. The z196 (which is old; the z12 and z13 are newer) has 1.5K L2 TLB entries, see http://www.realworldtech.com/z196-mainframe/6/
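As a rough back-of-the-envelope, TLB reach is just entries times page size: 1.5K entries mapping 4KB pages cover only about 6MB, which is tiny next to a multi-GB server working set, while the same 1.5K entries backed by 1GB pages would cover roughly 1.5TB.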
Large page support is important for servers, but not so much for clients. For example, 1GB x86 pages are very good for virtualization and databases, but not for most client workloads.
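To make that concrete, here is a minimal Linux-specific sketch (just an illustration, assuming 1GB pages have been reserved at boot, e.g. hugepagesz=1G hugepages=1, and the CPU reports pdpe1gb) that asks mmap() for a single 1GB page:

/* gcc -o huge1g huge1g.c -- sketch only; needs Linux with 1GB pages reserved */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26                    /* from the Linux uapi headers */
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1GB) encoded in the mmap flags */
#endif

int main(void)
{
    size_t len = 1UL << 30;                  /* one 1GB page */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(1GB page)");            /* fails if no 1GB pages are reserved */
        return 1;
    }
    ((char *)p)[0] = 1;                      /* touching it faults in the single 1GB page */
    printf("1GB mapping at %p backed by one TLB entry\n", p);
    munmap(p, len);
    return 0;
}

The whole gigabyte then sits behind one TLB entry instead of 262,144 4KB entries, which is exactly the sort of thing a database or hypervisor cares about and a phone does not.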
The cache hierarchy is another example. The L2 caches in the latest z13 are simply massive: the L2I and L2D are each 2MB, with the arrays implemented in eDRAM. POWER8 has a smaller L2, but an L3 whose local portion is fairly large (8MB).
These large caches reduce latency and also avoid injecting traffic on the fabric, thereby saving power.
> Even then I'd believe it to be pretty minimal for a server
> (although don't underestimate the ability of fast single-threaded code to do well under situations
> where Amdahl's law is enforced more rigorously than you might expect). It would take a surprisingly
> strong core to bridge the gap that Intel so far can't bridge.
>
> I'm easily convinced that the design practices need to be too far apart to share much in either.
That's the big point I'm making. If we consider the overall server chip, the core can be shared (Intel has proven that), but other components cannot be shared. So there is still a lot of unique work. Maybe I should turn this into an article.
David