By: Paul A. Clayton (paaronclayton.delete@this.gmail.com), January 30, 2013 12:42 pm
Room: Moderated Discussions
bakaneko (nyan.delete@this.hyan.wan) on January 30, 2013 1:32 am wrote:
> carop (carop.delete@this.somewhere.net) on January 29, 2013 2:55 pm wrote:
> >
> > http://www.wired.com/wiredenterprise/2013/01/facebook-arm-chips/
>
> Thanks, facebook saying they need only half a
> meg of cache, not 2 per core is an interesting
> data point.
(By the way, the article indicated performance degraded at 512KiB: "Speed didn’t degrade until they took the cache all the way down to a half megabyte." Whether 512KiB is actually a sweet spot is not clear; a mild degradation at 512KiB might be a worthwhile tradeoff for more computation capability.)
The method of measuring cache capacity benefits seems to be flawed. For a workload with a lot of streaming accesses, the indexing, allocation, and replacement choices of common caches would significantly reduce the benefits of larger caches. E.g., a 512KiB stream of data would replace 512KiB of cache content in an 8MiB 16-way cache, when the equivalent of an 8KiB stream buffer might suffice and allow hundreds of KiB of non-stream data/code to be retained. For some workloads, Least-Frequently-Used replacement works much better than LRU-based replacement. Tradeoffs in replication, migration, inclusion, etc. would also seem likely to vary among different workloads.
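To make the LRU/streaming interaction concrete, here is a minimal simulation of an 8MiB, 16-way, 64B-line LRU cache (my own sketch, not Facebook's methodology; the addresses and access pattern are illustrative assumptions):

from collections import OrderedDict

LINE = 64
WAYS = 16
CAPACITY = 8 * 1024 * 1024            # 8MiB, 16-way, 64B lines
SETS = CAPACITY // (LINE * WAYS)      # 8192 sets

# One OrderedDict per set, ordered from LRU to MRU.
cache = [OrderedDict() for _ in range(SETS)]

def access(addr):
    tag = addr // LINE
    s = cache[tag % SETS]
    if tag in s:
        s.move_to_end(tag)            # hit: refresh recency
        return True
    if len(s) >= WAYS:
        s.popitem(last=False)         # miss: evict the LRU way
    s[tag] = True
    return False

def resident(addr):
    tag = addr // LINE
    return tag in cache[tag % SETS]

# Fill the cache with an 8MiB "hot" working set (hypothetical reusable data).
hot = range(0, CAPACITY, LINE)
for a in hot:
    access(a)

# Stream 512KiB of never-reused data through the cache.
for a in range(1 << 30, (1 << 30) + 512 * 1024, LINE):
    access(a)

evicted = sum(not resident(a) for a in hot)
print(f"hot lines evicted by the 512KiB stream: {evicted} ({evicted * LINE // 1024} KiB)")

With plain LRU insertion the stream displaces its own footprint's worth of potentially reusable lines (512KiB here) even though it is never touched again; a small stream buffer or a stream-aware insertion policy would avoid nearly all of that displacement.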
The test also did not check whether there was a knee above 3MiB where data reuse jumped (such seems unlikely for that workload, but some workloads might have an increase in reuse at very high capacity).
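To illustrate what such a knee would look like (purely illustrative numbers, not measurements from the article), a cyclic reuse pattern over a hypothetical 4MiB working set shows essentially no reuse under LRU until the capacity covers the whole working set, then a sharp jump:

from collections import OrderedDict

LINE = 64
WORKING_SET = 4 * 1024 * 1024                  # hypothetical 4MiB reuse loop
lines = list(range(0, WORKING_SET, LINE))

def hit_rate(capacity_bytes, passes=4):
    cap_lines = capacity_bytes // LINE
    lru = OrderedDict()                        # fully associative LRU model
    hits = accesses = 0
    for _ in range(passes):
        for a in lines:
            accesses += 1
            if a in lru:
                hits += 1
                lru.move_to_end(a)
            else:
                if len(lru) >= cap_lines:
                    lru.popitem(last=False)    # evict least recently used
                lru[a] = True
    return hits / accesses

for mib in (0.5, 1, 2, 3, 4, 5):
    cap = int(mib * 1024 * 1024)
    print(f"{mib:>4} MiB -> hit rate {hit_rate(cap):.2f}")

Below 4MiB the LRU victim is always the cached line that will be reused soonest, so the hit rate is near zero; at 4MiB it jumps to the steady-state maximum. A measurement that stops at 3MiB would never see that cliff.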
(I am curious how they managed to disable portions of the cache. Although some high-end servers support selective block disabling as a standard feature, I was not aware of any that support disabling half, a third, or a sixth of the cache. Does Facebook have access to special features that are typically fused off? Restricting such to Facebook seems to be a disservice to others--e.g., academic researchers--who could benefit from access to such features.)
The tested Facebook workload may very well intrinsically not reuse memory contents with sufficient temporal locality to justify a largish cache. I am under the impression that the typical workload is very cluster-friendly (minimal and well-defined communication--also with an emphasis on software-based reliability) with relatively little data reuse (at least below GiB scales). However, there may still be advantages to large on-chip/in-package memories (e.g., dictionary-based compression might be used to reduce bandwidth requirements [keeping the dictionary in fast, high-bandwidth memory would be useful], and computation vs. communication/storage tradeoffs might favor larger active code footprints). If workload mixing were practical, a workload that benefited from more cache (or other resource) might share a chip/system with a workload that did not so benefit (assuming resource allocation could be handled somewhat intelligently).
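As a hedged illustration of the dictionary-compression idea (my own sketch using zlib's preset-dictionary support; the dictionary and payload below are made up, and a real deployment would train the shared dictionary on actual traffic and keep it resident in the fast memory):

import zlib

# Substrings that recur across many small payloads; this shared dictionary
# is the piece one would want resident in fast, high-bandwidth memory.
shared_dict = b'{"user_id":,"friends":[],"status":"","timestamp":}'

payload = b'{"user_id":12345,"friends":[2,3,5],"status":"hi","timestamp":1359561720}'

plain = zlib.compress(payload, 9)

c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=shared_dict)
with_dict = c.compress(payload) + c.flush()

print(f"raw {len(payload)}B, zlib {len(plain)}B, zlib+dict {len(with_dict)}B")

# The receiver needs the same dictionary to decompress.
d = zlib.decompressobj(zlib.MAX_WBITS, zdict=shared_dict)
assert d.decompress(with_dict) + d.flush() == payload

For small, schema-heavy payloads the preset dictionary lets the compressor reference common substrings it has not yet seen in the stream, which is where the bandwidth savings would come from.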
It seems that it would be possible for a single chip to have substantial configurability (including scratchpad and/or software-managed cache) such that a largish memory capacity could still be useful for a broad range of workloads. Whether the economics of such a (somewhat) more general-purpose design would overcome the benefits of specialization for the potentially high volume of that subset of cloud workloads is far beyond my ability to even reasonably guess.
I would like to see greater innovation in chip design, but I am skeptical that smaller caches and more processing elements are necessarily a good fit (though the UltraSPARC T series might just point to how ignorant I am in this regard). (Of course, the tradeoffs could change if dense persistent logic-compatible memory becomes practical.)