Last week I attended Supercomputing 19 on behalf of MLPerf to speak at a panel. I had a fantastic time, met many friends, and learned a lot. Supercomputing is an unusual conference – it’s definitely a trade show, but also includes real research talks and poster sessions. I suppose the most similar conference I have attended is SIGGRAPH. I wanted to share a few of my thoughts in a more informal ‘blog’-style write-up. Since I was attending in my official MLPerf capacity as chair of the inference and power measurement working groups, I will strive to keep my comments about machine learning companies in the realm of facts, rather than opinion.
For me, SC19 was about the fusion of machine learning (ML) and scientific computing/HPC. I closely track machine learning, and based on my work there are at least 20 companies developing processors for ML training, and well over 100 pursuing the larger and more diverse inference market. The number of companies developing systems based on these processors is probably 2-4X greater. There is a tremendous overlap between these ML processors and HPC systems; I’ve often said that ML training is a particular flavor of HPC, and even some inference tasks resemble HPC workloads. Several of these companies had keynotes or announcements at SC, including a keynote from Nvidia and announcements from Graphcore and Cerebras Systems.
The first event I attended was a keynote presentation by Jensen Huang of Nvidia. It was a 2-hour talk that I had mostly seen before, emphasizing the role of machine learning in society with a focus on scientific computing. There were a few nuggets of news that are worth highlighting. First, Nvidia’s V100 GPU was already capable of doing RDMA over InfiniBand, bypassing the host processor and host memory. This approach saves power and frees up host CPU cycles, memory bandwidth, and PCIe bandwidth. Nvidia has a preview feature (first demonstrated at GTC in March) that extends this capability to remote storage accessed via RDMA over InfiniBand. For example, in a supercomputer with a parallel filesystem, the GPUs can now access that storage with less host overhead, and Nvidia demonstrated a scenario where this boosts performance by 5X. An excellent blog post by Adam Johnson and CJ Newburn dives into more details of the accelerated remote storage.
Nvidia also announced beta support for AMD and ARM-based processors in their toolchain and software stack. There are a number of ARM-based processors that could be attractive in HPC: the Marvell ThunderX2, Fujitsu’s A64FX, and Huawei’s Kunpeng 920 (probably only in China!). The A64FX in particular was explicitly designed for HPC – it uses HBM2 and has a whopping 1TB/s of memory bandwidth (albeit with only 32GB of capacity!). The only downside is the relatively low PCIe bandwidth – a single PCIe 3.0 x16 link may not be enough for a good HPC fabric in 2021. The integrated TofuD fabric is quite good, but it is unclear if that is available to third parties.
Of personal interest to me, Jensen also lauded MLPerf Inference for our recently released set of machine learning benchmarks and results. I was one of the leads on this project, and it was lovely to hear the whole team called out in public for our good work. MLPerf Inference v0.5 is a great start and we look forward to improving future versions. The full details of the benchmark are available in our arXiv paper, “MLPerf Inference Benchmark”. Naturally, Jensen also acknowledged the Nvidia team for delivering a huge number of results across many of the benchmarks.
Another topic Jensen touched on was using machine learning to complement simulation in scientific computing. For example, there was a recent paper, “Newton vs the machine: solving the chaotic three-body problem using deep neural networks”, that uses simulation to train a neural network that can then replace a numerical solver. I haven’t read the paper completely, but the authors claim a 100 million times speedup for the neural network compared to the solver, and the training time appears to be less than two weeks. While this is a promising result, I don’t know how well it would generalize to other areas. The 3-body problem has been extensively studied and is far simpler than real-world scenarios, which would require far more data (and more diverse data) to produce good models.
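To put the trade-off between training cost and per-query speedup in perspective, here is a rough back-of-the-envelope sketch in Python. The two-week training time and the 100 million times speedup are the figures quoted above. The 60-second cost of one numerical solve and the `data_gen_solves` parameter are purely hypothetical placeholders of mine, not numbers from the paper.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def break_even_solves(train_seconds, solver_seconds, speedup, data_gen_solves=0):
    """Return how many queries it takes for a trained surrogate to pay off.

    Up-front cost = training time + solver runs used to generate training data.
    Each later query saves (solver time - surrogate time).
    """
    surrogate_seconds = solver_seconds / speedup
    saved_per_solve = solver_seconds - surrogate_seconds
    upfront_seconds = train_seconds + data_gen_solves * solver_seconds
    return math.ceil(upfront_seconds / saved_per_solve)


# From the text: about two weeks of training, about 1e8 speedup.
# HYPOTHETICAL placeholder: one numerical solve takes 60 seconds.
n = break_even_solves(
    train_seconds=14 * SECONDS_PER_DAY,
    solver_seconds=60.0,
    speedup=1e8,
)
print(f"Surrogate breaks even after roughly {n:,} solves")
```

With a speedup this large, the per-query cost of the network is negligible, so the break-even point is set almost entirely by the up-front cost (training plus generating the training data) divided by the cost of one conventional solve.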
Graphcore released a set of training and inference benchmarks comparing their Colossus IPU against Nvidia GPUs earlier this month and announced availability through Dell and Microsoft’s Azure cloud. The IPU is architecturally quite different from Nvidia’s GPUs – it is an all-SRAM architecture with about 1200 cores and a crossbar fabric between all cores. Additionally, it has high-speed serdes for communication between multiple IPUs and a PCIe 4.0 interface to the host processor. Based on Graphcore’s benchmarks, the IPU performs quite well on workloads with certain types of sparse data.
Cerebras is quite a different story, and one that I’ve been familiar with for the last several years. They are the first company to successfully demonstrate wafer-scale integration – a monumental technical achievement that was first attempted in the 1970s. The processor was first publicly described at Hot Chips earlier this year. It is an array of roughly 400K custom-designed cores with local memory, connected by a mesh fabric. In aggregate the processor offers 18GB of SRAM and 100Pbit/s of bandwidth across the fabric. At SC, the company described their software stack – which supports TensorFlow and PyTorch – and announced that Argonne National Laboratory has installed several systems.
The main reason I was at SC19 was to speak on a panel about benchmarking ML for HPC. Our host and organizer was Murali Emani of Argonne National Laboratory, and I was joined by Steve Farrell of Lawrence Berkeley National Laboratory, Sergey Serebryakov of HPE, Natalia Vassilieva of Cerebras Systems, and Samuel Jackson from the Science and Technology Facilities Council in the UK. We each spoke for 5-10 minutes and then had a joint panel and Q&A session. My talk was a quick presentation of the MLPerf Training benchmark and then a discussion of how to adapt it to HPC. Each of the other speakers had fantastic insights into HPC and I learned a great deal from their presentations and the audience questions. Overall, I think we concluded that ML benchmarking for HPC is incredibly hard and we are just starting to scratch the surface. Here are a few of my observations, in no particular order:
- Benchmarking with storage is hard and expensive (e.g., TPC-C costs millions to do because of storage requirements), but real HPC requires real storage performance. How do we manage this balance?
- While general ML tasks such as recommendation engines generalize well, it is unclear if HPC-focused ML problems do. Some HPC problems are highly specific.
- Customers and system designers often want to augment ML benchmarks that measure performance with performance analysis metrics. E.g., if a convolutional kernel is running on a set of dedicated cores, what is the utilization of those cores? What is the memory consumption during run time? Natalia presented a good view of how Cerebras thinks about performance optimization for their system.
- Power measurement is vital, but incredibly complicated. As one example, for many supercomputers the efficiency of the overall facilities varies considerably. Should that be incorporated? How do we account for the fact that data centers near the North Pole can sell their heated air, whereas it is ‘waste heat’ in Arizona?
- There are many HPC data sets, and data sets are necessary (but not sufficient) for benchmark creation, but it’s unclear if anyone has catalogued them. Cataloguing them was proposed as an activity for the MLPerf HPC group, and I think that’s a great idea.
- Sergey presented a particularly useful taxonomy of HPC and the overlap with commercial applications. I’m not as deeply involved in the HPC world, and the overview of the landscape was really helpful. He also brought up another point – scientific computing is often a building block for commercial applications.
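To make the power-measurement question from the list above a bit more concrete, here is a small Python sketch of how including facility overhead can change an efficiency comparison. It uses PUE (power usage effectiveness, i.e. total facility energy divided by IT equipment energy) as the facility metric. The two sites and all of their numbers are hypothetical, chosen only to illustrate the effect, and this is not how MLPerf measures power.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    samples: float           # work completed during the benchmark run
    it_energy_joules: float  # energy measured at the system itself
    pue: float               # facility PUE = total facility energy / IT energy


def samples_per_joule(run, include_facility):
    """Efficiency of a run, optionally scaling IT energy by the facility PUE."""
    energy = run.it_energy_joules * (run.pue if include_facility else 1.0)
    return run.samples / energy


# HYPOTHETICAL sites: A has the more efficient system, B the more efficient facility.
site_a = Run("site A", samples=1.0e6, it_energy_joules=5.0e5, pue=1.6)
site_b = Run("site B", samples=1.0e6, it_energy_joules=5.5e5, pue=1.1)

for include_facility in (False, True):
    best = max((site_a, site_b), key=lambda r: samples_per_joule(r, include_facility))
    label = "facility-level" if include_facility else "system-level"
    print(f"{label}: {best.name} is more efficient")
```

In this made-up example the ranking flips once facility overhead is counted, which is exactly why the choice of measurement boundary matters. Accounting for reused heat would need a further term that this sketch leaves out.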