Details on NVIDIA BlueField-3 Programmable Datapath Accelerator?

By: Paul A. Clayton, October 12, 2021 12:17 pm
Room: Moderated Discussions
The "datasheet" for BlueField-3 only states:

Programmable Datapath Accelerator

  • 16 cores, 256 threads

  • Programmability through DOCA

  • Heavy multi-threading applications acceleration

I have not looked at the documentation for DOCA, but I suspect that, like CUDA for GPUs, it only provides hints about the architecture and microarchitecture (exposing only those design specifics that can be presented somewhat portably). I have not watched the Hot Chips 33 presentation "NVIDIA Data Center Processing Unit (DPU) Architecture", though I suspect it did not provide many details (based on other coverage of that presentation).

I would guess that these cores have some similarity to the 'general-purpose' processors in network processors, using multithreading and/or very simple cores to reduce the throughput impact of latency events.
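As a rough illustration of why multithreading reduces the throughput impact of latency events, here is a minimal sketch of the usual back-of-the-envelope model. It assumes each thread alternates a fixed run of compute cycles with a fixed stall, and that switching threads is free. The function name and the cycle counts are made up for illustration and say nothing about the actual BlueField-3 design.

```python
def core_utilization(threads: int, compute_cycles: float, stall_cycles: float) -> float:
    """Fraction of cycles a core spends on useful work.

    Assumed model (not from any NVIDIA documentation): every thread runs
    for compute_cycles, then waits stall_cycles on a latency event, and
    the core can switch to another ready thread at no cost.
    """
    if threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("need threads >= 1, compute_cycles > 0, stall_cycles >= 0")
    # One thread keeps the core busy for compute/(compute + stall) of the time;
    # additional threads fill the stall time until the core saturates.
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))


if __name__ == "__main__":
    # Hypothetical numbers: 20 cycles of work per 300-cycle stall.
    for n in (1, 4, 16):
        print(n, "threads ->", core_utilization(n, 20, 300))
```

With these invented numbers a single thread keeps the core busy only about 6% of the time, while 16 threads are enough to cover the stall completely.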

The seemingly fixed 256 threads per 16 cores (16 threads per core) seems to imply fixed resource allocation in terms of "register" storage, unlike GPUs, which can exploit context size variation. This also seems to imply that core-internal context sharing is not provided. (I doubt such sharing would be particularly useful; sharable circular buffers might be more useful, providing low-energy, low-latency communication with fewer timing constraints than more limited-capacity core-internal storage, though even just shared scratchpad/tightly-coupled memory might serve well enough.)
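To make the contrast concrete, the sketch below compares a fixed per-thread partition of a register file with GPU-style allocation, where the number of resident threads depends on how many registers each thread needs. Only the 16-core and 256-thread figures come from the datasheet; the register file size and the function names are hypothetical.

```python
CORES = 16      # from the datasheet
THREADS = 256   # from the datasheet
THREADS_PER_CORE = THREADS // CORES  # 16 hardware threads per core


def fixed_context_size(regfile_entries: int, hw_threads: int) -> int:
    """Registers each thread gets when the file is split evenly and statically."""
    return regfile_entries // hw_threads


def gpu_style_resident_threads(regfile_entries: int, regs_per_thread: int,
                               max_threads: int) -> int:
    """Threads that fit when each thread takes only the registers it needs."""
    return min(max_threads, regfile_entries // regs_per_thread)


if __name__ == "__main__":
    REGFILE = 512  # hypothetical per-core register file size, for illustration only
    print("fixed:", THREADS_PER_CORE, "threads x",
          fixed_context_size(REGFILE, THREADS_PER_CORE), "registers each")
    for need in (16, 32, 64):
        print("GPU-style,", need, "registers/thread ->",
              gpu_style_resident_threads(REGFILE, need, max_threads=64),
              "resident threads")
```

Under the fixed scheme the thread count never changes with context size, which is what a constant 16 threads per core would suggest.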

The network topology connecting the Programmable Datapath Accelerators presents optimization opportunities. Rings would seem to fit very well with some types of pipelined processing, but the processing demand for different stages might not be well-balanced (one could send data to more than one processor and/or prioritize thread activity, if lightweight or non-critical threads are available to be scheduled at very low priority). One might also use a mixture of topologies at each node (e.g., a lower-bandwidth crossbar connecting all cores, with most traffic going over a single- or bi-directional [possibly asymmetric] ring) or for connecting groups of nodes (e.g., the four cores at the inward corners of four core clusters might communicate through a crossbar, while a ring connects the cores within a cluster and the "outer" cores form a ring). Manufacturing considerations (being able to have spares) would seem to constrain topology, but I suspect some non-uniformity might still fit with having spare cores.
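The stage-imbalance concern can be sketched with a simple greedy allocation: give every pipeline stage one core, then keep handing a spare core to whichever stage is currently the bottleneck, on the assumption that a stage's work can be split evenly across its cores. The stage costs and function names here are invented for illustration, and the model ignores the interconnect entirely.

```python
from typing import List


def allocate_cores(stage_costs: List[float], total_cores: int) -> List[int]:
    """Greedy heuristic: repeatedly add a core to the slowest stage."""
    if total_cores < len(stage_costs):
        raise ValueError("need at least one core per stage")
    cores = [1] * len(stage_costs)
    for _ in range(total_cores - len(stage_costs)):
        # The bottleneck is the stage with the largest per-core cost.
        slowest = max(range(len(stage_costs)),
                      key=lambda i: stage_costs[i] / cores[i])
        cores[slowest] += 1
    return cores


def bottleneck_time(stage_costs: List[float], cores: List[int]) -> float:
    """Per-item time of the slowest stage, which bounds pipeline throughput."""
    return max(cost / n for cost, n in zip(stage_costs, cores))


if __name__ == "__main__":
    costs = [100, 400, 200, 100]  # hypothetical cycles per packet for each stage
    plan = allocate_cores(costs, total_cores=16)
    print(plan, bottleneck_time(costs, plan))
```

With one core per stage the 400-cycle stage would limit throughput; spreading 16 cores as computed above evens out the per-core cost, at the price of sending data to more than one processor per stage.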