LPC 2022 Accelerators BOF outcomes summary
At Linux Plumbers Conference 2022, we held a BoF session around accelerators.
This is a summary made from memory and notes taken by John Hubbard.
We started with defining categories of accelerator devices.
1. single shot data processors, submit one off jobs to a device. (simpler image processors)
2. single-user, single task offload devices (ML training devices)
3. multi-app devices (GPU, ML/inference execution engines)
One of the main points made is that common device frameworks are normally about targeting a common userspace (e.g. mesa for GPUs). Since a common userspace doesn't exist for accelerators, this presents a problem of what sort of common things can be targetted. Discussion about tensorflow, pytorch as being the userspace, but also camera image processing and OpenCL. OpenXLA was also named as a userspace API that might be of interest to use as a target for implementations.
There was a discussion on what to call the subsystem and where to place it in the tree. It was agreed that the drivers would likely need to use DRM subsystem functionality but having things live in drivers/gpu/drm would not be great. Moving things around now for current drivers is too hard to deal with for backports etc. Adding a new directory for accel drivers would be a good plan, even if they used the drm framework. There was a lot naming discussion, I think we landed on drivers/skynet or drivers/accel (Greg and I like skynet more).
We had a discussion about RAS (Reliability, Availability, Serviceability) which is how hardware is monitored in data centers. GPU and acceleration drivers for datacentre operations define a their own RAS interfaces that get plugged into monitoring systems. This seems like an area that could be standardised across drivers. Netlink was suggested as a possible solution for this area.
Namespacing for devices was brought up. I think the advice was if you ever think you are going to namespace something in the future, you should probably consider creating a namespace for it up front, as designing one in later might prove difficult to secure properly.
We should use the drm framework with another major number to avoid some of the pain points and lifetime issues other frameworks see.
There was discussion about who could drive this forward, and Oded Gabbay from Intel's Habana Labs team was the obvious and best placed person to move it forward, Oded said he would endeavor to make it happen.
This is mostly a summary of the notes, I think we have a fair idea on a path forward we just need to start bringing the pieces together upstream now.