One of my pet peeves around running local LLMs and inferencing is the sheer mountain of shit^W^W^W complexity of compute stacks needed to run any of this stuff in a mostly optimal way on a given piece of hardware. CUDA, ROCm, and Intel oneAPI all, to my mind, scream over-engineering on a massive scale, at least for a single task like inferencing. The combination of closed source, over-the-wall open source, and open source that is insurmountable for anyone outside the vendor to support or fix screams that there has to be a simpler way. Combine that with the PyTorch ecosystem and the insanity of deploying Python, and I come a bit unstuck. What can be done about it? llama.cpp seems to me like the best answer to the problem at present (a Rust version would be my personal preference, but you can't have everything). I like how ramalama wraps llama.cpp to provide a sane container interface, but I'd like to eventually get to the point where container complexity for a GPU compute stack isn't really ...
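For what it's worth, the appeal of llama.cpp here is that the whole GPU compute stack collapses to one cmake flag plus a working Vulkan driver. A rough sketch, not gospel: the flag was spelled LLAMA_VULKAN in older releases, the binary used to be called main rather than llama-cli, and model.gguf is just a placeholder for whatever model you have locally.

    # build llama.cpp with the Vulkan backend (needs the Vulkan headers and glslc; roughly libvulkan-dev + glslc on Ubuntu)
    cmake -B build -DGGML_VULKAN=ON
    cmake --build build --config Release -j
    # run a GGUF model with all layers offloaded to the GPU
    ./build/bin/llama-cli -m model.gguf -ngl 99 -p "hello"

No CUDA, no ROCm, no oneAPI: the same binary runs on anything with a conformant Vulkan driver, which is exactly the simplification I'm after.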
You can get working Vulkan drivers on Ubuntu with this PPA:
https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers/
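Roughly, assuming the usual PPA workflow (the PPA ships newer Mesa builds, which include radv; vulkan-tools is only needed for the sanity check):

    sudo add-apt-repository ppa:oibaf/graphics-drivers
    sudo apt update && sudo apt upgrade
    # sanity check: vulkaninfo comes from the vulkan-tools package
    vulkaninfo --summary

If vulkaninfo lists your GPU with the radv driver, llama.cpp's Vulkan backend should pick it up.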
Do they know? https://bugs.launchpad.net/ubuntu/+source/mesa has only one bug mentioning radv, and it's not about radv.
This is the Ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/1720890
Yes, they do: https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/1720890