tinygrad + rusticl + aco: why not?

I recently came across tinygrad as a small powerful nn framework that had an OpenCL backend target and could run LLaMA model.

I've been looking out for rusticl workloads, and this seemed like a good one, and I could jump on the AI train, and run an LLM in my house!

I started it going on my Radeon 6700XT with the latest rusticl using radeonsi with the LLVM backend, and I could slowly interrogate a model with a question, and it would respond. I've no idea how performant it is vs ROCm yet which seems to be where tinygrad is more directed, but I may get to that next week.

While I was there though I decided to give the Mesa ACO compiler backend a go, it's been tied into radeonsi recently, and I done some hacks before to get compute kernels to run. I reproduced said hacks on the modern code and gave it a run.

tinygrad comes with a benchmark script called benchmark_train_efficientnet so I started playing with it to see what low hanging fruit I could find in an LLVM vs ACO shootout.

The bench does 10 runs, the first is where lots of compilation happens, the last is well primed cache wise. There are the figures from the first and last runs with a release build of llvm and mesa. (and the ACO hacks).

LLVM:

215.78 ms cpy,  12245.04 ms run,  120.33 ms build, 12019.45 ms realize,  105.26 ms CL,   -0.12 loss,  421 tensors, 0.04 GB used,      0.94 GFLOPS

10.25 ms cpy,   221.02 ms run,   83.50 ms build,   36.25 ms realize,  101.27 ms CL,   -0.01 loss,  421 tensors, 0.04 GB used,     52.11 GFLOPS

ACO:

71.10 ms cpy,  3443.04 ms run,  112.58 ms build, 3214.13 ms realize,  116.34 ms CL,   -0.04 loss,  421 tensors, 0.04 GB used,      3.35 GFLOPS
10.36 ms cpy,   234.90 ms run,   84.84 ms build,   36.51 ms realize,  113.54 ms CL,    0.05 loss,  421 tensors, 0.04 GB used,     49.03 GFLOPS

So ACO is about 4 times faster to compile but produces binaries that are less optimised.

The benchmark produces 148 shaders:

LLVM:

126 Max Waves: 16 
  6 Max Waves: 10
  5 Max Waves: 9
  6 Max Waves: 8
  5 Max Waves: 4


ACO:

 96 Max Waves: 16
 36 Max Waves: 12
  2 Max Waves: 10
 10 Max Waves: 8
  4 Max Waves: 4

So ACO doesn't quite get the optimal shaders for a bunch of paths, even with some local hackery I've done to make it do better.[1]

I'll investigate ROCm next week maybe, got a bit of a cold/flu, and large GPU stacks usually make me want to wipe the machine after I test them :-P

[1] https://gitlab.freedesktop.org/airlied/mesa/-/commits/radeonsi-rusticl-aco-wip


Comments

Popular posts from this blog

nvk: the kernel changes needed

video decode: crossing the streams