optimizing llvmpipe vertex/fragment processing

Around two years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit and found that although llvmpipe threads the rasterization, fragment shading and blending stages, it never did anything else while those were happening.

I dug into the code, as I clearly remembered seeing the concept of a "scene" into which all the primitives were binned before being dispatched. It turned out the scene was always executed synchronously.
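To make that concrete, here's a rough sketch of the synchronous model. The names are made up for illustration, not llvmpipe's actual structures, but the shape is the same: primitives get binned into per-tile command lists inside a scene, and then the whole scene executes before the draw returns.

    /*
     * Hypothetical sketch of synchronous scene execution; these are
     * illustrative names, not the real llvmpipe API.
     */
    #include <stddef.h>

    struct cmd;    /* a rasterize/shade command for one tile */
    struct prim;   /* a post-vertex-shading primitive */

    #define TILES_X 16
    #define TILES_Y 16

    struct bin {
       struct cmd *cmds;   /* commands binned to this screen tile */
       size_t num_cmds;
    };

    struct scene {
       struct bin tile[TILES_X][TILES_Y];
    };

    static void bin_primitive(struct scene *scene, const struct prim *prim)
    {
       /* append commands to every tile the primitive touches */
    }

    static void rast_execute_scene(struct scene *scene)
    {
       /* rasterizer threads shade all the binned tiles */
    }

    static void draw_synchronous(struct scene *scene,
                                 const struct prim *prims, size_t n)
    {
       for (size_t i = 0; i < n; i++)
          bin_primitive(scene, &prims[i]);

       /* the driver thread sits here until rasterization finishes,
          doing no vertex work for the next scene in the meantime */
       rast_execute_scene(scene);
    }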

At the time I wrote support to allow multiple scenes to exist, so that while one scene was executing, vertex shading and binning for the next scene could proceed and be queued up behind it. For heaven I saw places where it would build up 36 scenes. However, heaven was still running at 1fps with tessellation, regressions in other areas were rampant, and I mostly left the patches in a branch.
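The overlapped model is essentially a bounded producer/consumer queue of scenes. Here's an illustrative sketch (hypothetical names again, using pthreads for brevity): the driver thread enqueues fully-binned scenes while the rasterizer side dequeues and executes them, so binning the next scene overlaps shading the previous one.

    #include <pthread.h>

    struct scene;

    #define MAX_SCENES 4

    struct scene_queue {
       struct scene *scenes[MAX_SCENES];
       int head, tail, count;
       pthread_mutex_t lock;
       pthread_cond_t nonempty, nonfull;
    };

    /* driver thread: hand a fully-binned scene to the rasterizer */
    static void scene_enqueue(struct scene_queue *q, struct scene *s)
    {
       pthread_mutex_lock(&q->lock);
       while (q->count == MAX_SCENES)      /* throttle the producer */
          pthread_cond_wait(&q->nonfull, &q->lock);
       q->scenes[q->tail] = s;
       q->tail = (q->tail + 1) % MAX_SCENES;
       q->count++;
       pthread_cond_signal(&q->nonempty);
       pthread_mutex_unlock(&q->lock);
    }

    /* rasterizer thread: take the next scene to execute */
    static struct scene *scene_dequeue(struct scene_queue *q)
    {
       pthread_mutex_lock(&q->lock);
       while (q->count == 0)
          pthread_cond_wait(&q->nonempty, &q->lock);
       struct scene *s = q->scenes[q->head];
       q->head = (q->head + 1) % MAX_SCENES;
       q->count--;
       pthread_cond_signal(&q->nonfull);
       pthread_mutex_unlock(&q->lock);
       return s;
    }

A bounded queue like this also caps how far binning can run ahead; a workload that would otherwise build up dozens of scenes just throttles the producer once the queue fills.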

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for asynchronous pipeline processing. The concept of a fence signalled after the pipeline finished was there, but it wasn't used properly everywhere. A lot of operations assumed nothing was going on behind the scenes, so they never fenced. Queries broke because in the old model a query result was always ready, whereas now query availability could come back as unavailable, just like on a real hardware driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting in both GL and lavapipe. Lavapipe needed semaphore support that actually did something, as apps use semaphores between the render and present pieces of the pipeline.
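The query change is worth a sketch. This mirrors the shape of Gallium's get_query_result(ctx, query, wait, result) hook, with made-up names rather than the real driver code: with synchronous execution the result was always ready, but with scenes in flight a non-waiting caller has to be told "unavailable" until the fence for the scene that wrote the query has signalled.

    #include <stdbool.h>
    #include <stdint.h>

    struct fence { bool signalled; };   /* stub; real fences are shared state */

    static bool fence_signalled(struct fence *f) { return f->signalled; }
    static void fence_wait(struct fence *f) { while (!f->signalled) ; }

    struct query {
       struct fence *fence;   /* fence of the last scene writing this query */
       uint64_t result;
    };

    static bool get_query_result(struct query *q, bool wait, uint64_t *out)
    {
       if (!fence_signalled(q->fence)) {
          if (!wait)
             return false;    /* unavailable, just like a hw driver */
          fence_wait(q->fence);
       }
       *out = q->result;
       return true;
    }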

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games. It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks with the paraview traces and got:

  • pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
  • pv-waveletcontour fps +67.8306% +/- 11.4762% (n=3)

which seems like a good return on the investment.

I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I can get it landed in the next while, once I clean up any miscellaneous bits.
