Thursday, March 11, 2010

The software renderer

There are a number of great software renderers out there, SwiftShader and DirectX WARP being the two most widely known. Unfortunately GNU/Linux, and Free Software in general, didn't have a modern software renderer. That's about to change thanks to a project started by José Fonseca. The project is known as llvmpipe (obviously a development name). José decided that the way forward is writing a driver for Gallium which code-generates the rendering paths at run-time from the currently set state. LLVM is used to generate and optimize the code.

Besides the idea itself, my favorite part of the driver is the software rasterizer. Keith Whitwell, Brian Paul and José implemented a threaded, tiled rasterizer which performs very nicely and scales pretty well with the number of cores. I'm sure one of them will write more about it when they have a bit of spare time.
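One common way to structure a threaded, tiled rasterizer is to first bin triangles into screen tiles and then let worker threads process tiles independently. The sketch below, in Python, shows that general structure only; it is an assumption about the shape of such a design, not a description of the actual llvmpipe rasterizer. TILE, bin_triangles, rasterize_tile and render are hypothetical names, and because of Python's global interpreter lock this illustrates the data flow rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

TILE = 8  # hypothetical tile size in pixels


def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def bin_triangles(triangles, width, height):
    """Assign each triangle to every tile its bounding box touches."""
    bins = {}
    for tri_id, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        x0 = max(int(min(xs)) // TILE, 0)
        x1 = min(int(max(xs)) // TILE, (width - 1) // TILE)
        y0 = max(int(min(ys)) // TILE, 0)
        y1 = min(int(max(ys)) // TILE, (height - 1) // TILE)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins.setdefault((tx, ty), []).append(tri_id)
    return bins


def rasterize_tile(tile, tri_ids, triangles, width, height):
    """Return {(x, y): tri_id} for covered pixels of one tile; shares no mutable state."""
    tx, ty = tile
    covered = {}
    for y in range(ty * TILE, min((ty + 1) * TILE, height)):
        for x in range(tx * TILE, min((tx + 1) * TILE, width)):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            for tri_id in tri_ids:
                (ax, ay), (bx, by), (cx, cy) = triangles[tri_id]
                w0 = edge(ax, ay, bx, by, px, py)
                w1 = edge(bx, by, cx, cy, px, py)
                w2 = edge(cx, cy, ax, ay, px, py)
                same_side = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                            (w0 <= 0 and w1 <= 0 and w2 <= 0)
                if same_side:
                    covered[(x, y)] = tri_id  # later triangles overwrite earlier ones
    return covered


def render(triangles, width, height, num_threads=4):
    """Bin once on the calling thread, then rasterize tiles on a thread pool."""
    bins = bin_triangles(triangles, width, height)
    framebuffer = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(rasterize_tile, tile, ids, triangles, width, height)
                   for tile, ids in bins.items()]
        for future in futures:
            framebuffer.update(future.result())  # tiles never overlap, so no conflicts
    return framebuffer


if __name__ == "__main__":
    fb = render([((0, 0), (16, 0), (0, 16))], 16, 16)
    print(len(fb), "pixels covered")
```

Because each tile owns a disjoint set of pixels, the workers need no locking on the framebuffer, which is the property that lets this kind of design scale with the number of cores.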

Currently the entire fragment pipeline is code-generated. Over the last two weeks I've been implementing the vertex pipeline, which I'm hoping to merge soon (hence the light smile). Code generating the entire vertex pipeline isn't exactly trivial, but one can divide it into individual pieces and that makes it a bit easier. Start with the vertex shader, then go back and do the fetch and translate, then move forward again and do the emit, then go back and do the viewport transformations and clipping, and so on. Finally, combine all the pieces together.
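As a rough illustration of how those pieces fit together once combined, here is a small Python sketch with one function per stage. Everything in it is a simplifying assumption: the vertex format (three little-endian floats), the stand-in "shader" (a uniform scale), the clip test (a per-plane bit mask rather than real clipping) and all function names are hypothetical, and nothing here is generated code.

```python
import struct


def fetch(buffer: bytes, stride: int, index: int):
    """Fetch and translate: read three floats and widen them to (x, y, z, 1)."""
    x, y, z = struct.unpack_from("<3f", buffer, index * stride)
    return (x, y, z, 1.0)


def vertex_shader(pos, scale: float):
    """Stand-in for a real vertex shader: uniformly scale x, y and z."""
    x, y, z, w = pos
    return (x * scale, y * scale, z * scale, w)


def clip_mask(pos) -> int:
    """One bit per violated clip plane, testing -w <= c <= w for x, y and z."""
    x, y, z, w = pos
    mask = 0
    for i, c in enumerate((x, y, z)):
        if c < -w:
            mask |= 1 << (2 * i)
        if c > w:
            mask |= 1 << (2 * i + 1)
    return mask


def viewport(pos, width: int, height: int):
    """Perspective divide, then map [-1, 1] to window coordinates."""
    x, y, z, w = pos
    return ((x / w * 0.5 + 0.5) * width, (y / w * 0.5 + 0.5) * height, z / w)


def run_vertex_pipeline(buffer, stride, count, scale, width, height):
    """Combine the stages: fetch -> shader -> clip test -> viewport -> emit."""
    emitted = []
    for i in range(count):
        clip_pos = vertex_shader(fetch(buffer, stride, i), scale)
        emitted.append((viewport(clip_pos, width, height), clip_mask(clip_pos)))
    return emitted


if __name__ == "__main__":
    data = struct.pack("<6f", 0.0, 0.0, 0.0, 2.0, 0.0, 0.0)
    for window_pos, mask in run_vertex_pipeline(data, 12, 2, 1.0, 640, 480):
        print(window_pos, bin(mask))
```

Writing the stages as separate functions mirrors the order of work described above: each piece can be built and tested on its own before the final combined loop exists.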

In between working on the vertex pipeline I've been filling in some missing pieces in the shader compilation, in particular the control flow. We use the SOA layout, which always makes control flow a bit tricky. I've just committed support for loops and the only thing left is support for subroutines in shaders, so I think we're in pretty good shape. We can't rock the speedos quite yet, but we're getting there. It's my new measurement for software quality - could it pull off the speedos look? There are few things in this world that can.
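The reason control flow is tricky with an SOA layout is that several vertices or fragments are processed in lockstep, one per vector lane, yet each lane may want to take a different path. A common technique for this is an execution mask: every lane runs the same instructions, and the mask decides which lanes keep their results. The Python sketch below shows a masked loop using plain lists as stand-ins for vector registers. It is an illustration of the general technique, not the code llvmpipe generates, and the function names are hypothetical.

```python
def masked_loop(xs, limits):
    """Per lane: `while x < limit: x = x * 2`, with all lanes stepping together."""
    xs = list(xs)
    mask = [True] * len(xs)  # execution mask: which lanes are still in the loop
    while True:
        # Evaluate the loop condition on every lane; a lane that fails once stays off.
        mask = [m and x < lim for m, x, lim in zip(mask, xs, limits)]
        if not any(mask):
            break  # the loop ends only when every lane is done
        # The body runs on all lanes, but results are committed only where the mask is set.
        doubled = [x * 2 for x in xs]
        xs = [d if m else x for d, m, x in zip(doubled, mask, xs)]
    return xs


def scalar_reference(x, limit):
    """The same loop written for a single value, for comparison."""
    while x < limit:
        x = x * 2
    return x


if __name__ == "__main__":
    lanes = [1, 3, 10, 5]
    limits = [8, 4, 5, 100]
    print(masked_loop(lanes, limits))
    print([scalar_reference(x, lim) for x, lim in zip(lanes, limits)])
```

Note that the whole group iterates as many times as its slowest lane needs, which is one of the costs of handling divergent control flow this way.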

Keep in mind that we haven't even started optimizing it yet, as we're still implementing features. Even so, on my Xeon E5405 the driver runs the anholt.dm_68 OpenArena demo at 25 fps (albeit with some artifacts), which is quite frankly pretty impressive, especially compared to the old Mesa3D software renderer, which runs the same demo on the same machine at 3.5 fps.


Anonymous said...

Very cool!

Has Intel given you any support since they've tried for years to get effective CPU based rendering?

Lars said...

Great news!
At what resolution did you run your OpenArena test?

Brice Goglin said...

What kind of thread model do they use? OpenMP? pthreads?

Anonymous said...

Is LLVM trying to vectorize things? Or are you working on vectorization (SSE) from the start?

The question about pthreads is also interesting. Are you threading everything manually?

I have a question about the compilation strategy. Are you recompiling the whole rasterizer on the fly, for example when a game/application changes rendering options (i.e. fullscreen antialiasing or other global options)? This would generally make for a much smaller codebase, allow more aggressive optimization/inlining, and save some memory and branch mispredictions.

TechMage89 said...

llvm does take advantage of SIMD processing... that's a fundamental part of its design.

Having a decent software OpenGL implementation will be really nice for two reasons:

-Machines that don't have working hardware support for some reason will be a *lot* more usable.

-Potentially, machines that have only partial hw acceleration (e.g. my tablet with an Intel GMA950, which doesn't have hw T&L) will be a lot faster if llvmpipe can be hooked into existing drivers.

Zack said...

@Anonymous: No, not yet at least. It's just VMware folks right now. It's a good point though. We have been discussing the fact that, besides the purely software based options, llvmpipe would be an almost perfect bare-bones Larrabee driver (we would have to adjust vector width and sampling), so there certainly would be enough incentive for Intel to take an active role in llvmpipe.

@Lars: Actually I'm not even sure. Whatever the default is when running that anholt.dm_68 demo on a stock openarena install. I think it's 640x480.

@Brice Goglin: it's the Gallium thread lib which uses pthreads on GNU/Linux.

@Anonymous: We are using vectors which are a native type in LLVM and as TechMage89 pointed out SIMD is very much part of what LLVM does (but in some critical paths we are resorting to calling the SSE intrinsics through LLVM). And yes we are code-generating, recompiling and caching (within reason) the entire rendering pipeline when relevant state changes.

Yolande said...

Wonderful! Is this work open source, btw?

Anonymous said...

@Yolande: AFAIK yes. Look at git://

skierpage said...

Is any of this newfangled Gallium stuff in use in Linux distros, or is it still in the TomorrowLand of dev blog posts?

The Mesa FAQ doesn't say (no mention of Gallium in any of the obvious places). The Mesa OpenGL library has incorporated "Gallium3D infrastructure" since Mesa 7.5, but says nothing about when it's actually used.

AFAICT Gallium might be used in software OpenGL rendering, but as of October 2009 the Gallium ATI driver (different from my driver?) was not ready for primetime. But damn it's hard to figure out. Does glxgears -info print "ZOMG I'm using Gallium!!" to a terminal?

(I've heard a lot about Borat in a man thong, but I've never seen that in real life either.)

Petr Kobalíček said...

Hi Zack,

You said that Mesa renders the scene at 3.5 fps. I think that Mesa is not multithreaded, so if we compare performance per CPU core, we get 6.25 fps (25 / 4) for llvmpipe. This means that the presented backend is about twice as fast as Mesa per CPU core. Is that assumption correct?

Zack said...

@Petr Kobalíček: Not quite. There are a number of things that are important here. First, whether an algorithm is suited for parallelization is not a matter of just running it in a separate thread, so comparing an algorithm that was designed with parallelization in mind against one that was not, based on how they perform when they're not parallelized, is a bit silly. Second, performance doesn't scale linearly with the number of threads. The current algo scales very well, but obviously not linearly. And third, currently only the rasterizer is threaded, meaning that the rest of the pipeline runs in a single thread. So all in all the difference will be a lot bigger than 2x (and that magnitude will only be relevant to this particular benchmark). Having said that, one can force llvmpipe to run everything in a single thread (by exporting the LP_NUM_THREADS variable) for a more detailed analysis.
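For reference, the arithmetic in this exchange can be worked through as a rough sketch using Amdahl's law. The 25 fps and 3.5 fps figures and the division by four come from the post and the question above. The parallel fractions tried below are made-up illustration values, not measurements of llvmpipe, and amdahl_speedup is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's law: overall speed-up when only part of the work is threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


LLVMPIPE_FPS = 25.0   # from the post, with the threaded rasterizer
OLD_MESA_FPS = 3.5    # from the post, old software renderer
CORES = 4             # the divisor used in the question above

# The naive estimate assumes perfect scaling: 25 / 4 = 6.25 fps per core.
naive_single_thread_fps = LLVMPIPE_FPS / CORES
naive_ratio = naive_single_thread_fps / OLD_MESA_FPS

if __name__ == "__main__":
    print(f"naive: {naive_single_thread_fps:.2f} fps, {naive_ratio:.2f}x")
    # Hypothetical fractions of frame time spent in the threaded rasterizer.
    for fraction in (0.5, 0.8, 0.95):
        speedup = amdahl_speedup(fraction, CORES)
        single_thread_fps = LLVMPIPE_FPS / speedup
        ratio = single_thread_fps / OLD_MESA_FPS
        print(f"parallel fraction {fraction:.2f}: speed-up {speedup:.2f}x, "
              f"single-thread {single_thread_fps:.2f} fps, {ratio:.2f}x")
```

Whenever the threaded part is less than the whole frame, the four-core speed-up is below 4, so the estimated single-thread frame rate comes out above 6.25 fps and the per-core ratio above the naive figure.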

Petr Kobalíček said...


This is better than I expected. Building infrastructure for multithreaded rendering is really not easy. I tried to write a multithreaded 2D renderer in Fog-Framework and it works, but it could be much better. I designed the rendering to be asynchronous, but performance degrades when painting very small objects.

Thanks for your work, it's really interesting.

jdavis21 said...

Will this stuff run on Chrome Native Client?

emiretsk said...

Hey Zack,
What is the status of the vertex llvmpipe?
I think that right now it generates only the fetch, vertex shader and emit parts. Any plans to extend it to also generate clipping and viewport? How about rasterization?

Anonymous said...

Hi. I have a multicore software renderer (OpenGL based). It can utilize up to 4 CPU cores and has Linux and Windows support. It can run Quake 3 at playable speed (with some glitches) on a modern computer.

It has been unreleased for about 2 years now. I ran several polls and votes about it on big Linux community sites.

-on some sites, I got only 10-15 replies.
-on some sites, i got 20-30 reply, half of them saying: ITS SHIT, EVEN A GEFORCE2 IS MUTCH BETTER, GO TO THE FUCK! YOU SUCK!

But mostly: no interest. People just don't care. Sad.


Anonymous said...

SwiftShader started as open source, called swShader. Although the project was removed from SourceForge, you can still download it from