Tuesday, November 02, 2010

2D musings

If you've been following graphics developments in the 2D world over the last few years, you've probably seen a number of blogs and articles complaining about performance, in particular about how slow 2D is on GPUs. Have you ever wondered why it's possible to make this completely smooth, yet your desktop still sometimes feels sluggish?

Bad model

For some weird reason ("neglect" being one of them) the 2D rendering model hasn't evolved at all in the last few years. That is, if it has evolved at all since the very first "draw line" became a function call. Draw line, draw rectangle, draw image, blit this, were simply joined by fill path, stroke path, a few extra composition modes and such. At its very core the model remained the same though, meaning lots of calls to draw an equally large number of small primitives.

This worked well because technically zero, or almost zero, setup code was necessary to start rendering. Then GPUs became prevalent and they could do amazing things, but to get them to do anything you had to upload the data and the commands that would tell them what to do. With time more and more data had to be sent to the GPU to describe the increasingly complex and larger scenes. It made sense to optimize the process of uploads (I keep calling them "uploads" but "GPU downloads" is closer to the true meaning) by allowing an entire resource to be uploaded once and then referred to via a handle. Buffers, shaders and the addition of new shading stages (tessellation, geometry) were all meant to reduce the amount of data that had to be uploaded to the GPU before every rendering.
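To make the "upload a resource once and refer to it via a handle" idea concrete, here's a minimal sketch using OpenGL buffer objects (assuming a context that exposes the GL 1.5 entry points; the two helper function names are made up for the example):

#include <GL/gl.h>

// Upload the vertex data to the GPU once and keep only the handle around.
GLuint createVertexBuffer(const float *vertices, GLsizeiptr byteSize)
{
    GLuint handle = 0;
    glGenBuffers(1, &handle);
    glBindBuffer(GL_ARRAY_BUFFER, handle);
    glBufferData(GL_ARRAY_BUFFER, byteSize, vertices, GL_STATIC_DRAW);
    return handle;
}

// Every frame we only refer to the handle; the vertex data never crosses the bus again.
void drawEveryFrame(GLuint handle, GLsizei vertexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, handle);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, nullptr);   // reads from the bound buffer
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}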

At least for games and well designed 3D software. 2D stuck to its old model of "make the GPU download everything on every draw request". It worked ok because most of the user interface was static and rather boring, so performance was never much of an issue. Plus in many cases the huge setup costs are offset by the fact that Graphics Processing Units are really good at processing graphics.

Each application is composed of multiple widgets, each widget draws itself using multiple primitives (pixmaps, rectangles, lines, paths), and each primitive first needs to upload the data the GPU needs to render it. It's like that because from the 2D api perspective there's no object persistence. The api has no idea that you keep re-rendering the same button over and over again. All the api sees is another "draw rectangle" or "draw path" call which it will complete.

On each frame the same data is being copied to the GPU over and over again. It's not very efficient, is it? There's a limited number of optimizations you can do in this model. Some of the more obvious ones include:
  • adding unique identifiers to the pixmaps/surfaces and using those identifiers as keys in a texture cache, which lets you create a texture for each pixmap/surface only once (a minimal sketch follows this list),
  • collecting data from each draw call in a temporary buffer and copying it all at once (e.g. in SkOSWindow::afterChildren, QWindowSurface::endPaint or such),
  • creating a shader cache for different types of fills and composition modes
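Here's a minimal sketch of what the texture cache from the first point could look like (plain C++ with invented names; createTextureFromPixels is a stand-in for whatever upload routine the backend actually uses):

#include <cstdint>
#include <unordered_map>

using TextureHandle = std::uint32_t;

// Invented stand-in for the backend upload call (e.g. glTexImage2D behind the scenes).
TextureHandle createTextureFromPixels(const void *pixels, int width, int height);

class TextureCache {
public:
    // pixmapId is the unique identifier attached to the pixmap/surface.
    TextureHandle lookup(std::uint64_t pixmapId, const void *pixels, int width, int height)
    {
        auto it = cache.find(pixmapId);
        if (it != cache.end())
            return it->second;                 // already on the GPU, no upload needed
        TextureHandle tex = createTextureFromPixels(pixels, width, height);
        cache.emplace(pixmapId, tex);          // upload once, reuse afterwards
        return tex;
    }
    void invalidate(std::uint64_t pixmapId) { cache.erase(pixmapId); }
private:
    std::unordered_map<std::uint64_t, TextureHandle> cache;
};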

But the real problem is that you keep making the GPU download the same data every frame, and unfortunately that is really hard to fix in this model.

Fixing the model

It all boils down to creating some kind of a store where the lifetime of an object/model is known. This way the scene knows exactly what objects are being rendered, and before rendering begins it can initialize and upload all the data the items need to be rendered. Then rendering is just that - rendering. Data transfers are limited to object addition/removal or significant changes to their properties, and are further limited by the fact that a lot of the state can always be reused. Note that trivial things like changing the texture (e.g. on hover/push) don't require any additional transfers, and things like translations can be limited to just two floats (translation in x and y) which are usually shared by multiple primitives (e.g. in a pushbutton they would be used by the background texture and the label texture/glyphs).
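A minimal sketch of what such a store could look like, with invented types rather than any actual scene-graph api:

#include <memory>
#include <vector>

// GPU-side resources are uploaded once and referenced by handle afterwards.
struct GpuTexture { unsigned handle; };
struct GpuGeometry { unsigned vertexBuffer; unsigned vertexCount; };

// A retained node: the scene knows it exists until it is explicitly removed,
// so its data only has to be uploaded when it is added or really changes.
struct Node {
    std::shared_ptr<GpuGeometry> geometry;  // often shared between many nodes
    std::shared_ptr<GpuTexture>  texture;   // swapping this (e.g. on hover) costs nothing
    float tx = 0.0f, ty = 0.0f;             // per-node translation: just two floats
    std::vector<std::shared_ptr<Node>> children;
};

struct Scene {
    Node root;
    // Adding/removing nodes are the only points where new data may need uploading;
    // rendering just walks the tree and issues draws from already-resident data.
};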

It would seem like the addition of QGraphicsView was a good time to change the 2D model, but that wasn't really possible because people like their QPainter. No one likes it when a tool they have been using for a while and are fairly familiar with is suddenly taken away. Completely changing the model required a more drastic move.

QML and scene-graph

QML fundamentally changes the way we create interfaces and it's very neat. From the api perspective it's not much different from JavaFX, and one could argue which one is neater/better, but QML allows us to almost completely get rid of the old 2D rendering model and that's why I love it! A side effect of moving to QML is likely the most significant change we've made to accelerated 2D in a long time. The new Qt scene graph is a very important project that can make a huge difference to the performance, look and feel of 2D interfaces.
Give it a try. If you don't have OpenGL working, no worries, it will work fine with Mesa3D on top of llvmpipe.

A nice project would be doing the same in web engines. We have all the information there, but we decompose it into draw line, draw path, draw rectangle and draw image calls. Short of the canvas object, which needs the old-style painters, everything is there to make accelerated web engines a lot better at rendering the content.

Thursday, October 28, 2010

Intermediate representation again

I've written about intermediate representation a few times already, but I've been blogging so rarely that it feels like eons ago. It's an important topic, and one that is made a bit more convoluted by Gallium.

The intermediate representation (IR) we use in Gallium is called Tokenized Gallium Shader Instructions, or TGSI. In general when you think about an IR you think about some middle layer: you have a language (e.g. C, GLSL, Python, whatever) which is compiled into some IR, which is then transformed/optimized and finally compiled into some target language (e.g. x86 assembly).

This is not how Gallium and TGSI work or were meant to work. We realized that people were making that mistake, so we tried to back-pedal on the usage of the term IR when referring to TGSI and started calling it a "transport" or "shader interface", which better describes its purpose but is still pretty confusing. TGSI was simply not designed as a transformable representation. It can be done, but it's a lot like a dance-off at a geek conference - painful and embarrassing for everyone involved.

The way it was meant to work was:

Language -> TGSI -> [ GPU-specific IR -> transformations -> GPU ]

with the parts in the brackets living in the driver. Why like that? Because GPUs are so different that we thought each of them would require its own transformations and would need its own IR to operate on. Because we're not compiler experts and didn't think we could design something that would work well for everyone. Finally, and most importantly, because it's how Direct3D does it. The Direct3D functional specification is unfortunately not public, which makes this a bit hard to explain.

The idea behind it was great in both its simplicity and its overabundance of misplaced optimism. All graphics companies had Direct3D drivers; they all had working code that compiled from Direct3D assembly to their respective GPUs. "If TGSI is a lot like Direct3D assembly then TGSI will work with every GPU that works on Windows, plus wouldn't it be wonderful if all those companies could basically just take that Windows code and end up with a working shader compiler for GNU/Linux?!", we thought. Logically that would be a nice thing. Sadly, companies do not like to release parts of their Windows driver code as Free Software, and sometimes it's not even possible. Sometimes the Windows and Linux teams never talk to each other. Sometimes they just don't care. Either way, our lofty goal of making the IR so much easier and quicker to adopt took a pretty severe beating. It's especially disappointing since if you look at some of the documentation, e.g. for the AMD Intermediate Language, you'll notice that this stuff is essentially Direct3D assembly, which is essentially TGSI (and most of the parts that are in AMD IL and not in TGSI are parts that will be added to TGSI). So they have this code. In the case of AMD it's even sadder, because the crucial code that we need for OpenCL right now is an OpenCL C -> TGSI LLVM backend, which AMD already has for their IL. Some poor schmuck will have to sit down and write more or less the same code. Of course if it's going to be me it's "poor, insanely handsome and definitely not a schmuck".

So we're left with Free Software developers who don't have access to the Direct3D functional spec and who are being confused by an IR which is unlike anything they've seen (pre-declared registers, typeless...) and which on top of that is not easily transformable. TGSI is very readable, simple and pretty easy to debug though, so it's not all negative. It's also great if you never have to optimize or transform its structure, which unfortunately is rather rare.
If we abandon the hope of having code from Windows drivers injected into the GNU/Linux drivers, it becomes pretty clear that we could do better than TGSI. Personally I just abhor the idea of rolling our own IR - an IR in the true sense of the word. Crazy as it may sound, I'd really like my compiler stuff to be written by compiler experts. It's the main reason why I really like the idea of using LLVM IR as our IR.

Ultimately it's all kind of taking the "science" out of "computer science" because it's very speculative. We know AMD and NVIDIA use it to some extent (and there's an open PTX backend for LLVM), we like it, we use it in some places (llvmpipe), the people behind LLVM are great and know their stuff, but how hard is it to use LLVM IR as the main IR in a graphics framework, and how hard is it to code-generate directly from it for GPUs? We don't really know. It seems like a really good idea, good enough for the folks from LunarG to give it a try, which I think is really what we need: a proof that it is possible and doesn't require sacrificing any farm animals. Which, as a vegetarian, I'd be firmly against.

Thursday, July 01, 2010

Graphics drivers

There are only two tasks harder than writing Free Software graphics drivers. One is running a successful crocodile petting zoo, the other is wireless bungee jumping.

In general writing graphics drivers is hard. The number of people who can actually do it is very small, and the ones who can do it well are usually doing it full-time already. Unless the company those folks work for supports open drivers, the earliest someone can start working on open drivers is the day the hardware is officially available. That's already about two years too late, maybe a year if the hardware is just an incremental update. Obviously not a lot of programmers have the motivation to do that: a small subset of the already very small subset of programmers who can write drivers. Each of them is worth their weight in gold (double, since they're usually pretty skinny).

Vendors who ship hardware with GNU/Linux don't see a lot of value in having open graphics drivers on those platforms. All the Android phones are a good example of that. Then again, Android decided to use Skia, which right now is simply the worst of the vector graphics frameworks out there (Qt and Cairo being the two other notables). Plus having to wrap every new game port (all that code is always C/C++) with the NDK is a pain. So the lack of decent drivers is probably not their biggest problem.

MeeGo has a much better graphics framework, but we're a ways off from seeing anyone ship devices running it, Nokia in the past did "awesome" things like releasing GNU/Linux devices with a GPU but without any drivers for it (N800, N810), and Intel's Poulsbo/Moorestown graphics driver woes are well-known. On the desktop side it seems that the Intel folks are afraid that porting their drivers to Gallium will destabilize them. Which is certainly true, but the benefits of doing so (multiple state trackers, cleaner driver code, being able to use Gallium debugging tools like trace/replay/rbug, and nicely abstracting the api code from the drivers) would be well worth it and hugely beneficial to everyone.

As it stands we have an excellent framework in Gallium3D but not a lot of open drivers for it. Ironically it's our new software driver, llvmpipe, or more precisely a mutation of it, which has the potential to fix some of our GPU issues in the future. With the continuing generalization of GPUs, my hope is that all we'll need is DRM code (memory management, kernel modesetting, command submission) and an LLVM->GPU code generator. That's not exactly a trivial amount of code by any stretch of the imagination, but it's smaller than what we'd need right now, and once it was done for one or two GPUs it would certainly become a lot simpler. Plus GPGPU will eventually make the latter part mandatory anyway. Having that would get us a working driver right away, and after that we could play with texture sampling and vertex paths (which will likely stay as dedicated units for a while) as optimizations.

Of course it would be even better if a company shipping millions of devices with GNU/Linux wanted a working graphics stack from the bottom up (memory management, kernel modesetting, Gallium drivers with multiple state trackers, optimized for whatever 2D framework they use), but that would make sense. "Is that bad?" you ask. Oh, it's terrible, because the GNU/Linux graphics stack on all those shipped devices is apparently meant to defy logic.

Monday, June 21, 2010

No hands!

I've spent the last few weeks in London with Keith Whitwell and José Fonseca, the greatest computer graphics experts since the guy who invented the crystal ball in the Lord of the Rings, the one that allowed you to see your future in full immersive 3D. Which begs the question: do you think there was an intermediate step to those? I mean a crystal ball that showed you your future but in 2D, all pixelated and in monochrome. Like Pong. You got 8x8 pixels and then the wizard/witch really needed to get their shit together to figure out what was going on, e.g.
- "This seems to be the quintessential good news/bad news kind of a scenario: good news - you'll be having sex! Bad news - with a hippo.",
- "There's no hippos in Canada...",
- "Right, right, I see how that could be a problem. How about elephants with no trunks?",
- "None of those",
- "Walls maybe?",
- "Yea, got those",
- "Excellent! Well then you'll probably walk into one.".
Just once I'd like to see a movie where they're figuring this stuff out. If you shoot it, I'll write it!

Getting back on track: we've got a lot of stuff done. I've worked mainly on the Gallium interface, trying to update it a bit for modern hardware and APIs.

First of all, I've finally fixed geometry shaders in our software driver. The interface for geometry shaders is quite simple, but the implementation is a bit involved and I had never done it properly. Until now, that is. In particular, emitting multiple primitives from a single geometry shader never worked properly. Adjacency primitives never worked. Drawing an indexed vertex buffer with a geometry shader never worked properly, and texture sampling in GS was simply never implemented. All of that is fixed now and looks very solid.

Second of all, stream out, aka transform feedback, is in, both in terms of interface additions and the software implementation. Being able to process a set of vertices and stream them out into a buffer is a very useful feature. I still enjoy vector graphics, and there processing a set of data points is a very common operation.
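For readers who know this feature from the GL side: it's the same idea as transform feedback there. A rough sketch of the usage pattern (assuming a current GL 3.0 context and a program object with shaders already attached; program, maxVertices and vertexCount are placeholders, "processedPosition" stands for whatever output the vertex shader declares, and error handling is omitted):

// Tell GL which vertex shader output to capture, then link.
const char *varyings[] = { "processedPosition" };
glTransformFeedbackVaryings(program, 1, varyings, GL_INTERLEAVED_ATTRIBS);
glLinkProgram(program);

// A buffer that will receive the post-transform vertices.
GLuint streamOut = 0;
glGenBuffers(1, &streamOut);
glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER, streamOut);
glBufferData(GL_TRANSFORM_FEEDBACK_BUFFER, maxVertices * 4 * sizeof(float),
             nullptr, GL_STATIC_READ);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, streamOut);

glEnable(GL_RASTERIZER_DISCARD);          // we only want the streamed-out data
glBeginTransformFeedback(GL_TRIANGLES);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
glEndTransformFeedback();
glDisable(GL_RASTERIZER_DISCARD);
// streamOut now holds the processed vertices, ready to be reused again and again.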

I've also added the concept of resources to our shader representation, TGSI. We needed a way of arbitrarily binding resources to samplers and defining load/sample formats for resources in shaders. The feature comes with a number of new instructions. The most significant in terms of their functionality are the load and gather4 instructions, because they represent the first GPGPU-centric features we've added to Gallium (obviously both are very useful in graphics algorithms as well).

Staying in TGSI land, we have two new register files. I've added immediate constant arrays and temporary arrays. Up til now we had no real way of declaring a set of temporaries or immediates as an array. We would imitate that functionality by declaring ranges, e.g. "DCL TEMP[0..20]", which becomes very nasty when a shader uses multiple indexable arrays; it also makes bounds checking almost impossible.

A while back Keith wrote Graw, which is a convenience state tracker that exposes the raw Gallium interface. It's a wonderful framework for testing Gallium. We have added a few tests for all the above-mentioned features.

All of those things are more developer features than anything else, though. They all need to be implemented in state trackers (e.g. GL) and drivers before they really go public, and that's the next step.

Tuesday, April 27, 2010

Geometry Processing - A love story

It's still early in the year but I feel like this is the favorite for the "best computer science related blog title of 2010". I'd include 2009 in that as well, but realistically speaking I probably wrote something equally stupid last year and I don't have the kind of time required to read what I write.

Last week I merged the new geometry processing pipeline and hooked it into our new software driver, llvmpipe. It actually went smoother than I thought it would, i.e. it just worked. I celebrated by visiting the ER, a harsh reminder that my lasting fascination with martial arts will be the end of me (it will pay dividends once llvmpipe becomes self-aware, though).

The code paths are part of the Gallium3D Draw module. It's fairly simple once you get into it: at a draw call we generate the optimal vertex pipeline for the currently set state. LLVM makes this stuff a lot easier. First of all, we get human-readable LLVM IR, which is a lot easier than assembly to go over when something goes wrong. Running LLVM optimization passes over a fairly significant amount of code is a lot easier than having to hand-optimize assembly code generation. Part of it is that geometry processing is composed of a few fairly distinct phases (e.g. fetch, shade, clip, assemble, emit), and the fact that they all can and will change depending on the currently set state makes it difficult to code-generate optimal code by hand. That is, unless you have a compiler framework like LLVM. Then you don't care and hope LLVM will bail you out in cases where you end up doing something stupid for the sake of code simplicity.
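The state-dependent composition of those phases can be pictured roughly like this (a toy sketch in plain C++, not the actual Draw module api, and without the LLVM code generation):

#include <functional>
#include <vector>

struct Vertex { float position[4]; float color[4]; };
using Stage = std::function<void(std::vector<Vertex> &)>;

// Toy stand-ins for the real stages; each one transforms the vertex batch in place.
static Stage fetchStage()        { return [](std::vector<Vertex> &) { /* read the input buffers */ }; }
static Stage vertexShaderStage() { return [](std::vector<Vertex> &) { /* run the generated shader */ }; }
static Stage clipStage()         { return [](std::vector<Vertex> &) { /* clip against the frustum */ }; }
static Stage emitStage()         { return [](std::vector<Vertex> &) { /* hand off to the rasterizer */ }; }

// Build the vertex pipeline once per state change, not once per draw call.
std::vector<Stage> buildPipeline(bool needsClipping)
{
    std::vector<Stage> pipeline;
    pipeline.push_back(fetchStage());
    pipeline.push_back(vertexShaderStage());
    if (needsClipping)                 // stages are only present when the state needs them
        pipeline.push_back(clipStage());
    pipeline.push_back(emitStage());
    return pipeline;
}

void runPipeline(const std::vector<Stage> &pipeline, std::vector<Vertex> &batch)
{
    for (const Stage &stage : pipeline)
        stage(batch);
}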

A good example of that is our use of allocas for all variables. Initially all variables were in registers, but I switched the code to use allocas for a very simple reason: doing flow control in SOA mode when everything was in registers was tough, since it meant keeping track of the PHI nodes, not to mention that we had no good way of doing indirect addressing in that scheme. Using allocas makes our code a lot simpler. In the end, thanks to LLVM optimization passes (mem2reg in this case), virtually every use of an alloca is eliminated and replaced with direct register access.

Anyway, with the new geometry paths the improvements are quite substantial. For vertex processing dominated tests (like geartrain, which went from 35fps to 110fps) the improvements are between 2x and about 6x; for cases which are dominated by fragment processing it's obviously a lot less (e.g. openarena went from about 25fps to about 32fps). All in all llvmpipe is looking real good.

Thursday, March 11, 2010

The software renderer

There's a number of great software renderers out there, SwiftShader and DirectX WARP being the two widely known ones. Unfortunately GNU/Linux, and Free Software in general, didn't have a modern software renderer. That's about to change thanks to a project started by José Fonseca. The project is known as llvmpipe (obviously a development name). José decided that the way forward is writing a driver for Gallium which code-generates the rendering paths at run time from the currently set state. LLVM is used to code-generate and optimize that code.

Besides the idea itself, my favorite part of the driver is the software rasterizer. Keith Whitwell, Brian Paul and José implemented a threaded, tiled rasterizer which performs very nicely and scales pretty well with the number of cores. I'm sure one of them will write more about it when they have a bit of spare time.

Currently the entire fragment pipeline is code-generated. Over the last two weeks I've been implementing the vertex pipeline, which I'm hoping to merge soon (hence the light smile). Code-generating the entire vertex pipeline isn't exactly trivial, but one can divide it into individual pieces and that makes it a bit easier: start with the vertex shader, then go back and do the fetch and translate, then move forward again and do the emit, then go back and do the viewport transformations and clipping and so on, and finally combine all the pieces together.
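As an example of one of those individual pieces, this is roughly the math the viewport transformation step performs (a sketch of the standard clip-space to window-coordinates mapping, not the driver's actual code):

struct Viewport { float x, y, width, height, nearZ, farZ; };

// Map a clip-space position to window coordinates: perspective divide first,
// then scale and bias the normalized device coordinates into the viewport.
void viewportTransform(const float clip[4], const Viewport &vp, float window[3])
{
    const float invW = 1.0f / clip[3];
    const float ndcX = clip[0] * invW;
    const float ndcY = clip[1] * invW;
    const float ndcZ = clip[2] * invW;

    window[0] = vp.x + (ndcX * 0.5f + 0.5f) * vp.width;
    window[1] = vp.y + (ndcY * 0.5f + 0.5f) * vp.height;
    window[2] = vp.nearZ + (ndcZ * 0.5f + 0.5f) * (vp.farZ - vp.nearZ);
}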

In between working on the vertex pipeline I've been filling in some missing pieces in the shader compilation, in particular the control flow. We use the SOA layout, which always makes control flow a bit tricky. I've just committed support for loops, and the only thing left is support for subroutines in shaders, so I think we're in pretty good shape. We can't rock the speedos quite yet, but we're getting there. It's my new measurement of software quality - could it pull off the speedos look? There are few things in this world that can.
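For the curious: SOA makes control flow tricky because one "register" holds a value for several pixels/vertices at once, so a branch can be taken by some lanes and not others. The usual trick is to evaluate both sides and select per lane with an execution mask, roughly like this (plain C++, four lanes, nothing driver-specific):

#include <array>
#include <cstdio>

constexpr int kLanes = 4;
using Vec = std::array<float, kLanes>;   // one SOA "register": a value per lane
using Mask = std::array<bool, kLanes>;   // which lanes are currently active

// if (x > 0) y = x * 2; else y = -x;  executed in SOA style.
Vec soaConditional(const Vec &x)
{
    Mask cond;
    Vec thenVal, elseVal, y;
    for (int i = 0; i < kLanes; ++i) cond[i] = x[i] > 0.0f;
    for (int i = 0; i < kLanes; ++i) thenVal[i] = x[i] * 2.0f;  // both branches run
    for (int i = 0; i < kLanes; ++i) elseVal[i] = -x[i];
    for (int i = 0; i < kLanes; ++i) y[i] = cond[i] ? thenVal[i] : elseVal[i]; // select by mask
    return y;
}

int main()
{
    Vec y = soaConditional({1.0f, -2.0f, 3.0f, -4.0f});
    std::printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
    return 0;
}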

Keeping in mind that we haven't even started optimizing it yet, as we're still implementing features, the driver on my Xeon E5405 runs the anholt.dm_68 OpenArena demo at 25fps (albeit with some artifacts), which is quite frankly pretty impressive, especially if you compare it to the old Mesa3D software renderer, which runs the same demo on the same machine at 3.5fps.

Wednesday, February 10, 2010

3D APIs

Do you know where babies come from? It's kinda disturbing, isn't it? I blame it on global warming. Stay with me here and I'll lead you through this: 1) storks deliver babies (fact), 2) I, not being very careful about how I word things, cause global warming (also a fact), 3) global warming kills storks (my theory), 4) evolution takes over and replaces storks with the undignified system that we have right now (conclusion).

Speaking about not being very careful about how I word things, how about me announcing those DirectX state trackers coming to GNU/Linux in my last blog, eh? By the way, now that's a good transition. The announcement was very exciting, albeit a little confusing, in particular to myself. Likely because it's really not what I meant. Mea culpa!

What I was talking about is adding features to the Gallium interfaces that will allow us to support those APIs, not about actually releasing any of those APIs as state trackers for Gallium. Gallium abstracts communication with the hardware, and support for OpenGL 3.2 plus a few extensions and for Direct3D 10.x requires essentially the same new interfaces in Gallium. So support for Direct3D 10.x means that we can support OpenGL 3.2 plus those extra extensions. I don't think it's a big secret that while we do have and are shipping a Direct3D 9 state tracker and are working on a Direct3D 10 state tracker, to my knowledge we simply don't have any plans to release that code. It's Windows specific anyway, so it wouldn't help the Free Software community in any shape or form.

I know there's that misguided belief that a Direct3D state tracker would all of a sudden mean that all Windows games work on GNU/Linux. That's just not true; there's a lot more to an API than "draw those vertices and shade them like that": windowing system integration, file I/O, networking, input devices, etc. If software is DirectX specific then it's pretty clear that the authors didn't use X11 and POSIX apis to implement all that other functionality. Those windows.h/windowsx.h includes are pretty prominent in all of those titles. So one would have to wrap the entire Windows API to be able to play DirectX games. That's what Wine is about, and they're better off using OpenGL like they are now.

The issue with gaming on GNU/Linux isn't the lack of any particular API, it's the small gaming market. If an IHV released a device that 500 million people would buy, and the only programming language supported was COBOL and the only 3D API was PHIGS, then we would see lots of games written in COBOL using PHIGS. Sure, the developers wouldn't like it, but the brand new Ferraris that the company could buy after the release of their first title, for everyone including all the janitors, would make up for that.

The bottom line is that the majority of games could easily be ported to GNU/Linux, since the engines are usually cross-platform anyway, at the very least to support different consoles. DirectX, libgcm, OpenGL ES... either they're all already supported or the engines could easily be made to support them. It's simply not cost-effective to do so. A good example is the Palm Pre, which runs GNU/Linux: even before the official release of their 3D SDK (the PDK, which uses SDL) they got EA Mobile to port Need for Speed, The Sims and others to their device.
On the other hand, if a title supports only DirectX or only libgcm, it's usually because it's exclusive to the given platform, and the presence of the same API on other platforms will not make the vendor suddenly release the title for the new system.

So yeah, Direct3D on GNU/Linux simply means nothing. We won't get more games, it won't make it easier to port the already cross-platform engines, it won't allow porting of the exclusive titles, and it will not fill any holes in our gaming SDKs. Besides, ethically speaking we should support OpenGL, not a closed API from a closed platform.

Monday, February 08, 2010

New features

Lately we've been working a lot on Gallium, but we haven't been talking about it publicly enough. I, personally, spent a considerable amount of time last week coming up with my strategy for the inevitable monster invasion. In general I always thought that if, sorry, "when" the monsters attack, I'd be one of the very first ones to go on account of my "there's a weird noise coming from that pitch black spot over there, I better check it out" tendency. Obviously that's not ideal, especially if the need to repopulate the world arises; I just really feel like I should be present for that. In conclusion I'll quickly go over some things we've been doing with Gallium. Admittedly that wasn't one of my best transitions.

There are three new major APIs that we want to support in Gallium: OpenCL 1.0, DirectX 10.x and DirectX 11. DirectX 10.x was somewhat prioritized of late because it's a good stepping stone for a lot of the features that we wanted.

Two of my favorites are geometry shaders and the new blending functionality. I want to start with the latter, which Roland worked on, because it has an immediate impact on Free Software graphics.

One of the things that drives people bonkers is text rendering, in particular subpixel rendering, or, if you're dealing with Xrender, component alpha rendering.
Historically both GL and D3D provided fixed-function blending which offered a number of ways of combining source colors with the contents of a render buffer. Unfortunately the inputs to the blending units were constrained to a source color, a destination color or constants that could be used in their place. That was unfortunate because the component alpha math requires two distinct values: the source color and the blending factor for the destination color (assuming the typical Porter & Duff "over" rendering). D3D 10 dealt with it by adding functionality called dual-source blending (OpenGL will get an extension for it sometime soon). The idea is that the fragment shader may output two distinct colors, which will be treated as two independent inputs to the blending units.
Thanks to this we can support subpixel rendering in a single pass with a plethora of compositing operators. Whether you're a green ooze trying to conquer Earth or a boring human, you have to appreciate 2x faster text rendering.
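To see why two outputs are needed, here is the per-channel component alpha "over" math written out (just the arithmetic, not any particular api):

struct Color { float r, g, b, a; };

// Component alpha "over": each channel of the mask (the subpixel coverage)
// scales the source and, multiplied by the source alpha, attenuates the destination.
// With one fragment output the hardware only sees "src * mask"; dual-source
// blending lets the shader also hand over "src.a * mask" as the per-channel
// destination factor, so the whole thing runs in a single fixed-function pass.
Color componentAlphaOver(Color src, Color mask, Color dst)
{
    Color out;
    out.r = src.r * mask.r + dst.r * (1.0f - src.a * mask.r);
    out.g = src.g * mask.g + dst.g * (1.0f - src.a * mask.g);
    out.b = src.b * mask.b + dst.b * (1.0f - src.a * mask.b);
    out.a = src.a * mask.a + dst.a * (1.0f - src.a * mask.a);
    return out;
}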

Geometry shaders introduce a new shader type, run after the vertices have been transformed (after the vertex shader) but before color clamping, flat shading and clipping.

Along with the support for geometry shaders we have added two major features to TGSI ("Tokenized Gallium Shader Instructions", our low-level graphics language).
The first one is support for properties. Geometry shaders in Gallium introduce the notion of a state-aware compile. This is because compilation of a geometry shader is specific to, at the very least, the input and output primitives it respectively operates on and outputs. We deal with it by injecting PROPERTY instructions into the program, like so:
GEOM
PROPERTY GS_INPUT_PRIMITIVE TRIANGLES
PROPERTY GS_OUTPUT_PRIMITIVE TRIANGLE_STRIP
(rest of geometry shader follows)
The properties are easily extendable and are the perfect framework to support things like work-group size in OpenCL.
The second feature is support for multidimensional inputs in shaders. The syntax looks as follows:
DCL IN[][0], POSITION
DCL IN[][1], COLOR
which declares two input arrays. Note that the size of the arrays is implicit, defined by the GS_INPUT_PRIMITIVE property.

It's nice to see this framework progressing so quickly.

Summing up, this is what yin yang is all about. Do you know what yin yang is? Of course you don't, you know nothing of Taoism. Technically neither do I, but a) it sounds cool, b) I've been busy coming up with a "monster invasion" contingency plan, so I couldn't be bothered with some hippy concepts, c) the previous two points are excellent.