Wednesday, February 06, 2008

GPGPU

Would you like to buy a vowel? Pick "j", it's a good one. So what if it's not a vowel. My blog, my rules. Lately I've had a major crush on all things "J". Which is why I moved to Japan.

It's part of my "Most expensive places in the world" tour, unlikely coming to a city near you. I lived in New York City, Oslo, London and now Tokyo. I'm going to write a book about all of that entitled "How to see the world while having no money whatsoever". It's really more of a pamphlet. I have one sentence so far "Find good friends" and the rest are just pictures of black (and they capture the very essence of it).

José Fonseca helped me immensely with the move to Japan, which was great. Japan is amazing, even though finding vegetarian food is almost like a puzzle game and trying to read Japanese makes me feel very violated. So if you live in Tokyo your prayers have been answered, I'm here for your pleasure. Depending on your definition of pleasure of course.

Short of that I've been working on this "graphics" thing. You might have heard of it. Apparently it's real popular in some circles. I've been asked about GPGPU a few times and since I'm here to answer all questions (usually in the most sarcastic way possible... don't judge me, bible says not to, I was born this way) I'm going to talk about GPGPU.

To do GPGPU there's ATI's CTM, NVIDIA's CUDA, Brook and a number of others. One of the issues is that there is no standard API for doing GPGPU across GPUs from different vendors, so people end up using e.g. OpenGL. So the question is whether Gallium3D could make such things as scatter reads accessible, without falling back to vertex shaders or a vertex shader/fragment shader combination to achieve them.

The core purpose of Gallium3D is to model the way graphics hardware actually works. So if the ability to do scatter reads is available in modern hardware, then Gallium3D will have support for it in the API. Having said that, it looks like scatter reads are usually done in a few steps: while some of the GPGPU-specific APIs expose them as one call, internally a number of cycles pass as a few instructions are executed to satisfy the request. As such, this functionality is obviously not the best thing to expose in a piece of code that models the way hardware works. One would implement that functionality on top of the API.
I do not have docs for the latest GPUs from ATI and AMD, so I can't say what it is that they definitely support. If you have that info, let me know. As I said, the idea is that if we see hardware supporting something natively, it will be exposed in Gallium3D.
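
For anyone unsure what is being asked for here, a rough sketch in plain Python. It has nothing to do with Gallium3D's actual interface; the function names `gather` and `scatter` are made up for illustration. In the usual GPGPU vocabulary a gather is an indexed read and a scatter is an indexed write:

```python
def gather(src, indices):
    """Indexed read: out[i] = src[indices[i]]."""
    return [src[j] for j in indices]


def scatter(values, indices, size, fill=0):
    """Indexed write: out[indices[i]] = values[i]."""
    out = [fill] * size
    for value, j in zip(values, indices):
        out[j] = value  # on duplicate indices the later write wins
    return out


data = [10, 20, 30, 40]
idx = [3, 0, 2, 1]

gathered = gather(data, idx)            # each output slot picks its own source
scattered = scatter(data, idx, size=4)  # each input element picks its destination
```

A fragment shader gets the first form for free through texture fetches, but its output position is fixed, which is why the second form is the one people keep asking about.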

Also, you wouldn't want to use Gallium3D as the GPGPU API. It is too low level for that and exposes vast parts of the graphics pipeline. What you (or "I" with vast amount of convincing and promises of eternal love) would do is write a "state tracker". State trackers are pieces of code layered on top of Gallium3D which are used to do state handling for the public API of your choice. Any API layered like this will execute directly on the GPU. I'm not 100% certain whether this will cure all sickness and stop world hunger, but it should do what even viagra never could for all GPGPU fanatics. The way this looks is a little like this:

This also shows an important aspect of Gallium3D: to accelerate any number of graphical APIs or to create a GPU-based non-graphics API, one doesn't need N drivers (with N being the number of APIs), as we currently do. A Gallium3D driver (that's singular!) is enough to accelerate 2D, 3D, GPGPU and my blog writing skills. What's even better is that of the aforementioned only the last one is wishful thinking.
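
The layering can be sketched as a toy in Python. To be loud about it: every class and method name below (`ToyDriver`, `bind_shader`, `run`, and both trackers) is hypothetical and is not the real Gallium3D interface. The only point is that two different public APIs end up talking to the same single driver object:

```python
class ToyDriver:
    """Stand-in for the one hardware driver (hypothetical interface)."""

    def __init__(self):
        self.log = []  # records every low-level call it receives

    def bind_shader(self, name):
        self.log.append(("bind_shader", name))

    def run(self, count):
        self.log.append(("run", count))


class ToyGraphicsTracker:
    """Pretend 3D API: translates draw calls into driver calls."""

    def __init__(self, driver):
        self.driver = driver

    def draw_triangles(self, n):
        self.driver.bind_shader("fragment")
        self.driver.run(n * 3)  # three vertices per triangle


class ToyComputeTracker:
    """Pretend GPGPU API: translates kernel launches into driver calls."""

    def __init__(self, driver):
        self.driver = driver

    def map_kernel(self, kernel_name, n_elements):
        self.driver.bind_shader(kernel_name)
        self.driver.run(n_elements)


driver = ToyDriver()  # one driver, shared by both trackers
ToyGraphicsTracker(driver).draw_triangles(2)
ToyComputeTracker(driver).map_kernel("saxpy", 1024)
```

Adding a third API in this toy means writing a third tracker class and leaving `ToyDriver` alone.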

So one would create some nice dedicated GPGPU API and put it on top of Gallium3D. Also, since Gallium3D started using LLVM for shaders, with minimal effort it's perfectly possible to put any language on top of the GPU.

And they lived happily ever after... "Who" did is a detail, since it's obvious they lived happily ever after thanks to Gallium3D.

18 comments:

Anonymous said...

Tokyo sounds like fun! Maybe you could meet up with Raster. Two of the best graphics hackers in one place? Hey it could be some pretty cool result!

stripe4 said...

What?

Boris Dušek said...

Hi Zack, I have 2 questions:

1. Do I understand correctly that Gallium 3D will make it easier to implement DirectX on Linux (i.e. that it would be written only once for all GPUs since all the GPU-specific stuff is wrapped in Gallium 3D API)? If yes, are you aware of any plans to do so (write from scratch/use some Wine code)? Sorry if this question does not make much sense, I don't know much about all this stuff.

2. I know you're gonna hate this question, but it must be asked :-) Any estimated time left for Gallium to be production-ready? (i.e. all major GPU drivers would be rewritten to Gallium 3D and X.org/Mesa/Linux distros will start using/shipping it as _the_ way of doing things)?

Thanks and keep up the good work!

mmmm said...

Boris Dušek: Why the hell do you want DirectX on Linux??? DirectX is a disgusting proprietary API, OpenGL is _much_ better and it is the standard for 3D graphics.

Or did you want only DirectX in Wine to be directly accelerated by Gallium3D (instead of Direct3D to OpenGL conversion, which is done today)? This is a much more feasible idea...

Boris Dušek said...

mmmm: AFAIK, OpenGL is standard for 3D graphics, except that DirectX is standard for 3D graphics in games. I asked question 1 with this in mind. I could of course be wrong with my assumption, but that's what I read.

Zack said...

Boris Dušek:
1) DirectX is a complete stack that is a lot more than 3D graphics. One could certainly use Gallium3D for the Direct3D part of it, either as a Windows-specific driver or within Wine. DirectX depends on and assumes Windows-specific APIs, so only these two combinations for Direct3D would work. At Tungsten we are using Gallium3D for Windows Direct3D work. I'm not aware of any efforts to bring Direct3D to GNU/Linux on top of Gallium3D though.

2) It's hard to say; if I had to guess, I'd say within a year.

ビクトル said...

Hi Zack, I've been reading your post for a while in planet kde.

I'm a Spaniard living in Tokyo as well. I don't know much about graphics but I'm a Linux & KDE fan, so if you wanna meet some day to drink some beers, eat something or whatever, contact me! :) ok? cya!

ruphy said...

Wow, that's awesome!
How much overhead would Gallium3D introduce doing its wrapper work?
Anyway, great, this really makes life much much easier...

Danny said...

From a scientific/HPC point of view, I don't like to unify and automate these backends.

I want reproducible results. Gallium optimising (and thus changing) my results unpredictably is not in my interest. Reproducibility is paramount.

I'd rather have a compiler that creates my shaders externally.
The ideal API would then just accept my pre-compiled shader.

If you want to discuss this further, just message me on freenode IRC. My nick is 'dvandyk'.

bartxxx said...

Tokyo? I heard from my sister Ann (Tomek's wife) about your plans. Fuck yeah. Tokyo is the best place in the world! I'm going to travel from Europe to Japan by hitch-hiking next year.

d3ce1t said...

Hi Zack. I have just heard about Gallium3d for the first-time. I'm impressed. I had known about Glucose, how does it fit in all this scenario?

Vlad said...

I am not sure what I just read, but it is amazing

DrYak said...

A slightly late comment, just to say "thank you" for your answers about GPGPU. It's exactly the information I needed (as I asked in a comment a couple of blog entries back) and the kind of wet dream I was hoping for:
- Gallium3D as a nice single backend to target for whatever GPGPU API is around (Brook+ for example)
- and LLVM as an optimizer for the code running on GPU to guarantee optimal execution performance.

(or "I" with vast amount of convincing and promises of eternal love)

I don't have virgins handy anymore to give you for some eternal love, but would some virgin-shaped chocolate statues do the trick ?
They melt faster than the real ones, but chocolate is probably the next best thing after sex.

Anonymous said...

Man, I've been busy with GPGPU steadily for a couple of months now... and it's a fascinating technology. I do wish it would become something like CUDA, a bit abstracted away in old-fashioned C.

Maybe one day we can remove opengl and directX and just implement our own API's on the gpu :)

-- Mark Czubin

Anonymous said...

@Mark Czubin

For high level abstraction of GPGPU API, Brook/Brook+ might be a good idea.

DrYak said...

Well, the current problem with BrookGPU/Brook+ is that, if you want GPU-accelerated computation, you have to either use the CAL backend, which has lots of interesting capabilities (scatters, multiple output streams, integer streams, etc.) but is specific to ATI hardware, OR use the OpenGL/GLSL backend, which is vendor-neutral but pretty much high-level and only exposes functionality that exists as a legit polygon operation in OpenGL. In that case you miss lots of operations which are very useful for streams and are indeed supported by the hardware underneath, but don't correspond to practical polygon operations.

To be able to do that kind of thing, one would need a lower-level API (exactly like CAL for Brook+, or like the driver API that exists in CUDA) but with a standardised interface. Gallium 3D could be an interesting solution.

The Gallium API itself is supposed to expose the basic building blocks of technology supported by modern shaders, not an actual 3D API. By using low-level shader blocks, perhaps even functions that are exposed by Gallium but aren't used by any of the Mesa/OpenGL3/WineD3D/etc. high-level APIs, one could implement more functions in a potential Brook+ Gallium back-end in a more vendor-neutral way.

Currently the CPU and OpenMP back end is the only alternative for such kind of advanced functions.

In addition, the LLVM Gallium plugin would come in handy (as I've said before) because:
- It can optimise code, which is very useful.
- It already has a front end able to understand C. This could make life easier for a potential Brook+/Gallium front end. Currently, you can have long chains such as:
kernel Brook C dialect -> Translated to Cg by BRCC -> converted into GLSL by Cg compiler -> uploaded to OpenGL by BrookGPU/OpenGL+GLSL frontend -> compiled by driver -> kernel is run.
With LLVM you could imagine the Gallium LLVM plugin being able to compile the C dialect of Brook's kernels without needing to hop through a couple of other languages first.

Anonymous said...

I agree with DrYak, since I've been busy with Shaders and CUDA for image processing. The difference is remarkable.

Just to show, where Shaders completely fail:

Let's say I need to process my data in a certain order, from left to right over a row. Let's say, I'm doing a sum prefix operation i.e. y(i) = y(i-1) + x(i).

With shaders, I have to do this with line fragments. With CUDA I can place y(i) in a local variable (register) and move on to the next data element. And do this in parallel for all my rows. (need scatter as well)
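
A minimal CPU-side sketch of that per-row pattern, in plain Python rather than CUDA. The names `prefix_sum_row` and `prefix_sum_rows` are made up here; the assumption is that each call to `prefix_sum_row` stands in for one GPU thread that owns one row and keeps its running total in a register:

```python
def prefix_sum_row(row):
    """Inclusive prefix sum: y[i] = y[i-1] + x[i], with y[-1] taken as 0."""
    out = []
    running = 0  # plays the role of the per-thread register
    for x in row:
        running += x
        out.append(running)
    return out


def prefix_sum_rows(image):
    """Rows are independent, so on a GPU each row could go to its own thread."""
    return [prefix_sum_row(row) for row in image]


image = [
    [1, 2, 3, 4],
    [5, 0, 5, 0],
]
result = prefix_sum_rows(image)
```

The loop inside a row is strictly left to right, while the outer loop over rows carries no dependency at all.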

Another example is a simple Haar wavelet. Per 2 elements I would need to save the difference and the average... with Shaders that means doing the same operation twice because we can't do scatter.
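
One level of that wavelet step might look like this in plain Python. This is a sketch under assumptions: the function name `haar_step` is made up, and halving the difference is just one common normalisation (other conventions scale differently). What matters is that a single pass over each pair produces two outputs that land in two different places:

```python
def haar_step(signal):
    """One level of a Haar-style transform on an even-length sequence.

    For each pair (a, b) this stores the average and half the difference,
    so that a = average + difference and b = average - difference.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    averages, differences = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        averages.append((a + b) / 2)     # first output location
        differences.append((a - b) / 2)  # second output location
    return averages, differences


averages, differences = haar_step([4, 2, 10, 6])
```

With only one output position per shader invocation, the pair has to be read and processed once for the average and once more for the difference.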

I do have to look at brook+ (AMD) some day.

I wish someone would kickstart a CUDA project for all three major gfx cards (thinking about Larrabee, not normal Intel gfx)

btw: not to mention the overhead by doing everything through the graphics api.

-- mark czubin

DrYak said...

@ mark>> CUDA for three :

I've played both with BrookGPU and CUDA.

CUDA is nice for the kernels (it's just plain C code with a couple of extensions to call them and specify memory layout), but I have some criticism :

- the main (host) code is an ugly pile of very nVidia-specific calls. One has to take care of all mid-level handling (cudaMalloc, texture binds, etc...)
in BrookGPU you just have a syntax extension and a new data type, the "stream"; everything else (allocating, binding, etc.) can be handled behind the scenes.

- the kernels lack lots of syntactic sugar : you have to do all the pointer arithmetics and calculating position using thread nums.
BrookGPU instead has "[i][j][k]" for random access to 3D arrays (without needing to alias pointers first) and "<x><y>" for a 2D array which automatically access the position corresponding.

- some additional memory access weirdness : device, const and shared are handled normally with pointers (add or take a couple of keyword to specify location), but textures are accessed using functions.

BrookGPU presents textures using the normal "[]" and "<>" syntax.

- CUDA's "device emulation mode" is a joke speed-wise. It's awfully slow and is only usable to debug programs using gdb or valgrind.
BrookGPU has a CPU backend, and there are efforts (in latest beta and CVS) to use OpenMP to take advantage of multiple cores.
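
The pointer-arithmetic complaint above can be sketched in plain Python. The helper `flat_index` and the row-major layout are assumptions chosen for illustration, not code taken from CUDA or BrookGPU:

```python
def flat_index(i, j, k, dims):
    """Row-major offset of element [i][j][k] in a flattened 3D array."""
    depth, height, width = dims
    if not (0 <= i < depth and 0 <= j < height and 0 <= k < width):
        raise IndexError("index out of range")
    return (i * height + j) * width + k


dims = (2, 3, 4)
volume = list(range(2 * 3 * 4))  # flattened storage, as a raw pointer would see it

offset = flat_index(1, 2, 3, dims)
value = volume[offset]  # what a "[1][2][3]" style access would return
```

This is the bookkeeping the kernel author writes by hand in the pointer style and gets for free with the bracket syntax.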

So it may not be an optimal solution for the other vendors to follow.
Plus CUDA is closed source and mainly geared toward nvidia's hardware. Both BrookGPU and Brook+ are opensource and are designed from the ground up to accept multiple back-ends.


The main advantage of CUDA is that it has a lot of marketing behind it, nVidia is producing rackmount units with several GPUs inside (whereas ATI only offers FireStream discrete cards and Larrabee doesn't exist at all yet), and it is an already stable product (although with ugly host-side code), whereas BrookGPU still had some rough edges and minor parsing bugs here and there.

That's why my current project was designed in CUDA.