Distributed OpenGL Rendering

Written by Paul Bourke
July 1996


Introduction

The following outlines a method of distributing an OpenGL model across a number of computers, each with its own OpenGL rendering card. The OpenGL model might be distributed using MPI from a controlling machine (which need not have OpenGL capabilities itself). Each slave renders a subset of the geometry and sends its draw buffer and depth buffer back to the controlling machine, where the images are combined according to the values in their depth buffers.
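
As a rough sketch of the structure this implies (assuming MPI, with rank 0 as the controlling machine and a 4x4 viewing matrix as the per-frame state; RenderSubset() and MergeNearest() are hypothetical application routines, not library calls), the per-frame logic on each machine might look something like:

#include <mpi.h>

void DoFrame(int rank,int nproc,int width,int height,
             float camera[16],unsigned char *rgb,float *depth)
{
   int i;
   MPI_Status status;

   /* The controlling machine decides the camera for this frame */
   MPI_Bcast(camera,16,MPI_FLOAT,0,MPI_COMM_WORLD);

   if (rank != 0) {                 /* slave: render its share of the geometry */
      RenderSubset(camera,rgb,depth);
      MPI_Send(rgb,3*width*height,MPI_UNSIGNED_CHAR,0,1,MPI_COMM_WORLD);
      MPI_Send(depth,width*height,MPI_FLOAT,0,2,MPI_COMM_WORLD);
   } else {                         /* controller: gather and combine by depth */
      for (i=1;i<nproc;i++) {
         MPI_Recv(rgb,3*width*height,MPI_UNSIGNED_CHAR,i,1,MPI_COMM_WORLD,&status);
         MPI_Recv(depth,width*height,MPI_FLOAT,i,2,MPI_COMM_WORLD,&status);
         MergeNearest(rgb,depth);   /* fold into the final image kept elsewhere,
                                       keeping whichever pixel is closest so far */
      }
   }
}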

The problem

While tremendous performance improvements (especially in price/performance) are being made in OpenGL cards, there will always be geometric models that bring the best card to its knees. For example, a high performance card might be able to render 15 million triangular facets per second. For interactive rates of 20 frames per second this means one can display geometry with at most 750,000 polygons per frame; if one wishes to render in stereo that drops to models with 375,000 triangular polygons. While this might seem a large number of polygons to those in the virtual reality or games market, it is a relatively low polygon count for many scientific visualisation applications. As an example, the terrain model of Mars shown below contains over 11 million triangular polygons.

Possible Solution

A solution that follows the trends in cluster computing is to distribute the OpenGL rendering load among a number of machines. Fortunately, OpenGL maintains a depth buffer, and this buffer is accessible to the programmer using glReadPixels(). The idea then is to split the geometry making up the scene into pieces and give each piece to one OpenGL card, generally with each card in a separate computer, so that each card need only render a subset of the geometry. For a terrain model, for example, it is quite easy to split the polygons up evenly, so if there are N machines each one only handles 1/N of the total number of polygons (splitting the geometry so that each OpenGL card is evenly loaded is not always this straightforward).
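
For a simple triangle list the split can be as crude as giving each of the N machines a contiguous block of facets. A sketch only; ntriangles, rank, triangles and DrawTriangle() stand for whatever the application already uses, they are not OpenGL calls:

/* Machine "rank" (0..N-1) draws a contiguous 1/N block of the facets */
long first = (long)rank*ntriangles/N;
long last  = (long)(rank+1)*ntriangles/N;   /* one past the last facet */
long i;
for (i=first;i<last;i++)
   DrawTriangle(&triangles[i]);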

Each OpenGL card then renders its portion of the geometry and sends the resulting image and depth buffer to be merged into the final image. The logic for this is straightforward: set each pixel in the final image to the pixel from the sub-image with the smallest depth value, i.e. the one closest to the camera.
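
In code the merge might look something like the following sketch, assuming RGB images of 3 bytes per pixel and depth buffers of floats in the range 0 to 1 as returned by glReadPixels(). The final depth buffer is initialised to 1 (the far clipping plane) and the final image to the background colour before the first sub-image arrives.

/* Copy a pixel from the sub-image only if it is closer to the camera
   than whatever the final image currently holds at that position.    */
void Merge(unsigned char *finalrgb,float *finaldepth,
           unsigned char *subrgb,float *subdepth,int width,int height)
{
   long i;
   for (i=0;i<(long)width*height;i++) {
      if (subdepth[i] < finaldepth[i]) {
         finaldepth[i]   = subdepth[i];
         finalrgb[3*i  ] = subrgb[3*i  ];
         finalrgb[3*i+1] = subrgb[3*i+1];
         finalrgb[3*i+2] = subrgb[3*i+2];
      }
   }
}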

Example

Part 1
The model is made up of 3 parts, the first being the red core shown above. The next two pieces are shown below, along with their corresponding depth buffers on the right.

Part 2
OpenGL maintains a depth buffer as long as glEnable(GL_DEPTH_TEST) has been called. This depth buffer gives the depth from the camera for each pixel in the draw buffer. Points at infinity are shown as white in this example; those parts of the object closer to the camera tend towards black. In other rendering applications the depth buffer is often called the z-buffer.
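
The depth test is typically turned on once during initialisation and the depth buffer cleared at the start of every frame, along the lines of:

glEnable(GL_DEPTH_TEST);    /* maintain a depth value for every pixel      */
glClearDepth(1.0);          /* "infinity", i.e. the far clipping plane     */
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);   /* each frame        */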

Part 3
The depth buffer is accessed through the routine glReadPixels(), using something like
glReadPixels(0,0,width,height,GL_DEPTH_COMPONENT,GL_FLOAT,depthbuffer);
where width and height are the dimensions of the window, and depthbuffer is malloc'ed with something like the following
depthbuffer = malloc(width*height*sizeof(GLfloat));
Indeed this is the same way the image is acquired, for example,
glReadPixels(0,0,width,height,GL_RGB,GL_UNSIGNED_BYTE,imagebuffer);
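where imagebuffer needs 3 bytes per pixel, and both reads should be issued only after the scene has been drawn. Two further details that are easy to trip over, assuming a double buffered window so that GL_BACK holds the freshly rendered frame:

unsigned char *imagebuffer = malloc(width*height*3);
glPixelStorei(GL_PACK_ALIGNMENT,1);   /* rows come back unpadded, even for odd widths */
glReadBuffer(GL_BACK);                /* read from the buffer that was just drawn to  */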

Composited image
This is the final image; the images on the left are combined depending on the values in their depth buffers. That is, each pixel in the final image comes from whichever of the three partial images has the lowest depth value at that pixel. For example, the pixels of the green cone that protrudes from the center to the right have lower depth values than the pixels in the same region belonging to the magnetic field lines, so the cone appears in front.

Limitation

The fundamental problem with this technique is bandwidth. Consider rendering an 800 x 600 RGB model at 20 frames per second. There are 3 bytes for each image pixel and 4 bytes for each depth buffer pixel, so transmitting an image/depth buffer pair from one machine to another requires (3 + 4) * 800 * 600 bytes, or just over 3MB. For interactive performance at 20 frames per second this amounts to a bandwidth of around 67MB per second, clearly more than the capabilities of all but the very highest performance networks. To make matters worse, the total bandwidth into the controlling machine grows with the number of OpenGL cards participating in the rendering, although this bottleneck can be reduced by arranging the OpenGL cards/machines in a tree structure and combining the image/depth buffer pairs as the image pieces move up the tree, as sketched below.
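
One way such a tree might be arranged, using the same MPI style as above and the Merge() routine from earlier (here every machine, numbered rank = 0..N-1, renders a share, and subrgb/subdepth are assumed to be scratch buffers allocated alongside rgb and depth): the machines pair off over roughly log2(N) rounds, at each round half of them send their current image/depth pair to a partner and drop out, and the partner merges that pair into its own buffers.

/* Binary tree compositing: after the loop, rank 0 holds the final image */
int stride;
MPI_Status status;
for (stride=1;stride<N;stride*=2) {
   if (rank & stride) {              /* sender: ship the pair and stop */
      MPI_Send(rgb,3*width*height,MPI_UNSIGNED_CHAR,rank-stride,1,MPI_COMM_WORLD);
      MPI_Send(depth,width*height,MPI_FLOAT,rank-stride,2,MPI_COMM_WORLD);
      break;
   } else if (rank+stride < N) {     /* receiver: merge the partner's pair */
      MPI_Recv(subrgb,3*width*height,MPI_UNSIGNED_CHAR,rank+stride,1,MPI_COMM_WORLD,&status);
      MPI_Recv(subdepth,width*height,MPI_FLOAT,rank+stride,2,MPI_COMM_WORLD,&status);
      Merge(rgb,depth,subrgb,subdepth,width,height);
   }
}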
