Various distributed rendering examples

The following are examples of distributed rendering of either single high resolution images or animation sequences.

Compiled by Paul Bourke


Spiral Vase

Model and animation by Dennis Miller
Copyright © 1999

Software: PovRay and MPI
Custom distributed animation tools by Paul Bourke
CPU cycles from a 64 node Dec Alpha farm
September 1999

There are a number of ways one can render an animation using a standard package (such as PovRay) and exploit a collection of computers. For efficiency a prime objective is to ensure the machines are all kept busy for the time it takes to render the entire animation. The most straightforward way of distributing the frames is to send the first N frames to the first computer, the second N frames to the next computer and so on, where N is the total number of frames divided by the number of computers available.

The problem with this approach is that most animations don't have a uniform rendering time per frame. The above example illustrates this point: the frames at the start of the animation take seconds to render while other frames take many minutes. Sending contiguous time chunks to each computer means that many of the computers will stand idle while the animation will not be complete until those processing the more demanding pieces are finished. A much more balanced approach is to interleave the frames as illustrated below.

There are a number of ways this can be accomplished using PovRay. The method chosen here was to create an ini file containing all relevant/required settings including Initial_Clock, Final_Clock, Initial_Frame, and Final_Frame. Scripts are then created that invoke PovRay on each machine (using rsh) with +SFn and +EFn command line arguments for the appropriate value of n.
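As a concrete illustration, the interleaved distribution can be generated with a few lines of C that emit one rsh/PovRay command per frame, each rendering a single frame via +SF and +EF. This is only a sketch; the host names, frame count, and ini file name below are placeholders, not the scripts actually used for these animations.

/* Sketch of the interleaved frame distribution described above: machine m
   is given frames m, m+M, m+2M, ... and each frame is rendered by a single
   POV-Ray invocation using +SFn +EFn.  Host names, the ini file name, and
   the use of rsh are illustrative only. */
#include <stdio.h>

#define NFRAMES   300   /* total frames in the animation */
#define NMACHINES  16   /* machines available            */

int main(void)
{
   int m, f;

   for (m = 0; m < NMACHINES; m++) {
      printf("# commands for machine%02d\n", m);
      for (f = m + 1; f <= NFRAMES; f += NMACHINES)
         printf("rsh machine%02d povray animation.ini +SF%d +EF%d\n",
                m, f, f);
   }
   return 0;
}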

An alternative is to develop tools using one of the parallel processing libraries available, for example, MPI or PVM. Using this approach a "master" process can additionally check the load on the available machines and only distribute the next frame given some suitably low value. This can be particularly important in a multi-user environment where simply "nicing" the rendering may not be user friendly enough, for example, when the renderings require significant resources such as memory.
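The following is a minimal MPI master/worker sketch of that idea. It is not the tool used for the animations on this page and it omits the load checking described above; it simply shows how a master can hand out one frame at a time so that slower or busier nodes naturally receive fewer frames. The povray command line is illustrative.

/* Minimal MPI master/worker sketch: the master issues one frame at a time
   and only sends another when a worker reports back.  A frame number of -1
   tells a worker there is no more work. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NFRAMES  300
#define TAG_WORK   1
#define TAG_DONE   2

int main(int argc, char **argv)
{
   int rank, size, frame, next = 1, finished = 0, w;
   MPI_Status status;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   if (rank == 0) {                             /* master */
      for (w = 1; w < size; w++) {              /* prime every worker */
         frame = (next <= NFRAMES) ? next++ : -1;
         MPI_Send(&frame, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
         if (frame < 0)
            finished++;
      }
      while (finished < size - 1) {
         MPI_Recv(&frame, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                  MPI_COMM_WORLD, &status);
         frame = (next <= NFRAMES) ? next++ : -1;
         MPI_Send(&frame, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                  MPI_COMM_WORLD);
         if (frame < 0)
            finished++;
      }
   } else {                                     /* worker */
      char cmd[256];
      for (;;) {
         MPI_Recv(&frame, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD, &status);
         if (frame < 0)                         /* no more work */
            break;
         sprintf(cmd, "povray animation.ini +SF%d +EF%d", frame, frame);
         system(cmd);                           /* render this single frame */
         MPI_Send(&frame, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
      }
   }
   MPI_Finalize();
   return 0;
}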




Waves

Model and animation by Dennis Miller
Copyright © 1999

Software: PovRay and MPI
Custom distributed animation tools by Paul Bourke
CPU cycles from a 64 node Dec Alpha farm
September 1999





Water Sun

Model and animation by Dennis Miller
Copyright © 2000

Software: MegaPov and MPI
Custom distributed animation tools by Paul Bourke
CPU cycles from a 64 node Dec Alpha farm
April 2000




Spacecraft Hangar

Model and animation by Justin Watkins
"Scout craft taking off from a maintenance hangar on Luna Base 4"
Copyright © 1999

Software: PovRay and MPI
Custom distributed animation tools by Paul Bourke
CPU cycles from a 64 node Dec Alpha farm
September 1999




Glass Cloud

Model and animation by Morgan Larch
Copyright © 1999

Software: PovRay and MPI
Custom distributed animation tools by Paul Bourke
CPU cycles from a 64 node Dec Alpha farm
September 1999




Addict

Modelling/rendering by Rob Richens
Copyright © 2001

Concept and direction by Dale Kern.
Rendered using PovRay
Additional software: Poser 2
Trees, Makelamp, and Grass macros by Gilles Tran.
Spray macro by Chris Colefax.

Distributed rendering by Paul Bourke
Using locally developed scripts and CPU cycles on a 32 processor PIII and 64 processor Dec Alpha farm.
Astrophysics and Supercomputing, Swinburne University of Technology.
November 2001




Escape

Modelling/rendering by Rob Richens
Copyright © 2006

Concept and direction by Dale Kern.
Rendered using PovRay, Additional software: Poser 2

Distributed rendering by Paul Bourke
Using locally developed scripts and the PBS batch queue system.
The image is rendered across a large number of processors
as a series of narrow strips which are reassembled at the end.
CPU cycles from a 176 CPU SGI Altix, IVEC, Western Australia.
November 2006




Parallel Rendering

Using POVRAY on a Computer Cluster

Written by Paul Bourke
Example model courtesy of Stefan Viljoen
Trees by Paul Dawson, lens flare and city by Chris Colefax.
August 1999

Given a number of computers and a demanding POVRAY scene to render, there are a number of techniques to distribute the rendering among the available resources. If one is rendering an animation then obviously each computer can render a subset of the total number of frames. The frames can be sent to each computer in contiguous chunks or in an interleaved order; in either case a preview (every N'th frame) of the animation can generally be viewed as the frames are being computed. Typically an interleaved order is preferable since parts of the animation that may be more computationally demanding are split more evenly over the available computing resources.

In many cases even single frames can take a significant time to render. This can occur for all sorts of reasons: complicated geometry, sophisticated lighting (eg: radiosity), high antialiasing rates, or simply a large image size. The usual way to render such scenes on a collection of computers is to split the final image up into pieces, render each piece on a different computer, and stick the pieces together at the end. POVRAY supports this via the following ini file directives

    Height=n
    Width=n
    Start_Row=n
    End_Row=n
    Start_Column=n
    End_Column=n

There are a number of different ways an image may be split up: by row, by column, or in a checker pattern.

It turns out that it is normally easier to split the image up into chunks by row; these are the easiest to paste together automatically at the end of the rendering. Unfortunately, the easiest file formats to deal with in code are TGA and PPM, and in both cases writing a "nice" utility to patch the row chunks together is frustrated by an error POVRAY makes when writing the images for partial frames. If the whole image is 800 x 600, say, and we render rows 100 to 119, the PPM header should state that the file is 800 x 20. Unfortunately it states that the image is 800 x 600, which is obviously wrong and causes most image reading programs to fail! I'd like to hear any justification there might be for this apparently trivial error by POVRAY. [A fix for this has been submitted by Jean-François Wauthy for PNG files, see pngfix.c]

For example the ini file might contain the following
    Height=600
    Width=800
    Start_Row=100
    End_Row=119

A crude C utility to patch together a collection of PPM files of the form filename_nnnn.ppm is given here (combineppm.c). You can easily modify it for any file naming convention you choose that differs from the one used here.
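For reference, a minimal sketch of such a joiner is shown below. It is not combineppm.c itself, and the file naming and lack of error handling are purely illustrative. It exploits the quirk noted above: since each chunk's header states the full image dimensions, the first chunk's header can be reused for the combined file, while the true number of pixel bytes in each chunk is deduced from the file length.

/* Sketch of a PPM row-chunk joiner (illustrative, not combineppm.c).
   Each chunk is a binary P6 file whose header, because of the POV-Ray
   quirk described above, states the FULL image dimensions. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   int i, c, width, height, depth, nchunks;
   long datastart, datasize;
   char fname[256];
   FILE *in, *out;

   if (argc < 3) {
      fprintf(stderr, "usage: %s basename nchunks\n", argv[0]);
      exit(1);
   }
   nchunks = atoi(argv[2]);
   out = fopen("combined.ppm", "wb");

   for (i = 0; i < nchunks; i++) {
      sprintf(fname, "%s_%04d.ppm", argv[1], i);
      if ((in = fopen(fname, "rb")) == NULL) {
         fprintf(stderr, "Failed to open %s\n", fname);
         exit(1);
      }
      /* Read the header (no # comments expected from POV-Ray output). */
      fscanf(in, "P6 %d %d %d", &width, &height, &depth);
      fgetc(in);                          /* single whitespace after maxval */
      datastart = ftell(in);
      fseek(in, 0L, SEEK_END);
      datasize = ftell(in) - datastart;   /* actual pixel bytes in this chunk */
      fseek(in, datastart, SEEK_SET);

      if (i == 0)                         /* header already claims the full image size */
         fprintf(out, "P6\n%d %d\n%d\n", width, height, depth);

      while (datasize-- > 0 && (c = fgetc(in)) != EOF)
         fputc(c, out);
      fclose(in);
   }
   fclose(out);
   return 0;
}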

So, the basic procedure if you have N machines on which to render your scene is to create N ini files, each one rendering the appropriate row chunk. Each ini file creates one output image file, in this case, in PPM format. When all the row chunks are rendered the image files are stuck together.

How you submit the ini files to the available machines will be left up to the reader as it is likely to be slightly different in each environment. Two common methods are: using rsh with the povray command line, or writing a simple application with a parallel library such as MPI or PVM. The latter has the advantage that it can offer a degree of automatic error recovery if a row chunk fails to render for some reason.

The following crude C code (makeset.c) illustrates how the ini files might be automatically created. Of course you can add any other options you like to the ini file. The basic arithmetic for the start and end rows of chunk i is given below; note that POVRAY starts its numbering from row 1, not 0!

    Start_Row=(i * HEIGHT) / N + 1
    End_Row=((i + 1) * HEIGHT) / N
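For example, a generator in the spirit of makeset.c (not that utility itself; the scene and output file names below are only illustrative) might look like this:

/* Sketch of an ini-file generator along the lines of makeset.c.
   WIDTH, HEIGHT, N and the file names are illustrative. */
#include <stdio.h>

#define WIDTH  1200
#define HEIGHT  900
#define N        16     /* number of row chunks / machines */

int main(void)
{
   int i;
   char fname[64];
   FILE *f;

   for (i = 0; i < N; i++) {
      sprintf(fname, "chunk_%04d.ini", i);
      f = fopen(fname, "w");
      fprintf(f, "Input_File_Name=overf.pov\n");
      fprintf(f, "Output_File_Name=overf_%04d.ppm\n", i);
      fprintf(f, "Output_File_Type=P\n");
      fprintf(f, "Width=%d\nHeight=%d\n", WIDTH, HEIGHT);
      /* POV-Ray numbers rows from 1, hence the +1 */
      fprintf(f, "Start_Row=%d\n", (i * HEIGHT) / N + 1);
      fprintf(f, "End_Row=%d\n", ((i + 1) * HEIGHT) / N);
      fclose(f);
   }
   return 0;
}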

With regard to performance and efficiency: if each row chunk takes about the same time to render then one gets a linear improvement with the number of processors available. (This assumes that the rendering time is long compared to the scene loading time.) Unfortunately this is not always the case; the rows across the sky might take a trivial length of time to render while the rows that intersect the interesting part of the scene might take a lot longer. In this case the machines that rendered the fast portions will stand idle for most of the time. For example, consider the following scene rendered simultaneously in equal height slices on 48 machines; there is over a factor of 1000 difference in rendering time between the dark row chunks and the lighter shaded row chunks. In this case the cause is easy to determine: any row with a tree blows out the rendering time.

One way around this is to split the scene into many more row chunks than there are computers and write a utility that submits jobs to machines as they become free. This isn't hard to do if you base your rendering around rsh and it is reasonably easy with MPI or PVM once you understand how to use those libraries.

Another and perhaps neater way is to create row chunks of different heights according to the estimated rendering time. The script used to render the final image can be modified to render a much smaller image with the same number of row chunks. An estimate of the time for the different rows can be used to create narrow row chunks in the complex regions and wide row chunks in the fast rendering regions. So, for the example above, the bottom part of the image might be rendered with row chunks only a few pixels high while the top portion would be rendered with much taller chunks.

Note
There is an inefficiency as the chunks become narrower and antialiasing is used. PovRay will normally reuse traced rays when it can for adjacent pixels; for example, the filled circles below are only calculated once when PovRay is calculating the pixels shown (blue), but they will be calculated twice if the image is split and the pixels calculated separately.

All these techniques assume one has a scene that takes a significant time to render. The experimentation described above was performed on a scene provided by Stefan Viljoen which, for fairly obvious reasons (trees), is extremely CPU demanding. If you would like to experiment yourself, the model files are provided here (overf.tar.gz). The scene, as provided and rendered with the following ini file, took just over 16 hours on a farm of 48 identical DEC XP1000 workstations.

    Width=1200
    Height=900
    Antialias=On
    Antialias_Threshold=0.3
    Output_File_Name=overf.ppm
    Input_File_Name=overf.pov
    Output_File_Type=P
    Quality=9
    Radiosity=off
A reduced version of the rendering is shown below




Load Balancing for Distributed PovRay Rendering

Written by Paul Bourke

MegaPovPlus model graciously contributed by Gena Obukhov.

Rendered using MegaPovPlus on the
Swinburne University Astrophysics and Supercomputing farm of
64+ Dec Alpha processors.

August 2000

When distributing rendering among various machines in a farm, cluster, or SMP machine it is critical to get the most out of the available processors. Keeping all processors busy is known as "load balancing".

To illustrate why some sort of load balancing is desirable, consider the following image rendered in PovRay. The whole image at a reasonable resolution takes on the order of hundreds of hours to render; the details aren't important, suffice it to say that there are often scenes that require significant rendering time.

One way to render this scene on a number of computers is to split it into N column strips and send each strip to a different machine. In this example the strips are shown in red; each is 1/16 of the image wide since the scene is distributed to 16 machines. The strips, as they are completed, are stitched back together to form the complete image. The first thing one finds with this and any other scheme for splitting up an image is that some pieces render faster than others, and often the differences are significant.

The profile at the top of the above image gives the approximate rendering time for each strip. For this scene it is easy to see that the rendering time depends on the number of "bubbles" in the strip. So, by splitting the image into these equal size portions the whole image isn't ready until the slowest strip has finished, in this case the second strip from the left. More importantly, the machines rendering the "easy" strips are sitting idle (a mortal sin in parallel processing)!

A straightforward way of remedying the situation is to divide up the strips such that their width is inversely related to the rendering time (time consuming strips are narrow and less time consuming strips are wide). This is shown below along with the resulting new time profile in green. The flatter the profile the better the load balancing and the sooner the rendering is completed.

The only remaining issue is how one determines or estimates the time profile and therefore the width of each strip. The approach taken was to pre-render a small image, say 64 pixels wide, in one pixel wide columns, timing each column. The time estimates from this preview were fitted with a Bezier curve which was then used to compute the strip widths for the large rendering. A continuous fit such as a Bezier is required since the strip widths are most easily determined by integrating the time curve and splitting it into N equal area segments.
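A simplified sketch of the equal-area splitting step is shown below. It uses a plain cumulative sum of the per-column preview times rather than the Bezier fit described above, and the preview timing profile it builds is fabricated purely for illustration; the image width and strip count are also only examples.

/* Sketch of equal-area strip splitting: given per-column time estimates
   from a small preview render, choose strip boundaries so that each strip
   carries roughly the same total rendering time. */
#include <stdio.h>

#define PREVIEW_COLS  64   /* width of the preview render  */
#define FULL_WIDTH  1200   /* width of the final render    */
#define NSTRIPS       16   /* number of machines / strips  */

int main(void)
{
   double t[PREVIEW_COLS];
   double total = 0, target, acc = 0;
   int i, strip = 1, start = 1;

   for (i = 0; i < PREVIEW_COLS; i++) {
      t[i] = 1 + (i % 8);              /* fabricated preview timings (seconds) */
      total += t[i];
   }
   target = total / NSTRIPS;           /* equal share of the total time */

   for (i = 0; i < PREVIEW_COLS; i++) {
      acc += t[i];
      if ((strip < NSTRIPS && acc >= strip * target) || i == PREVIEW_COLS - 1) {
         /* scale the preview column index up to full-image columns */
         int end = ((i + 1) * FULL_WIDTH) / PREVIEW_COLS;
         printf("Strip %2d: Start_Column=%d End_Column=%d\n", strip, start, end);
         start = end + 1;
         strip++;
      }
   }
   return 0;
}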

More processors isn't always better

As a word of warning, when distributing strips to machines in a rendering farm, more machines doesn't necessarily mean faster rendering times. This can come about when the model is large and the model files reside on one central disk. Consider a scene that takes 2 hours to render and 1 minute to load. As the number of machines (N) is increased the rendering time (assuming perfect load balancing) drops by 1/N, but the model loading is limited by the disk bandwidth and, once that is saturated, the rendering machines will wait for a time proportional to N. This is a slightly surprising result and is relevant to many rendering projects where the scene descriptions (especially textures) can be large. Note that a common option is to stagger the times at which the rendering machines start so they are not all reading from disk at once. While this certainly helps the performance of the individual nodes it doesn't improve the overall rendering time.
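If one assumes, purely for a rough estimate, that the scene loading becomes completely serialised by the shared disk, the total time behaves like N x (load time) + (render time)/N. The few lines of C below (not part of the original tooling) simply tabulate that expression for the 2 hour / 1 minute example; under this assumption the numbers stop improving after roughly a dozen machines and then get worse.

/* Rough illustration of the shared-disk argument above, assuming fully
   serialised scene loading: total ~ N * load + render / N. */
#include <stdio.h>

int main(void)
{
   double render = 120.0, load = 1.0;   /* minutes: 2 hour render, 1 minute load */
   int n;

   for (n = 1; n <= 64; n *= 2)
      printf("N = %2d  ->  approximately %5.1f minutes\n",
             n, n * load + render / n);
   return 0;
}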

Antialiasing warning

There is a further consideration when rendering strips in PovRay, namely antialiasing. The default antialiasing mode is "type 1", which is adaptive, non-recursive super-sampling. A quote from the manual: "POV-Ray initially traces one ray per pixel. If the colour of a pixel differs from its neighbours (to the left or above) by more than a threshold value then the pixel is super-sampled by shooting a given, fixed number of additional rays." The important thing here is that this form of antialiasing isn't symmetric but depends on the left/top transition. If it is used when rendering the strips the images won't combine perfectly when joined together. This can be seen below: the scene is split into two halves, the image on the left uses type 1 and the one on the right uses type 2; note the thin band at the seam in the image on the left.


Sampling_Method=1

Sampling_Method=2

Another solution is to simply render the image at a higher resolution without any antialiasing and subsample the final image with an appropriate filter, eg: Gaussian.




Insomnia

First Place Winner, July-August 2002
Internet Raytracing Competition

Design and model (PovRay 3.5) by Gena Obukhov
August 2002

Rendered by Paul Bourke at
Swinburne University Astrophysics and Supercomputing.

Radiosity is well known to be a CPU intensive process, and this model using radiosity in PovRay 3.5 is certainly no exception. It was rendered across 110 (mostly P4) processors running Linux which form the Swinburne Astrophysics cluster. The final image rendered at 1024x768 (with a reasonable level of antialiasing) took about 2 hours. Load balancing was achieved by distributing 2 pixel rows at a time to each processor; all the processors finished within 5 minutes of each other. The software to achieve this was developed in-house; it consists of scripts to create multiple PovRay command files (essentially strips made by +SRnnnn +ERnnnn), a batch system that distributes these command files to the nodes of the cluster as they become idle, and finally a script to put all the image pieces together at the end.

PovRay 3.5 radiosity settings
global_settings {
   radiosity {
      pretrace_start 0.08
      pretrace_end   0.01
      count 500
      nearest_count 10
      error_bound 0.02
      recursion_limit 1
      low_error_factor 0.2
      gray_threshold 0.0
      minimum_reuse 0.015
      brightness 1
      adc_bailout 0.01/2
   }
}
sky_sphere{
   pigment { color rgb <0.1,0.4,1>*1.45 }
}
 
No radiosity


Radiosity

(Click above for higher resolution version)