VFleet Timings for the Cray T3D

This document describes optimizations and timing results for VFleet on the Cray T3D. The version of VFleet described is somewhere between version 1.0 and 1.1; these optimizations are not available in the 1.0 version. All measurements taken here were made in late June and early July of 1995.

Factors in Parallelizing VFleet

VFleet is a parallel volume renderer. When rendering in parallel, the volume to be rendered is divided by recursive bisection. In the current implementation, the subdivisions are of equal size. Each sub-volume is then rendered using a ray caster featuring various acceleration techniques, including early ray termination and an octree accuracy hierarchy. The images of the subvolumes produced are then composited together hierarchically to produce an image of the total volume.

Thus there are two phases to the rendering process. The ray casting phase happens independently on each processor, and thus is embarassingly parallel. The compositing phase happens hierarchically, and so will use the machine less efficiently than the ray casting process. (We omit from this discussion the process of loading data and the process of applying a transfer function to the raw data to produce a volume of color samples).

Several apparent inefficiencies arise from the use of acceleration techniques in VFleet. One is due to early ray termination. When a ray strikes an opaque surface, the final color of the ray is known and no further calculation along that ray is necessary. Thus if some opaque body is being rendered, the volume beyond the near surface of the body need never be traced. When this rendering task is done in parallel, however, processors must still be assigned to the hidden volume. Because the hidden volume changes every time the viewpoint changes, it is impractical to try to avoid assigning processors there. This inefficiency limits speedup to a factor of around 4 on 8 processors, as roughly half the processors will fall in the "visible" region at any given time. This can be partially overcome by dividing into more subvolumes than there are processors, but communications and algorithm limitations with the T3D implementation make this difficult.

Another inefficiency arises simply because acceleration techniques are used in the rendering of the subvolumes. Some subvolumes are simpler than others; the acceleration techniques allow those subvolumes to finish rendering sooner. This leads to load imbalances. Worse still, if one subvolume is very difficult to render, the acceleration techniques can actually make it render move slowly than would be possible otherwise. Since every subvolume must finish before the final compositing operation can occur, the rendering time will be at least as long as the slowest render.

T3D vs. Workstation Cluster Performance

On a workstation cluster connected by a low speed network, speedup for general renders using VFleet seems to top out around a factor of 3 on 8 processors. Beyond that point, increasing communication time for the subvolume images being composited offsets gains in rendering time. The high speed communication mechanisms of the T3D avoid this barrier.

Optimizations for the Cray T3D

In addition to a lot of general code tuning, the following T3D-specific optimizations were made:
  • Mods to avoid "select" operations by avoiding listening for PVM messages from beyond the T3D during rendering.
  • Use of shmem communications in passing images for compositing.
  • Optimization of compositing for the T3D.
  • Use of the Perflib SQRTI() routine in normalizing vectors.
Another optimization that was tried and rejected involved using cache prefetching to accelerate compositing. It was found that prefetching forced the loading of pixels which due to special cases (the front pixel being opaque, for example) did not actually need to be loaded.

Timing Results

One particular problem with optimizing for the T3D was that tiny changes to the code, like adding a diagnostic, could increase or decrease performance by 10% or so. Since this is larger than the effects of some optimizations, it is not always clear when progress is being made. Bandwidth between the T3D and C90 also varies greatly and unpredictably. As a result, the timings below should be taken with a grain of salt.

Three test cases were examined.

  • Buckyball iso, a 256x256x256 dataset showing the wave function of a carbon 60 molecule.
  • Engine opaque, a 128x128x128 dataset showing a CT scan of an engine block. The opacity of this image has been set very high to highlight problems caused by early ray termination.
  • Cosmology, a 128x128x128 dataset showing structure formation in the early universe. This dataset is semi-opaque and filamentary, so it is slow to render. It also occupies the entire volume, so load balancing is less of a problem in this case.
In each case the image was rendered at 300x300 resolution.

A speedup chart of these cases follows. These timings are for rendering computations within the T3D only. Application start-up times, times to compute the volume of color samples, and network transit time to move the final image from the C90 back to the workstation are excluded. CPU times are measured with the cpused() subroutine; wall clock times are measured with rtclock().

(The "Buckyball iso" dataset is too large to run on less than 8 PEs, and so is scaled differently).

The speed-ups seem to be in the range expected (around 4 on 8 PEs) when the number of processors is small, but above 16 PEs speed-up is not as good. We examine the "Engine opaque" in more detail to see why.

The following graph shows statistics for particular components of the rendering process for "Engine opaque" for higher numbers of PEs:

The top curve in this graph shows wall clock time for PE 0; the curve below shows total CPU time for PE 0. The interval between the two curves represents time spent on system calls, and most of it is probably attributable to the time needed for PVM to transmit the final output image (approximately 350K bytes) to the C90. This time can vary unpredictably from around 0.2 sec to over 1 second. In some cases this time can actually dominate execution time.

The other curves represent CPU times, rather than wall clock times.

The compositing time curve increases only very slowly, because the compositing operation is set up so that early compositions in the hierarchy (which involve only small obscured areas) run fairly quickly. Compositing time varies from about 0.4 to 0.5 second as the number of PEs is increased from 2 to 64.

The three blue curves show the minimum, maximum, and average time each processor spends on actual ray casting. The difference between minimum and maximum shows a large load imbalance. This imbalance is also reflected in the PE 0 wait time curve. Any time a load imbalance occurs, some processor must wait before finishing compositing, but which processor waits depends on the details of the model and the layout of subvolumes across PEs.

The following graph shows speedup for the ray casting operation only:

The load balancing problem is obvious here. While the average curve shows a speed-up in the expected range (26.2 over 64 PEs or 5.5 over 8 PEs), the load imbalance severely limits the maximum expected speed. The bottom curve shows a speed-up of 16.7 over 64 PEs or 4.1 over 8 PEs, and this curve will set the limit of overall performance.

CPU Time Breakdown and Further Optimizations

In general, a processor's total CPU time is the sum of ray casting time, compositing time, and wait time. PE 0 has the largest compositing load, so we examine that processor in the 64 PE "Engine Opaque" case. The specific CPU times for PE 0 in this case are:

0.98 sec. ray cast
0.46 sec. comp
0.93 sec. wait
----
2.37 sec. total CPU

The biggest improvement in this would be to find a way to balance the load. Ideally, this would eliminate the wait time and change the ray cast time to the average value for this run, which is 1.15 sec. This would produce a 32% speedup, or a CPU time of 67% or the original value. Unfortunately any load balancing would involve repartitioning the volume in a way that would be dependent on the mapping of data to colors and opacities, and so the load would have to be re-balanced frequently, at considerable computational expense.

Ma et. al. have described a method for improving parallelization of the compositing step in their paper A Data-Distributed, Parallel Algorithm for Ray-Traced Volume Rendering in the 1993 Parallel Rendering Symposium Proceedings. This method would be fairly easy to implement for the T3D, and would essentially eliminate compositing time. If this alone were done, it would produce a 19% speedup, and a rendering time of about 81% of the original value.

If both load balancing and Ma's compositing algorithm were implemented, the speedup would be 51%, for a CPU time of 49% of the original value. Better still, the number of processors that could be productively applied to the problem would be considerably increased.

Unfortunately this plan provides no method to control the run time variability and the time lost to slow system calls (seen as differences between CPU and wall clock time). As rendering time approaches these values further optimizations become futile.

VFleet Timings for the Cray T3D / Joel Welling, Pittsburgh Supercomputing Center / welling@psc.edu