Load Balancing

For every pass of the rendering algorithm, a list of blocks is distributed among the available processors. A straight-forward approach for this distribution is to simply divide the list into equally sized portions. This works well if the algorithm does not use any optimizations for skipping empty space. However, employing the optimizations presented in the previous sections can lead to a very unequal distribution, and thus, reduce the performance gains due to using multiple CPUs.

We use a simple estimate to determine the computational effort required for processing a whole block. For a list of $N$ blocks $L=\{B_0,...,B_{N-1}\}$ we compute for each block $B \in L$:


\begin{displaymath}
load(B) = \frac{{rays(B )}}{{1 + C_t \cdot transparency(B)}}
\end{displaymath} (3.16)

$rays(B_i)$ is the number of rays entering the block $B_i$ and $transparency(B_i)$ is a measure for the number of transparent samples in the block. An estimate measure for the transparency can be retrieved from the classified min-max octree. Depending on how many octree levels are taken into account, the estimate will become more accurate. For example, a simple way to obtain such a measure is to sum up the transparency information of the top-level octree nodes by counting fully transparent nodes as $\frac{1}{8}$, fully opaque nodes as $0$, and inhomogeneous nodes as $\frac{1}{16}$. The factor $C_t$ determines the fraction of processing cost between a fully opaque and a fully transparent block. If $C_t$ is 0, the cost of processing an opaque and a transparent block is the same, if $C_t$ is 1, the cost of processing a fully transparent block is half the cost of processing a fully opaque block.

We then want to find a partition of $L = L_0 \cup ... \cup L_{M - 1}$ for $M$ CPUs, so that the workload for each CPU is approximately equal. We sort all blocks $B \in L$ in descending order of $load(B)$. A greedy algorithm then traverses the sorted list and always assigns a block to the CPU with the smallest current load, until all blocks have been assigned to a CPU. This approximative algorithm results in better balanced loads while introducing little overhead.