Addressing

In raycasting the eight samples closest to the current resample location are required in every processing step when trilinear interpolation is used. In a linear volume layout, these samples can be addressed by adding constant offsets to one known address. The necessary address computations are given in Algorithm 1.


\begin{algorithm}
% latex2html id marker 777\footnotesize
\caption{ComputeSa...
...{x} + V_{x} \cdot V_{y}$\\
\end{tabular} \end{algorithmic}\par
\end{algorithm}

$V_{\{x,y,z\}}$ are the volume dimensions and $i$, $j$, $k$ are the integer parts of the current resample position. This addressing scheme is very efficient. Once the lower left sample is determined the other needed samples can be accessed just by adding an offset.

The evolution of CPU design shows that the length of CPU pipelines grows progressively. This is very efficient as long as conditional branches do not initiate pipeline flushes. Once a long instruction pipeline is flushed there is a significant delay until it is refilled. Most of the present systems use branch prediction. The CPU normally assumes that if-branches will always be executed. It executes the if-branch before actually checking the outcome of the if-clause. If the if-clause returns false, the else-branch has to be executed. This means that the CPU flushes the pipeline and refills it with the else-branch. This is very time consuming, due to the increasing size of the pipelines.


\begin{algorithm}
% latex2html id marker 817\footnotesize
\caption{ComputeAd...
...} $(u \cdot B_x \cdot B_y \cdot B_z + v)$
\end{algorithmic}\par
\end{algorithm}


\begin{algorithm}
% latex2html id marker 830\footnotesize
\caption{ComputeSa...
...d}($i+1$,$j+1$,$k+1$)\\
\end{tabular} \ENDIF
\end{algorithmic}\end{algorithm}

Using a bricked volume layout one will encounter this problem. The addressing of data in a bricked volume layout is more costly than in a linear volume layout. To address one data element, one has to address the block itself and the element within the block. The necessary address computation is given in Algorithm 2. $B_{\{x,y,z\}}$ are block dimensions, $V_{\{x,y,z\}}$ are the volume dimensions, $i$, $j$, and $k$ are the integer parts of the current resample position.

In contrast to this addressing scheme, a linear volume can be seen as one large block. To address a sample it is enough to compute just one offset. In algorithms like volume raycasting, which need to address a certain neighborhood of data in each processing step, the computation of two offsets has a considerable impact on performance. To avoid this performance penalty, one can construct an if-else statement (see Algorithm 3). The if-clause checks if the needed data elements can be addressed within one block. If the outcome is true, the data elements can be addressed as fast as in a linear volume. If the outcome is false, the costly address calculations have to be done. This simplifies address calculation, but the involved if-else statement incurs pipeline flushes. In the following, we address this problem.

Figure 3.9: Access patterns during resampling and gradient computation. (a) typical access pattern during resampling (8-neighborhood). (b) typical access pattern during gradient computation (26-neighborhood).
\includegraphics{algorithm/images/samplelut} \includegraphics{algorithm/images/gradientlut}
(a) (b)

To avoid the costly if-else statement and the expensive address computations, one can create a lookup table to address all the needed samples. The straight-forward approach would be to create a lookup table for each possible sample position in a block. For a block of $32^3$ this would lead to $32^3$ different lookup tables to address the neighboring samples. In the resampling case, 7 neighbors need to be addressed - accordingly the size of the lookup tables would be $32^3 \cdot 7 \cdot 4$ bytes = 896 KB (4 bytes per offset). For accessing a 26-neighborhood a table of $32^3 \cdot 26 \cdot 4$ bytes = 3.25 MB (4 Bytes per offset) would be required. Such a large lookup table is not preferable, due to the limited size of cache. However, the addressing of such a lookup table would be straight-forward, because the indices in the lookup table would be the corresponding offsets of the current sample position, assuming the offset is given relative to the block memory address.

To reduce the size of the lookup table, the possible sample positions can be distinguished by the locations of the needed neighboring samples. The first sample location $(i, j, k)$ is defined by the integer parts of the current resample position. Assuming trilinear interpolation, during resampling neighboring samples to the right, top, and back of the current location are required. A block can be subdivided into subsets. For each subset, we can determine the blocks in which the neighboring samples lie. Therefore, it is possible to store these offsets in a lookup table [9]. This is illustrated in Figure 3.9 (a). We see that there are four basic cases, which can be derived from the current sample location. This can be mapped straightforwardly to 3D, which leads to eight distinct cases. These eight cases are shown in Table 3.2.


Table 3.2: Cases for the position within a block for 8-neighborhood addressing. For every case, the range of each component of the three-dimensional position within a block is displayed.
Case $(i \bmod B_x) \in$ $(j \bmod B_y) \in$ $(k \bmod B_z) \in$
0 $\{0,...,B_{x}-2\}$ $\{0,...,B_{y}-2\}$ $\{0,...,B_{z}-2\}$
1 $\{0,...,B_{x}-2\}$ $\{0,...,B_{y}-2\}$ $\{B_{z}-1\}$
2 $\{0,...,B_{x}-2\}$ $\{B_{y}-1\}$ $\{0,...,$B$_{z}-2\}$
3 $\{0,...,B_{x}-2\}$ $\{B_{y}-1\}$ $\{B_{z}-1\}$
4 $\{B_{x}-1\}$ $\{0,...,$B$_{y}-2\}$ $\{0,...,$B$_{z}-2\}$
5 $\{B_{x}-1\}$ $\{0,...,$B$_{y}-2\}$ $\{B_{z}-1\}$
6 $\{B_{x}-1\}$ $\{B_{y}-1\}$ $\{0,...,B_{z}-2\}$
7 $\{B_{x}-1\}$ $\{B_{y}-1\}$ $\{B_{z}-1\}$


In the following, we construct a function to efficiently address the lookup table. The input parameters of the lookup table addressing function are the sample position $(i, j, k)$ and the block dimensions $B_x$, $B_y$, and $B_z$. We assume that the block dimensions are a power of two, i.e., $B_x = 2^{N_x}$, $B_y = 2^{N_y}$, and $B_z = 2^{N_z}$. As a first step, the block offset parts of $i$, $j$, and $k$ are extracted by a conjunction with the corresponding $B_{\{x, y, z\}}-1$. The next step is to increase all by one to move the maximum possible value of $B_{\{x, y, z\}}-1$ to $B_{\{x,y,z\}}$. All the other possible values stay within the range $[1,B_{\{x, y, z\}}-1]$. Then a conjunction of the resulting value and the complement of $B_{\{x, y, z\}}-1$ is performed, which maps the input values to $[0, B_{\{x, y, z\}}]$. The last step is to add the three values and divide the result by the minimum of the block dimensions, which maps the result to [0,7]. This last division can be exchanged by a shift operation. Algorithm 4 shows the lookup table index computation and Algorithm 5 shows how the neighboring sample offsets can be computed using the lookup table. We use $\&$ to denote a bitwise and operation, $\vert$ to denote a bitwise or operation, $\gg$ to denote a right shift operation, and $\sim$ to denote a bitwise negation. In Table 3.3 we give an example of the calculation for a block size of $32 \times 16 \times 8$.


\begin{algorithm}
% latex2html id marker 949\footnotesize
\caption{ComputeRe...
...n} $((i' + j' + k') \gg \min (N_x ,N_y ,N_z))$
\end{algorithmic}\end{algorithm}


\begin{algorithm}
% latex2html id marker 956\footnotesize
\caption{ComputeSa...
...mple_{i,j,k} + LUT[index][6]$\\
\end{tabular} \end{algorithmic}\end{algorithm}


Table 3.3: Lookup table index calculation for 8-neighborhood. The calculation of the lookup table index is shown for $B_x = 32$, $B_y = 16$, $B_z = 8$
Case $i_a$ $j_a$ $k_a$ $i_b$ $j_b$ $k_b$ $i'$ $j'$ $k'$ $index$
0 0-30 0-14 0-6 1-31 1-15 1-7 0 0 0 0
1 0-30 0-14 7 1-31 1-15 8 0 0 8 1
2 0-30 15 0-6 1-31 16 1-7 0 16 0 2
3 0-30 15 7 1-31 16 8 0 16 8 3
4 31 0-14 0-6 32 1-15 1-7 32 0 0 4
5 31 0-14 7 32 1-15 8 32 0 8 5
6 31 15 0-6 32 16 1-7 32 16 0 6
7 31 15 7 32 16 8 32 16 8 7

$\begin{array}{rcl}
\\
index & = & ((i' + j' + k') \gg \min (N_x ,N_y ,N_z))\\ ...
..._a,k_a\}}} + 1)}_{{\{i_b,j_b,k_b\}}} \;\&\; \sim(B_{\{x,y,z}\} - 1)
\end{array}$



Table 3.4: Lookup table index calculation for 26-neighborhood. The calculation of the lookup table index is shown for $B_x = 32$, $B_y = 16$, $B_z = 8$.
Case $i_a$ $j_a$ $k_a$ $i_b$ $j_b$ $k_b$ $i_c$ $j_c$ $k_c$ $i'$ $j'$ $k'$ $index$
0 1-30 1-14 1-6 0-29 0-13 0-5 2-30 2-14 2-6 0 0 0 0
1 1-30 1-14 7 0-29 0-13 6 2-30 2-14 8 0 0 1 1
2 1-30 1-14 0 0-29 0-13 15 2-30 2-14 16 0 0 2 2
3 1-30 15 1-6 0-29 14 0-5 2-30 16 2-6 0 1 0 3
4 1-30 15 7 0-29 14 6 2-30 16 8 0 1 1 4
5 1-30 15 0 0-29 14 15 2-30 16 16 0 1 2 5
6 1-30 0 1-6 0-29 31 0-5 2-30 32 2-6 0 2 0 6
7 1-30 0 7 0-29 31 6 2-30 32 8 0 2 1 7
8 1-30 0 0 0-29 31 15 2-30 32 16 0 2 2 8
9 31 1-14 1-6 30 0-13 0-5 32 2-14 2-6 1 0 0 9
10 31 1-14 7 30 0-13 6 32 2-14 8 1 0 1 10
11 31 1-14 0 30 0-13 15 32 2-14 16 1 0 2 11
12 31 15 1-6 30 14 0-5 32 16 2-6 1 1 0 12
13 31 15 7 30 14 6 32 16 8 1 1 1 13
14 31 15 0 30 14 15 32 16 16 1 1 2 14
15 31 0 1-6 30 31 0-5 32 32 2-6 1 2 0 15
16 31 0 7 30 31 6 32 32 8 1 2 1 16
17 31 0 0 30 31 15 32 32 16 1 2 2 17
18 0 1-14 1-6 63 0-13 0-5 64 2-14 2-6 2 0 0 18
19 0 1-14 7 63 0-13 6 64 2-14 8 2 0 1 19
20 0 1-14 0 63 0-13 15 64 2-14 16 2 0 2 20
21 0 15 1-6 63 14 0-5 64 16 2-6 2 1 0 21
22 0 15 7 63 14 6 64 16 8 2 1 1 22
23 0 15 0 63 14 15 64 16 16 2 1 2 23
24 0 0 1-6 63 31 0-5 64 32 2-6 2 2 0 24
25 0 0 7 63 31 6 64 32 8 2 2 1 25
26 0 0 0 63 31 15 64 32 16 2 2 2 26
$\begin{array}{rcl}
\\
index & = & (9i' + 3j'+ k')\\
\{i',j',k'\} & = & \under...
...,j_b,k_b\}} \;\vert\; 1) + 1) }_{\{i_c,j_c,k_c\}} \gg N_{\{x,y,z\}}
\end{array}$


A similar approach can be done for the gradient computation. We present a general solution for a 26-connected neighborhood. Here we can, analogous to the resample case, distinguish 27 cases. For 2D, this is illustrated in Figure 3.9 (b). Depending on the position of sample $(i, j, k)$ a block is subdivided into 27 subsets.

The first step is to extract the block offset, by a conjunction with $B_{\{x, y, z\}}-1$. Then we subtract one, and conjunct with $B_{\{x,y,z\}} + B_{\{x,y,z\}} - 1$, to separate the case if one or more components are zero. In other words, zero is mapped to $2 \cdot B_{\{x,y,z\}}-1$. All the other values stay within the range $[0,B_{\{x,y,z\}}-2]$. To separate the case of one or more components being $B_{\{x, y, z\}}-1$, we add 1, after the previous subtraction is undone by a disjunction with 1, without loosing the separation of the zero case. Now all the cases are mapped to $\{0,1,2\}$ to obtain a ternary system. This is done by dividing the components by the corresponding block dimensions. These divisions can be replaced by faster shift operations. Then the three ternary variables are mapped to an index in the range of $[0,26]$. The final lookup table index computation is given in Algorithm 6. In Table 3.4 we give an example of the calculation for a block size of $32 \times 16 \times 8$.


\begin{algorithm}
% latex2html id marker 1019\footnotesize
\caption{ComputeG...
...N_z$
\STATE \textbf{return} $(9i' + 3j'+ k')$
\end{algorithmic}\end{algorithm}

The presented index computations can be performed efficiently on current CPUs, since they only consist of simple bit manipulations. The lookup tables can be used in raycasting on a bricked volume layout for efficient access to neighboring samples. The first table can be used if only the eight samples within a cell have to be accessed (e.g., if gradients have been pre-computed). Compared to the if-else solution which has the costly address computation in the else branch, we get a speedup of about 30%. The benefit varies, depending on the block dimensions. For a $32 \times 32 \times 32$ block size the else-branch has to be executed in 10% of the cases and for a $16 \times 16 \times 16$ block size in 18% of the cases.

The second table allows full access to a 26-neighborhood. This approach reduces the cost of addressing by 40% compared to a if-else solution, as the else-branch has to be executed more often for a larger neighborhood. The lookup table has a size of 27 cases $\cdot$ 26 offsets $\cdot$ 4 bytes per offset = 2808 bytes. This can be reduced by a factor of two due to symmetry reasons. Therefore we have a very small lookup table of 1404 bytes. This is an improvement of a factor of 2427 compared to the straight-forward solution.

Another possible option to simplify the addressing is to inflate each block by an additional border of samples from the neighboring blocks [11]. However, such a solution increases the overall memory usage considerably. For example, for a block size of $32 \times 32 \times 32$ the total memory is increased by approximately 20%. This is an inefficient usage of memory resources and the storage redundancy reduces the effective memory bandwidth. Our approach practically requires no additional memory, as all blocks share one global address lookup table.