Memory Efficient Acceleration Data Structures

To demonstrate the impact of our high-level optimizations we used a commodity notebook system equipped with an Intel Centrino 1.6 GHz CPU, 1 MB level 2 cache, and 1 GB RAM. This system has one CPU and does not support Hyper-Threading, so the presented results reflect only the performance gains due to our high-level acceleration techniques.

The memory consumption of the gradient cache is not related to the volume dimensions, but is determined by the fixed block size. We use blocks of size $32 \times 32 \times 32$; the size of the gradient cache is therefore $33^3 \cdot 3 \cdot 4$ bytes $\approx$ 422 KB. Additionally, we store a validity bit for each cache entry, which adds $33^3 / 8$ bytes $\approx$ 4.39 KB.
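The following sketch illustrates one possible layout of such a per-block gradient cache; it is not the authors' implementation, and all type and member names are our own. It only reproduces the memory figures above: gradients are cached at the $33^3$ grid points of a $32^3$ block, with one validity bit per entry.

```cpp
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t kBlockSize   = 32;   // block is 32 x 32 x 32 cells
constexpr std::size_t kCachePoints =       // (32 + 1)^3 = 33^3 grid points
    (kBlockSize + 1) * (kBlockSize + 1) * (kBlockSize + 1);

// Hypothetical per-block gradient cache matching the sizes quoted in the text.
struct GradientCache {
    // 33^3 gradients * 3 components * 4 bytes (float) ~= 422 KB
    std::array<float, kCachePoints * 3> gradients;
    // one validity bit per cache entry: 33^3 / 8 bytes ~= 4.39 KB
    std::bitset<kCachePoints> valid;

    // Reset all validity bits when a new block is entered.
    void invalidate() { valid.reset(); }
};
```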

The impact of our gradient caching scheme is determined by the zoom factor and the object sample distance. At a zoom factor of 1.0 there is one ray per cell; already here, per-block gradient caching performs better than per-cell gradient caching because gradients are shared between neighboring cells. When zooming out, both gradient caching schemes perform equally well, as the rays are so far apart that almost no gradients can be shared. When zooming in, on the other hand, per-block caching performs much better than per-cell caching due to the increased number of rays per cell. For a zoom factor of 2.0 and a typical object sample distance of 0.5, per-block gradient caching achieves a speedup of approximately 3.0 compared to no gradient caching.
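A minimal sketch of the lazy lookup that makes this sharing possible, continuing the hypothetical GradientCache structure above. The function and parameter names, as well as the placeholder gradient estimator, are illustrative and not taken from the paper: a gradient is computed at most once per block, and every sample of every ray that touches a cell sharing that grid point reuses it.

```cpp
struct Vec3 { float x, y, z; };

// Placeholder gradient estimator; a real implementation would apply, e.g.,
// central differences to the volume samples around the given grid point.
Vec3 estimateGradient(std::size_t /*gridPointIndex*/)
{
    return {0.0f, 0.0f, 0.0f};
}

// Return the cached gradient at a grid point, computing it on first use.
Vec3 fetchGradient(GradientCache& cache, std::size_t gridPointIndex)
{
    if (!cache.valid[gridPointIndex]) {
        Vec3 g = estimateGradient(gridPointIndex);
        cache.gradients[gridPointIndex * 3 + 0] = g.x;
        cache.gradients[gridPointIndex * 3 + 1] = g.y;
        cache.gradients[gridPointIndex * 3 + 2] = g.z;
        cache.valid.set(gridPointIndex);
    }
    return { cache.gradients[gridPointIndex * 3 + 0],
             cache.gradients[gridPointIndex * 3 + 1],
             cache.gradients[gridPointIndex * 3 + 2] };
}
```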

The additional memory usage of the acceleration data structures is rather low. The cell invisibility cache has a size of $32^3$ bits = 4096 bytes. The min-max octree has a depth of three and stores 4 bytes at each node (a 2-byte minimum and a 2-byte maximum value), requiring at most 2340 bytes. Additionally, the classification information is stored, which requires 66 bytes. We use blocks of size $32 \times 32 \times 32$ storing 2 bytes per sample, i.e., 65536 bytes in total. Our data structures thus increase the total memory requirements by approximately 10%.
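The sketch below restates this memory budget in code form. It is an assumption-laden illustration, not the paper's data layout; the names are hypothetical, and the node counts in the comments simply recompute the figures quoted above.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 32;

// Cell invisibility cache: one bit per cell of a 32^3 block = 4096 bytes.
using CellInvisibilityCache = std::bitset<kBlockSize * kBlockSize * kBlockSize>;

// Min-max octree node: a 2-byte minimum and a 2-byte maximum = 4 bytes.
struct MinMaxNode {
    std::uint16_t min;
    std::uint16_t max;
};

// Depth-three octree over a block: 1 + 8 + 64 + 512 = 585 nodes,
// i.e. at most 585 * 4 = 2340 bytes.
constexpr std::size_t kOctreeNodes = 1 + 8 + 64 + 512;

struct BlockAcceleration {
    CellInvisibilityCache invisible;             // 4096 bytes
    MinMaxNode            octree[kOctreeNodes];  // 2340 bytes
    // classification information (66 bytes in the text) omitted here
};

// The block itself stores 2 bytes per sample: 32^3 * 2 = 65536 bytes,
// so the extra structures add roughly (4096 + 2340 + 66) / 65536 ~= 10 %.
```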

Our tests have shown that a combination of the proposed optimizations achieves frame rates of about 2 frames per second for various large datasets.