Parallelization Strategies for Commodity Hardware

To evaluate the performance of our parallelization strategies, we used the same test system as in the previous section. This system has two CPUs and supports Hyper-Threading.

Our system can pin threads to specific physical and logical CPUs. Using this mechanism, we tested different configurations to measure the speedup achieved by the presented techniques. All test runs consistently showed the same speedup factors. Running Simultaneous Multithreading on a single CPU yielded an average speedup of 30%. As the viewing direction changes, the speedup varies between 25% and 35% because of different transfer patterns between the level 1 and level 2 caches. Regardless of whether Hyper-Threading is enabled, adding a second CPU reduces the computation time by approximately 50%, i.e., Symmetric Multiprocessing and Simultaneous Multithreading are independent. This shows that our Simultaneous Multithreading scheme scales well on multi-processor machines: the Hyper-Threading benefit of approximately 30% is maintained when the second hyper-threaded CPU is enabled.
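
The independence of the two effects can be illustrated with a short calculation. As an assumption, the 30% is read here as a reduction in computation time, consistent with the factor $\frac{100}{100 - 30}$ given for the optimal block size, and both percentages are treated as exact although they are reported as approximate. With $T$ denoting the single-threaded time on one CPU (a symbol introduced only for this illustration):

$$T_{\mathrm{SMT}} \approx (1 - 0.30)\,T = 0.70\,T, \qquad T_{\mathrm{SMP}} \approx (1 - 0.50)\,T = 0.50\,T$$

$$T_{\mathrm{SMP+SMT}} \approx 0.70 \times 0.50\,T = 0.35\,T, \qquad \frac{T}{T_{\mathrm{SMP+SMT}}} \approx \frac{1}{0.35} \approx 2.86$$

Under these assumptions, two hyper-threaded CPUs would be roughly 2.86 times faster than one CPU without Hyper-Threading, which is the product of the individual factors $\frac{1}{0.70} \approx 1.43$ and $\frac{1}{0.50} = 2$.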

The speedup for Simultaneous Multithreading varies with the block size. It decreases significantly with larger block sizes: once the level 2 cache size is exceeded, the two threads have to request data from main memory, so the CPU execution units are less utilized. Very small block sizes suffer from a different problem. The data almost fits into the level 1 cache, which means that one thread can utilize the execution units more efficiently while the second thread is idle during this time. The overall disadvantage, however, is the inefficient usage of the level 2 cache. The optimal speedup of $\frac{100}{100 - 30} \approx 1.43$ is achieved with a block size of 64 KB ($32 \times 32 \times 32$). This is also the optimal block size for the bricked volume layout.
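
The relation between block edge length, memory footprint, and cache level can be sketched in a few lines of Python. This is a minimal illustration, not the measurement code used for the results above. The cache sizes `L1_BYTES` and `L2_BYTES` are hypothetical placeholders, since the text does not state them, and `BYTES_PER_VOXEL = 2` is an assumption inferred from the fact that $32 \times 32 \times 32$ voxels occupy 64 KB. All function names are introduced here for illustration only.

```python
# Hypothetical cache sizes, chosen only for illustration (not given in the text).
L1_BYTES = 8 * 1024
L2_BYTES = 512 * 1024

# Assumption: 16-bit voxels, so that a 32 x 32 x 32 block occupies 64 KB.
BYTES_PER_VOXEL = 2


def block_bytes(edge, bytes_per_voxel=BYTES_PER_VOXEL):
    """Memory footprint in bytes of a cubic block with `edge` voxels per side."""
    return edge ** 3 * bytes_per_voxel


def cache_regime(edge, l1=L1_BYTES, l2=L2_BYTES):
    """Classify a block by the smallest assumed cache level that can hold it."""
    size = block_bytes(edge)
    if size <= l1:
        return "fits in L1"
    if size <= l2:
        return "fits in L2"
    return "exceeds L2"


def speedup_factor(time_reduction_percent):
    """Convert a percentage reduction in run time into a speedup factor."""
    return 100.0 / (100.0 - time_reduction_percent)


# Print footprint and cache regime for a few candidate edge lengths.
for edge in (8, 32, 128):
    kib = block_bytes(edge) / 1024
    print(f"{edge}^3 voxels: {kib:.0f} KB, {cache_regime(edge)}")

print(f"speedup for a 30% time reduction: {speedup_factor(30):.2f}")
```

With these placeholder sizes, the 64 KB block is too large for the level 1 cache but well within the level 2 cache, which matches the qualitative argument above. The actual thresholds depend on the cache sizes of the test system and on how the two threads share them.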