Simultaneous Multithreading

Simultaneous Multithreading is a well-known concept in workstation and mainframe hardware. It is based on the observation that the execution resources of a processor are rarely fully utilized. Due to memory latencies and data dependencies between instructions, execution units have to wait for instructions to finish. While modern processors have out-of-order execution units which reorder instructions to minimize these delays, they rarely find enough independent instructions to exploit the processor's full potential. SMT uses the concept of multiple logical processors which share the resources (including caches) of just one physical processor. Executing two threads simultaneously on one processor makes more independent instructions available and thus leads to more efficient CPU utilization. SMT can be implemented by duplicating only the architectural state registers, which adds little to manufacturing costs. Intel's SMT implementation is called Hyper-Threading [28] and was first available on the Pentium 4 CPU. Currently, two logical CPUs per physical CPU are supported (see Figure 3.13). Hyper-Threading adds less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.

Figure 3.13: Comparison of conventional CPU and Hyper-Threading CPU. (a) conventional CPU with single architectural state. (b) Hyper-Threading CPU with duplicated architectural state, one for each logical processor.

Exploiting SMT, however, is not as straightforward as it may seem at first glance. Since the logical processors share caches, it is essential that the threads operate on neighboring data items. Therefore, in general, treating the logical CPUs in the same way as physical CPUs leads to little or no performance increase. It might even decrease performance, due to cache thrashing. Thus, our processing scheme has to be extended to allow multiple threads to operate within the same block.

The blocks are distributed among physical processors as described in the previous section. Additionally, within a block, multiple threads, each executing on a logical CPU, simultaneously process the rays of the block. Using several threads to process a single ray list would lead to race conditions and would therefore require expensive synchronization. Thus, instead of one ray list per physical CPU, each block now has $count_{logical}$ lists per physical CPU, where $count_{logical}$ is the number of threads that simultaneously process the block, i.e., the number of logical CPUs per physical CPU. Each block therefore has $count_{physical} \cdot count_{logical}$ ray lists $raylist[id_{physical}][id_{logical}]$, where $id_{physical}$ identifies the physical CPU and $id_{logical}$ identifies the logical CPU relative to the physical CPU. A ray can move between physical CPUs depending on how the block lists are partitioned in each pass, but it always remains in a ray list with the same $id_{logical}$. To achieve equal workloads between threads, the rays have to be distributed evenly among these lists initially, e.g., by alternating the $id_{logical}$ of the list a ray is inserted into during ray setup.
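The per-block list layout and the round-robin insertion during ray setup can be sketched as follows. This is a minimal illustration, not the actual implementation; the `Ray` fields and the names `Block` and `insertRay` are assumptions for the sketch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ray record; the real per-ray state is more elaborate.
struct Ray { float pos[3]; float dir[3]; };

// Per-block ray storage: one list per (physical CPU, logical CPU) pair,
// i.e. count_physical * count_logical lists in total.
struct Block {
    Block(std::size_t countPhysical, std::size_t countLogical)
        : raylist(countPhysical,
                  std::vector<std::vector<Ray>>(countLogical)) {}

    // raylist[id_physical][id_logical]
    std::vector<std::vector<std::vector<Ray>>> raylist;
};

// During ray setup, alternate the logical list a ray is inserted into,
// so the workload is split evenly between the simultaneous threads.
void insertRay(Block& block, std::size_t idPhysical,
               std::size_t countLogical, const Ray& ray) {
    static std::size_t next = 0;  // round-robin over id_logical
    block.raylist[idPhysical][next % countLogical].push_back(ray);
    ++next;
}
```

Because each thread later reads only its own list, no locking is needed on the lists during raycasting.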

The basic algorithm described in the previous section is extended in the following way: the $ProcessBlocks$ procedure (see Algorithm 9) now starts the execution of $ProcessRays$ for each logical CPU of the physical CPU it is executed on. $ProcessRays$ (see Algorithm 10) processes the rays of a block for one logical CPU. All other routines remain unchanged.

Algorithm 9: The extended $ProcessBlocks$ procedure.

Algorithm 10: The $ProcessRays$ procedure.
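The split between the two procedures can be sketched as below: $ProcessBlocks$ runs on one physical CPU, spawns one $ProcessRays$ worker per logical CPU for each block, and waits for them before moving on. This is a simplified sketch under the paper's list layout; the `castRay` stand-in and the counter are assumptions for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Ray { /* per-ray state (position, direction, ...) */ };

struct Block {
    // raylist[id_physical][id_logical], as in the text
    std::vector<std::vector<std::vector<Ray>>> raylist;
};

std::atomic<int> g_raysCast{0};  // instrumentation for this sketch only

// Stand-in for casting one ray through one block.
void castRay(Block&, Ray&) { ++g_raysCast; }

// ProcessRays: one logical CPU processes its own ray list of the block,
// so no synchronization on the list itself is needed.
void processRays(Block& block, std::size_t idPhysical,
                 std::size_t idLogical) {
    for (Ray& ray : block.raylist[idPhysical][idLogical])
        castRay(block, ray);
}

// ProcessBlocks: executed for one physical CPU; for every block it starts
// one ProcessRays worker per logical CPU and joins them before the next
// block, so both threads always work within the same block (shared cache).
void processBlocks(std::vector<Block*>& blocks, std::size_t idPhysical,
                   std::size_t countLogical) {
    for (Block* block : blocks) {
        std::vector<std::thread> workers;
        for (std::size_t idLogical = 0; idLogical < countLogical; ++idLogical)
            workers.emplace_back(processRays, std::ref(*block),
                                 idPhysical, idLogical);
        for (std::thread& t : workers)
            t.join();  // wait for all logical CPUs of this block
    }
}
```

Joining per block (rather than per pass) is what keeps the simultaneous threads on neighboring data and avoids the cache thrashing discussed above.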

Figure 3.14 depicts the operation of the algorithm for a system with two physical CPUs, each allowing simultaneous execution of two threads, i.e., $count_{physical}=2$ and $count_{logical}=2$. In the beginning, seven threads, $T_0,...,T_6$, are started. $T_0$ performs all the preprocessing. In particular, it has to assign the rays to those blocks through which they enter the volume first. Then it has to choose the lists of blocks which can be processed simultaneously, with respect to the eight viewing directions that have to be distinguished. Each list is partitioned by $T_0$ and sent to $T_1$ and $T_2$. After a list is sent, $T_0$ sleeps until its slaves are finished, then continues with the next pass. $T_1$ sends one block after the other to $T_3$ and $T_4$; $T_2$ sends one block after the other to $T_5$ and $T_6$. After a block is sent, they sleep until their slaves are finished, then send the next block to process, and so on. $T_3$, $T_4$, $T_5$, and $T_6$ perform the actual raycasting: $T_3$ and $T_4$ simultaneously process one block, while $T_5$ and $T_6$ simultaneously process another block.
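The thread count follows directly from the two parameters: one master thread for preprocessing and pass control, one distributor thread per physical CPU, and $count_{logical}$ raycasting threads per physical CPU. A minimal check (the function name is hypothetical):

```cpp
// Total threads: 1 master (T0) + one distributor per physical CPU
// (T1, T2) + count_logical raycasting threads per physical CPU
// (T3..T6 in the example of Figure 3.14).
int threadCount(int countPhysical, int countLogical) {
    return 1 + countPhysical + countPhysical * countLogical;
}
```

For $count_{physical}=2$ and $count_{logical}=2$ this yields the seven threads of Figure 3.14.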

Figure 3.14: Simultaneous Multithreading enabled raycasting. The work is distributed among the threads $T_i$ executing on different logical CPUs.