# A VLSI-Design for fast Vector Normalization 

Günter Knittel<br>Wilhelm-Schickard-Institut für Informatik - Graphisch-Interaktive Systeme (WSI/GRIS)<br>Universität Tübingen<br>Auf der Morgenstelle 10, D-72076 Tübingen<br>email: knittel@goya.gris.informatik.uni-tuebingen.de


#### Abstract

The design of a vector normalizer is described. It is an integral part of our graphics subsystem for scientific visualization, but will be of great use for speeding up any computer graphics architecture. In the actual design, the circuitry handles 3D-vectors with 33 bit two's complement components. The components of the normalized vectors are computed as 16 bit two's complement fixed-point numbers. Due to the overall pipeline architecture, the chip accepts one 3D-vector and produces one normalized vector each clock. To normalize a 3D-vector, three square operations, two additions, one square root operation and three divisions must be performed. The target clock frequency is 50 MHz , by which the performance of the chip rates at 450 MOPS. A single-chip VLSI implementation is currently in work, simulation results will be available by the end of the third quarter ' 93 . We use Mentor 8.2 tools on HP 700 workstations and Toshiba's TC160G Gate Array technology.


Keywords: graphics hardware, arithmetic accelerator, real-time Phong shading

## 1 Introduction

Most computer graphics algorithms require fast and frequent vector normalizations. For example, the well-known Phong illumination model [BuiT75]

$$
\begin{equation*}
I=I_{A} k_{a} C+I_{L}\left(k_{d} C\left(\stackrel{\rightharpoonup}{G}_{N} \stackrel{\rightharpoonup}{L}_{N}\right)+k_{s}\left(\stackrel{\rightharpoonup}{G}_{N} \stackrel{\rightharpoonup}{H}_{N}\right)^{n}\right) \quad(\text { simplified })^{*} \tag{1}
\end{equation*}
$$

calculates the light intensity / of a point on an object surface according to four unit vectors:

- the surface normal $\vec{G}_{N}$,
- the normalized vector $\vec{L}_{N}$ in direction of the light source and
a the so-called halfway vector $\vec{H}_{N}$, which in turn is the normalized sum of $\vec{L}_{N}$ and the normalized vector $V_{N}$ in direction of the observer.
Applications aiming at virtual reality, e.g., the graphics subsystem for volume rendering developed at WSI/GRIS [Knit93], must provide perspective projection and non-parallel light, that is, none of the vectors is constant. Unfortunately, normalizing a vector presents a great computational expense (especially the square root operation) and, moreover, (1) has to be evaluated several millions of times each second.
This was the motivation to develop a high-speed single-chip vector normalizer. The large number of vectors to be processed sequentially permits the use of a moderately deep pipeline structure without any performance penalty. For the square root function, an algorithm was adapted which computes one result bit per stage and uses only a small circuitry within each stage. The architecture of the chip is scalable with respect to speed and required chip space (by placing more or less functional units into a single pipeline stage) or precision (by adding the appropriate number of stages and operand bits).

[^0]
## 2 Architectural Overview

The circuitry described on the following pages accepts the components of a 3D-vector $\vec{V}=\left\{v_{x} ; v_{y} ; v_{z}\right\}$ and produces its associated normalized vector $\vec{N}=\left\{n_{x} ; n_{y} ; n_{z}\right\}$.


The block diagram shows the deep but regular pipeline structure of the chip. The boxes with the small filled triangle represent registers. The register structure within the pipelined units (square root unit and divide unit) has been ommitted for clarity, but will be explained in later sections.
Operands which skip certain functional units must travel through pipeline registers (FIFOs) to maintain synchronization. Thus, FIFO memories must also be placed onto the chip.
There are no feedbacks or functional units for exception handling required, by which the control structure becomes extremely simple. There is an additional valid flag which travels along with each vector and a small circuitry to mask the clock. Besides that, the chip has just to be clocked.
The excessive pipeline structure relies on a great number of vectors to be processed sequentially, as is the case in most computer graphics applications and especially in the algorithms used in our voxel subsystem. Thus, the pipeline will always be filled and so operate at maximum efficiency.
We assume a global space of 32 bit extent in each direction, that is

$$
\begin{equation*}
-2^{31} \leq x, y, z \leq 2^{31}-1 . \tag{2}
\end{equation*}
$$

Therefore, the input operands are expected to be 33 bit two's complement integers. Smaller operands must be sign-extended to 33 bits. The components of the normalized vector are computed as 16 bit two's complement fixedpoint numbers

$$
\begin{equation*}
n=-n_{0} \times 2^{0}+\sum_{j=-1}^{-15} n_{j} \times 2^{j} \tag{3}
\end{equation*}
$$

Thus, the chip has $147 \mathrm{I} / \mathrm{O}$ - pins (excluding control-, test- and clock-terminals).
We will now describe all functional units in dataflow order in details, e.g. by Boolean equations or by schematic drawings. For each functional unit, a coarse gate count estimation will be given.

## 3 Naming Conventions

A vector is denoted by an uppercase letter with an arrow. The components are designated by the lowercase letter with the indeces $\mathrm{x}, \mathrm{y}$ and z , e.g. $\vec{U}=\left\{u_{x} ; u_{y} ; u_{z}\right\}$. If an operation is applied to any component, the index is omitted. The particular bits of a component or a magnitude are identified by subscript numbers, e.g. $u=\left\{u_{15} ; u_{14} ; u_{13} \ldots u_{0}\right\}$. The vector length is represented by the uppercase letter without any diacritical marks. The bits of squared variables are quoted, e.g. $U^{2}=\left\{U_{31}^{\prime} ; U_{30}^{\prime} \ldots U_{0}^{\prime}\right\}$.

## 4 The Sign Unit (Input)

The sign unit at the inputs converts a 33 bit two's complement number $v$ into a 32 bit unsigned integer a preceded by a sign flag $S$. Thus, the range is restricted to

$$
\begin{equation*}
-2^{32}+1 \leq v \leq 2^{32}-1 . \tag{4}
\end{equation*}
$$

The sign flag is 1 if the number is negative. All sign flags are propagated through the whole circuit and passed to the sign units at the outputs.
The arithmetic operation is to invert all bits and add 1 if the highest bit is set, otherwise to leave everything unchanged. Thus:

$$
\begin{align*}
a_{0} & =v_{0} ;  \tag{5}\\
a_{1} & =\bar{v}_{32} v_{1} \vee v_{32}\left(\bar{v}_{1} v_{0} \vee v_{1} \bar{v}_{0}\right)=\bar{v}_{32} v_{1} \vee v_{32}\left(v_{1} \oplus v_{0}\right) ;  \tag{6}\\
a_{2} & =\bar{v}_{32} v_{2} \vee v_{32}\left(\bar{v}_{2}\left(v_{1} \vee v_{0}\right) \vee v_{2} \bar{v}_{1} \bar{v}_{0}\right)=\bar{v}_{32} v_{2} \vee v_{32}\left(v_{2} \oplus\left(v_{1} \vee v_{0}\right)\right) ;  \tag{7}\\
a_{3} & =\bar{v}_{32} v_{3} \vee v_{32}\left(\bar{v}_{3}\left(v_{2} \vee v_{1} \vee v_{0}\right) \vee v_{3} \bar{v}_{2} \bar{v}_{1} v_{0}\right)  \tag{8}\\
& =\bar{v}_{32} v_{3} \vee v_{32}\left(v_{3} \oplus\left(v_{2} \vee v_{1} \vee v_{0}\right)\right) ; \tag{9}
\end{align*}
$$

In general:
$a_{P}=\bar{v}_{32} v_{P} \vee v_{32}\left(v_{P} \oplus\left(v_{P-1} \vee v_{P-2} \vee \ldots \vee v_{1} \vee v_{0}\right)\right) ; \quad S=v_{32}$.
Gate count: 3.000

## 5 The Alignment Unit

In order to reduce the width of the arithmetic units, the components of the vector are uniformly scaled up or down until no component is greater than $2^{15}-1$ and at least one component is greater than or equal to $2^{14}$. Theoretically, no error emerges from this operation since

$$
\begin{equation*}
n=\frac{v \times 2^{n}}{\sqrt{\left(v_{x} \times 2^{n}\right)^{2}+\left(v_{y} \times 2^{n}\right)^{2}+\left(v_{z} \times 2^{n}\right)^{2}}}=\frac{v}{\sqrt{v_{x}^{2}+v_{y}^{2}+v_{z}^{2}}} . \tag{11}
\end{equation*}
$$

However, due to the possible truncation of large vectors, a rounding error might arise. See Section 14 Error Estimation.
To describe the function of this unit we use the following abbreviations:
$\operatorname{SHR} 17=\left(a_{x_{31}} \vee a_{y_{31}} \vee a_{z_{31}}\right)$;
$\operatorname{SHR} 16=\left(a_{x_{30}} \vee a_{y_{30}} \vee a_{z_{30}}\right) \wedge \bar{a}_{x_{31}} \wedge \bar{a}_{y_{31}} \wedge \bar{a}_{z_{31}}$;
$\operatorname{SHR} 15=\left(a_{x_{29}} \vee a_{y_{29}} \vee a_{z_{29}}\right) \wedge \bar{a}_{x_{31}} \wedge \bar{a}_{x_{30}} \wedge \bar{a}_{y_{31}} \wedge \bar{a}_{y_{30}} \wedge \bar{a}_{z_{31}} \wedge \bar{a}_{z_{30}}$;
$\operatorname{SHR} 14=\left(a_{x_{28}} \vee a_{y_{28}} \vee a_{z_{28}}\right) \wedge \bar{a}_{x_{31}} \wedge \ldots \wedge \bar{a}_{x_{29}} \wedge \bar{a}_{y_{31}} \wedge \ldots \wedge \bar{a}_{y_{29}} \wedge \bar{a}_{z_{31}} \wedge \ldots \wedge \bar{a}_{z_{29}} ;$

SHL1 $=\left(a_{x_{13}} \vee a_{y_{13}} \vee a_{z_{13}}\right) \wedge \bar{a}_{x_{31}} \wedge \ldots \wedge \bar{a}_{x_{14}} \wedge \bar{a}_{y_{31}} \wedge \ldots \wedge \bar{a}_{y_{14}} \wedge \bar{a}_{z_{31}} \wedge \ldots \wedge \bar{a}_{z_{14}} ;$
SHL14 $=\left(a_{x_{0}} \vee a_{y_{0}} \vee a_{z_{0}}\right) \wedge \bar{a}_{x_{31}} \wedge \ldots \wedge \bar{a}_{x_{1}} \wedge \bar{a}_{y_{31}} \wedge \ldots \wedge \bar{a}_{y_{1}} \wedge \bar{a}_{z_{31}} \wedge \ldots \wedge \bar{a}_{z_{1}} ;$
The function of the alignment circuitry is then defined by:
$q_{0}=a_{0} \wedge$ SH0 $\vee a_{1} \wedge$ SHR $1 \vee a_{2} \wedge$ SHR $2 \vee a_{3} \wedge$ SHR $3 \vee \ldots \vee a_{17} \wedge$ SHR17;
$q_{1}=a_{0} \wedge$ SHL $1 \vee a_{1} \wedge$ SHO $\vee a_{2} \wedge$ SHR $1 \vee a_{3} \wedge$ SHR $2 \vee \ldots \vee a_{18} \wedge$ SHR 17 ;
$q_{2}=a_{0} \wedge$ SHL $2 \vee a_{1} \wedge$ SHL1 $\vee a_{2} \wedge$ SH0 $\vee a_{3} \wedge$ SHR $1 \vee \ldots \vee a_{19} \wedge$ SHR 17 ;
$q_{14}=a_{0} \wedge$ SHL14 $\vee \ldots \vee a_{13} \wedge$ SHL $1 \vee a_{14} \wedge$ SH0 $\vee a_{15} \wedge$ SHR $1 \vee \ldots$

$$
\begin{equation*}
\ldots \vee a_{31} \wedge \text { SHR17 } \tag{22}
\end{equation*}
$$

Gate count: 3.000

## 6 The Square Units

Since the input operands are 15 bit integers, the results are 30 bit positive numbers. We use standard multiplier networks. However, the computing pattern shows some redundancy which can be exploited to cut the required chip space by one half.
We will demonstrate the scheme for a 6 bit number.
$q^{2}=\left(q_{5} 2^{5}+q_{4} 2^{4}+q_{3} 2^{3}+q_{2} 2^{2}+q_{1} 2^{1}+q_{0} 2^{0}\right)^{2}=\left(q^{\prime}{ }_{11} 2^{11}+q_{10}^{\prime} 2^{10}+\ldots+q_{0}^{\prime} 2^{0}\right)$.
The computing pattern is shown in the following table:

| $\mathrm{q}_{11}^{\prime}$ | $\mathrm{q}^{\prime}{ }_{10}$ | $\mathrm{q}_{9}^{\prime}$ | $\mathrm{q}_{8}^{\prime}{ }_{8}$ | $\mathrm{q}_{7}^{\prime}$ | $\mathrm{q}_{6}^{\prime}$ | $\mathrm{q}_{5}^{\prime}{ }_{5}$ | $\mathrm{q}_{4}^{\prime}$ | $\mathrm{q}_{3}^{\prime}$ | $\mathrm{q}^{\prime}{ }_{2}$ | $\mathrm{q}_{1}^{\prime}$ | $\mathrm{q}_{0}^{\prime}$ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  |  |  |  |  |  | $\mathrm{q}_{5} \mathrm{q}_{0}$ | $\mathrm{q}_{4} \mathrm{q}_{0}$ | $\mathrm{q}_{3} \mathrm{q}_{0}$ | $\mathrm{q}_{2} \mathrm{q}_{0}$ | $\mathrm{q}_{1} \mathrm{q}_{0}$ | $\mathrm{q}_{0}$ |
|  |  |  |  |  | $\mathrm{q}_{5} \mathrm{q}_{1}$ | $\mathrm{q}_{4} \mathrm{q}_{1}$ | $\mathrm{q}_{3} \mathrm{q}_{1}$ | $\mathrm{q}_{2} \mathrm{q}_{1}$ | $\mathrm{q}_{1}$ | $\mathrm{q}_{1} \mathrm{q}_{0}$ |  |
|  |  |  |  | $\mathrm{q}_{5} \mathrm{q}_{2}$ | $\mathrm{q}_{4} \mathrm{q}_{2}$ | $\mathrm{q}_{3} \mathrm{q}_{2}$ | $\mathrm{q}_{2}$ | $\mathrm{q}_{2} \mathrm{q}_{1}$ | $\mathrm{q}_{2} \mathrm{q}_{0}$ |  |  |
|  |  |  | $\mathrm{q}_{5} \mathrm{q}_{3}$ | $\mathrm{q}_{4} \mathrm{q}_{3}$ | $\mathrm{q}_{3}$ | $\mathrm{q}_{3} \mathrm{q}_{2}$ | $\mathrm{q}_{3} \mathrm{q}_{1}$ | $\mathrm{q}_{3} \mathrm{q}_{0}$ |  |  |  |
|  |  | $\mathrm{q}_{5} \mathrm{q}_{4}$ | $\mathrm{q}_{4}$ | $\mathrm{q}_{4} \mathrm{q}_{3}$ | $\mathrm{q}_{4} \mathrm{q}_{2}$ | $\mathrm{q}_{4} \mathrm{q}_{1}$ | $\mathrm{q}_{4} \mathrm{q}_{0}$ |  |  |  |  |
|  | $\mathrm{q}_{5}$ | $\mathrm{q}_{5} \mathrm{q}_{4}$ | $\mathrm{q}_{5} \mathrm{q}_{3}$ | $\mathrm{q}_{5} \mathrm{q}_{2}$ | $\mathrm{q}_{5} \mathrm{q}_{1}$ | $\mathrm{q}_{5} \mathrm{q}_{0}$ |  |  |  |  |  |

Most elements occur twice and so the table can be reorganized:

| $\mathrm{q}_{11}^{\prime}$ | $\mathrm{q}_{10}^{\prime} \mathrm{q}^{\prime}{ }_{9}^{\prime}$ | $\mathrm{q}_{8}^{\prime}$ | $\mathrm{q}_{7}^{\prime}$ | $\mathrm{q}_{6}^{\prime}$ | $\mathrm{q}^{\prime}{ }_{5}$ | $\mathrm{q}_{4}^{\prime}$ | $\mathrm{q}_{3}^{\prime}$ | $\mathrm{q}_{2}^{\prime}$ | $\mathrm{q}_{1}^{\prime}$ | $\mathrm{q}^{\prime}{ }_{0}$ |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
|  |  |  |  |  | $\mathrm{q}_{5} \mathrm{q}_{0}$ | $\mathrm{q}_{4} \mathrm{q}_{0}$ | $\mathrm{q}_{3} \mathrm{q}_{0}$ | $\mathrm{q}_{2} \mathrm{q}_{0}$ | $\mathrm{q}_{1} \mathrm{q}_{0}$ |  | $\mathrm{q}_{0}$ |
|  |  |  |  | $\mathrm{q}_{5} \mathrm{q}_{1}$ | $\mathrm{q}_{4} \mathrm{q}_{1}$ | $\mathrm{q}_{3} \mathrm{q}_{1}$ | $\mathrm{q}_{2} \mathrm{q}_{1}$ |  | $\mathrm{q}_{1}$ |  |  |
|  |  |  | $\mathrm{q}_{5} \mathrm{q}_{2}$ | $\mathrm{q}_{4} \mathrm{q}_{2}$ | $\mathrm{q}_{3} \mathrm{q}_{2}$ |  | $\mathrm{q}_{2}$ |  |  |  |  |
|  |  | $\mathrm{q}_{5} \mathrm{q}_{3}$ | $\mathrm{q}_{4} \mathrm{q}_{3}$ |  | $\mathrm{q}_{3}$ |  |  |  |  |  |  |
|  | $\mathrm{q}_{5} \mathrm{q}_{4}$ |  | $\mathrm{q}_{4}$ |  |  |  |  |  |  |  |  |
|  | $\mathrm{q}_{5}$ |  |  |  |  |  |  |  |  |  |  |

The following circuitry performs this function ( $\left.q_{0}^{\prime}=q_{0} ; q_{1}^{\prime}=0\right)$ :


Gate count: 3.500

## 7 The $\Sigma$ - Unit

This is a standard triple input integer adder. The results are 32 bit integers. The alignment unit assures that

$$
\begin{equation*}
10000000 H \leq Q^{2} \leq B F F D 0003 H \tag{25}
\end{equation*}
$$

Gate count: 3.000

## 8 The Scale Unit

The components of the vector and the vector length are scaled so that

$$
\begin{equation*}
10000000 H \leq T^{2} \leq 3 F F F F F F F H \quad \text { or } \quad 2^{14} \leq T \leq 2^{15}-1 . \tag{26}
\end{equation*}
$$

That is, the squared vector length is shifted right two places and each of the components is shifted right one digit if one of the most significant bits of the squared vector length is set. The function is then described as follows:
$T_{j}=Q_{j}^{\prime} \wedge \bar{Q}_{31}^{\prime} \wedge \bar{Q}_{30}^{\prime} \vee Q_{j+2}^{\prime} \wedge\left(Q_{31}^{\prime} \vee Q_{30}^{\prime}\right) \quad$ for $\quad 0 \leq j \leq 29$,
$t_{j}=q_{j} \wedge \bar{Q}_{31}^{\prime} \wedge \bar{Q}_{30}^{\prime} \vee q_{j+1} \wedge\left(Q_{31}^{\prime} \vee Q_{30}^{\prime}\right) \quad$ for $\quad 0 \leq j \leq 13 \quad$ and
$t_{14}=q_{14} \wedge \bar{Q}_{31}^{\prime} \wedge \bar{Q}_{30}^{\prime}$.
Again, truncation errors are subject of consideration (see Section 14 Error Estimation).

## 9 The Square Root Unit

The algorithm is first explained for decimal numbers [GKHK86]. For every two digits of an integer, the integer part of the square root has one digit. If the number of digits is odd, a zero must be placed in front of the integer. The same holds true for decimals, except that a zero must be appended if necessary. Let us consider the following example.

$$
\begin{gathered}
97|26| 41.70 \\
A \quad B \quad C . D
\end{gathered}
$$

We start the calculation with the most significant digit. $A$ is the integer part of the square root of 97 , no matter what follows. Thus, $A=9$. The remainder $R_{A}=16$.
Now let's consider the next two digits. $10 \times A$ is still an approximation of the square root of 9726. The new remainder $R_{A}{ }^{*}=100 \times R_{A}+26=1626$.

If we add $B$, the new square root is $(10 \times A+B)$. By doing so, we increase the square by $20 \times A \times B+B^{2}$. So we can formulate:

$$
\begin{equation*}
R_{A}^{*} \geq 20 \times A \times B+B^{2} ; \tag{30}
\end{equation*}
$$

Since $20 \times A \times B>B^{2}$, we will for the moment neglect $B^{2}$. A rough estimation gives:

$$
\begin{equation*}
B=\left\lfloor R_{A}^{*} / 20 \times A\right\rfloor=\lfloor 1626 / 180\rfloor=9 ; \tag{31}
\end{equation*}
$$

If this value satisfies (30), we have found $B$. In this example, however, this is not the case and therefore we have to decrement $B$ by one. Thus, $B=8$. The remainder $R_{B}$ is given by:

$$
\begin{equation*}
R_{B}=R_{A}^{*}-\left(20 \times A \times B+B^{2}\right)=1626-(1440+64)=122 ; \tag{32}
\end{equation*}
$$

Now we are ready for the next two digits. Again, $10 \times A B$ is still an approximation of the square root of 972641 (Don't get confused with the notation: $A B=98$ in this example, whereas $A \times B=72$ ). The new remainder $R_{B}{ }^{*}=100 \times R_{B}+41=12241$.
We add C to increase the square by $20 \times A B \times C+C^{2}$. So we find:

$$
\begin{equation*}
R_{B}^{*} \geq 20 \times A B \times C+C^{2} ; \tag{33}
\end{equation*}
$$

Another guesstimate gives:

$$
\begin{equation*}
C=\left\lfloor{R_{B}^{*}}^{*} / 20 \times A B\right\rfloor=\lfloor 12241 / 1960\rfloor=6 ; \tag{34}
\end{equation*}
$$

This time (33) is satisfied. The integer part of the root therefore is 986 . For the decimals we can proceed in just the same way:

$$
\begin{gather*}
R_{C}=R_{B}^{*}-\left(20 \times A B \times C+C^{2}\right)=12241-(20 \times 98 \times 6+36)=455 ;  \tag{35}\\
R_{C}^{*}=100 \times R_{C}+70=44570 ;  \tag{36}\\
R_{C}^{*} \geq 20 \times A B C \times D+D^{2} ;  \tag{37}\\
D=\left\lfloor R_{C}^{*} / 20 \times A B C\right\rfloor=\lfloor 44570 / 19720\rfloor=2 ; \tag{38}
\end{gather*}
$$

The final result:

$$
\begin{equation*}
\sqrt{972641.7} \approx 986.2 \tag{39}
\end{equation*}
$$

This procedure can be continued by appending pairs of zeros to the decimals until the required precision is reached.

The advantages of this method will become clear if we consider binary numbers. Again the integer part of the square root has one bit for every pair of bits of the operand. For a better readability we will denote the square bits as $D_{n}$, that is $T^{2}=\left\{D_{29} ; D_{28} \ldots D_{0}\right\}$.

Input operand


Square root

| $\mathrm{T}_{14}$ | $\mathrm{~T}_{13}$ | $\mathrm{~T}_{12}$ | $\mathrm{~T}_{11}$ | $\mathrm{~T}_{10}$ | $\mathrm{~T}_{9}$ |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |

Naturally, the most significant bit $T_{14}$ can only be 0 or 1 (accordingly, there are only two square numbers which fit into two bits: 00 and 01). Thus:

$$
\begin{equation*}
T_{14}=D_{29} \vee D_{28} ; \tag{40}
\end{equation*}
$$

The remainder $R^{14}{ }_{[29.28]}$ is calculated according to

$$
\begin{equation*}
R_{29}^{14}=D_{29} \wedge D_{28} \quad \text { and } \quad R_{28}^{14}=D_{29} \wedge \bar{D}_{28} \tag{41}
\end{equation*}
$$

The square root of the binary number $D_{29} D_{28} D_{27} D_{26}$ is still approximated by $2 \times T_{14}$, the remainder $R^{\star}{ }_{[29.26]}$ is represented by $R_{29}^{14} R_{28}^{14} D_{27} D_{26}$.
If we add $T_{13}$, the square is increased by $4 \times T_{14} \times T_{13}+T_{13}^{2}$. So the following relation must be satisfied:

$$
\begin{equation*}
R_{[29 \ldots 26]}^{*} \geq 4 \times T_{14} \times T_{13}+T_{13}^{2} ; \tag{42}
\end{equation*}
$$

If we assume $T_{13}=1$, then $R^{*}{ }_{[29 \ldots 26]} \geq 4 \times T_{14}+1$.
In other words: $T_{13}=1$ if the above test passes, else $T_{13}=0$. The new remainder $R^{13}{ }_{[29.26]}$ is computed by:

$$
\begin{equation*}
R_{[29 \ldots 26]}^{13}=R_{[29 \ldots 26]}^{*}-4 \times T_{14} \times T_{13}-T_{13}^{2} ; \tag{44}
\end{equation*}
$$

That is, the remainder is left unchanged in the case $T_{13}=0$.
This function is performed by the simple circuitry shown below. The subtractor generates the flag $\bar{N}$ (egative, active low) as result bit which also controls the Multiplexer.


For clarity, we will demonstrate the computation of the next result bit, $T_{12}$. The new intermediate remainder $R^{*}[28 \ldots 24]$ is constructed from $R_{28}^{13} R_{27}^{13} R_{26}^{13} D_{25} D_{24}$. If we add $T_{12}$, the square is increased by $4 \times T_{14} T_{13} \times T_{12}+T_{12}^{2}$.
$T_{12}=1$ if $R^{*}{ }_{[28 \ldots 24]} \geq 4 \times T_{14} T_{13}+1$.
Depending on the result of this compare operation, the new remainder is either left unchanged or
$R_{[27 \ldots 24]}^{12}=R_{[28 \ldots 24]}^{*}-4 \times T_{14} T_{13}-1$.
Just as we did for decimal numbers, we can repeat this calculation until the required precision is obtained.
Note that $R_{29}^{13}$ and $R_{28}^{12}$ have been dismissed. They are always 0 . In general, the remainder has at most one digit more than the root. This can be shown as follows. Consider the integer numbers Z and N , where N is the integer part of the square root of Z .


For the remainder $R$ we can formulate:

$$
\begin{gather*}
R \leq(N+1)^{2}-N^{2}-1 ;  \tag{47}\\
R \leq N^{2}+2 N+1-N^{2}-1 ;  \tag{48}\\
R \leq 2 N ; \tag{49}
\end{gather*}
$$

The block diagram on the next page shows the circuitry for 30 bit integers $D_{[29 . .0]}$. Except for the first one a register is inserted after each stage. The multiplexers and registers at the outputs of the subtractors have been merged into a single symbol.
In general, for the calculation of the integer part of a square root of a number with $N$ bits (where $N$ is assumed even), we need $N / 2-1$ subtractors, starting with a 4 -bit and ending with a (N/2 + 2) - bit subtractor. N/2-2 multiplexers are also required, starting with a 3 -bit, ending with a $N / 2$-bit multiplexer.
In this particular case, the Square Root Unit has 14 stages, 14 subtractors from 4 to 17 bits, 13 multiplexers from 3 to 15 bits and 433 register bits.
Gate count: 6.000.

## 10 Component FIFO

This is a $14 \times 49$ bit memory including one valid bit per vector. The components FIFO should be realized as a register pipeline (as opposed to the usual "fall-through" - architecture of FIFOs), so that the components and the vector length arrive at the same time at the inputs of the divide units without special control circuitry.
Gate count: 5.000


## 11 The Division Unit

The components $t$ are taken from the component FIFO as 15 bit unsigned integers preceded by a sign bit. The vector length $T$ arrives as a 15 bit unsigned integer as well. Thus, there are three unsigned division pipelines, as for example explained in [Hoff82], to be constructed. The results $\left\{m_{x} ; m_{y} ; m_{z}\right\}$ shall be computed as 15 bit fixed point numbers.
Since $0 \leq a \leq V, 0 \leq m \leq 1$. In the first instance we assume that

$$
\begin{equation*}
m=\sum^{-14} m_{j} 2^{j} \quad \text { where } \quad m=|n| \tag{50}
\end{equation*}
$$

The algorithm shall be explained by an example where $t=011100111010011$ and $T=100001111010111$. The quotient is computed bitwise using 16 bit two's complement arithmetic.
$m_{0}=1$ if $t-T \geq 0$. The light grey cells contain the sign extensions of the operands. The dark cell holds the inverted result bit $m_{0}$.

|  | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Bit-Position |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
|  | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | Component $t$ |
| +1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 2's Compl. of $T$ |  |
| $=$ | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | Remainder $\mathrm{R}_{[15.0]}^{0}$ |  |

If $m_{0}=0$, the remainder $R^{0}$ must be corrected by adding $T$. Then, $R^{-1}=R^{0}-T / 2$, and $m_{-1}=1$ if $R^{-1} \geq 0$. However, the same can be achieved by adding $T / 2$ to $R^{0}$ if $m_{0}=0$ and subtracting $T / 2$ from $R^{0}$ if $m_{0}=1$.

|  | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | -1 | Bit-Position |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- | :--- |
| $\nabla$ | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | Remainder $\mathrm{R}_{[15 \ldots-1]}$ |  |
| + | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | $T / 2$ |  |
| $=$ | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | Remainder $\mathrm{R}^{-1}[14 \ldots-1]$ |  |

Note that the result bits can be excluded from further calculation since

$$
\begin{align*}
& \left|R^{0}\right|=|t-T| \leq T  \tag{51}\\
& \left|R^{-1}\right|=\left|\left|R^{0}\right|-T / 2\right| \leq T / 2 ;  \tag{52}\\
& \left|R^{-2}\right|=\left|\left|R^{-1}\right|-T / 4\right| \leq T / 4 \text { and so forth. } \tag{53}
\end{align*}
$$

Thus, the width of the required ALUs remains constant throughout the complete pipeline. The computation is continued in this way until the required precision is reached. The last remainder is discarded.

However, this scheme makes no good use of the available precision. $m_{0}$ is set only in the case $t=T$. To increase the precision, we use the following format instead:

$$
m=\sum_{j=-1}^{-15} m_{j} 2^{j}
$$

The maximum error is then reduced to $2^{-15}$
For $t=T, m$ is expressed as 0.111111111111111 . This is achieved by assuming $m_{0} \equiv 0$ and starting the computation with the operation $t-T / 2$.
The first step is given below:

|  | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | -1 | Bit-Position |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
|  | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | Component $t$ |
| + | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 2's Compl. of $T / 2$ |
| $=$ | $\mathbf{8}$ | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | Remainder $\mathrm{R}_{[14 \ldots-1]}^{-1}$ |

The circuitry shown on the next side performs this function. Each pipeline stage computes one result bit. The three division pipelines consume approximately 40.000 gates.

## 12 The Sign Unit (Outputs)

The sign units at the outputs perform the inverse function as the sign units at the inputs, however, the arithmetic operation is the same. The 15 bit positive components $m$, which are preceded by a sign flag $S$, are converted into16 bit two's complement components $n$.
Again we formulate:

$$
\begin{align*}
n_{-15} & =m_{-15} ;  \tag{55}\\
n_{-14} & =\bar{S}_{-14} \vee S\left(\bar{m}_{-14} m_{-15} \vee m_{-14} \bar{m}_{-15}\right)=\bar{S}_{-14} \vee S\left(m_{-14} \oplus m_{-15}\right) ;  \tag{56}\\
n_{-13} & =\bar{S}_{m_{-13}} \vee S\left(\bar{m}_{-13}\left(m_{-14} \vee m_{-15}\right) \vee m_{-13} \bar{m}_{-14} \bar{m}_{-15}\right)  \tag{57}\\
& =\bar{S} m_{-13} \vee S\left(m_{-13} \oplus\left(m_{-14} \vee m_{-15}\right)\right) ;  \tag{58}\\
\ldots &  \tag{59}\\
n_{-1} & =\bar{S}_{m_{-1}} \vee S\left(m_{-1} \oplus\left(m_{-2} \vee m_{-3} \vee \ldots \vee m_{-14} \vee m_{-15}\right)\right) .  \tag{60}\\
n_{0} & =S .
\end{align*}
$$

Gate count: 1.500

## 13 Control Structure

If the component FIFO is realized as a register pipeline (as opposed to the usual "fall-through"architecture of FIFOs), there is no internal control structure required. All operands travel the same distance and so the chip just has to be clocked.
Provisions are made to freeze the pipeline. The activation of an external signal masks the clock. This circuitry is designed very carefully to avoid spikes on the internal clock lines.
The valid flags, one for each stage, must be reset during initialization. The valid flag must be held active at the inputs whenever a vector is clocked in. Normalized vectors are available as long as the valid vector output maintains an active state.
"Design-for-Testability" features are also taken into account. We use scan-path flipflops for all registers to construct one or more scan chains.


One of three Division Pipelines

## 14 Error Estimation

Incoming components $v$ are considered to be "true values". The normalization of $\vec{V}$ without
 $\stackrel{\rightharpoonup}{N}=\vec{N}_{E}+\Delta \vec{N}$.
The sign units at the inputs operate precision conserving, that is

$$
\begin{equation*}
a=|v| . \tag{61}
\end{equation*}
$$

Depending on their size, the operands are possibly right shifted and truncated by the alignment unit and the scale unit. For simplicity, let's assume that the error $\Delta t$ is defined by

$$
\begin{equation*}
-1 \leq \Delta t \leq 0 \tag{62}
\end{equation*}
$$

Due to this discretization of the components, a change in direction of the normalized vector might occur. Instead of the vector $\vec{V}=\left\{v_{x} ; v_{y} ; v_{z}\right\}$, the vector $\vec{T}=\left\{t_{x} ; t_{y} ; t_{z}\right\}$ is normalized. Provided this computation is carried out accurately, the maximum deviation occurs for
$\vec{T}=\left\{V_{\min } ; 0 ; 0\right\}$ and $\Delta t_{y}=\Delta t_{z}=-1 \quad$ where $\quad V_{\min }=2^{14}$.
The error vector $\vec{M}_{D}$ is then defined by:

$$
\begin{equation*}
\vec{M}_{D} \approx\left\{0 ;-2^{-14} ;-2^{-14}\right\} \quad \text { where } \quad M_{D}=8,63 \times 10^{-5} \tag{64}
\end{equation*}
$$

Any other permutation of the components in (63) and (64) yields the same result for $M_{D}$. However, there might be an error in $T$, so that $\vec{T}$ is not scaled properly. We have to distinguish two cases:
1.) There was no shift operation in the scale unit,
$T^{2}$ is the true squared length of $\vec{T}$. However, the limited precision of the square root unit causes a truncation error $\Delta T$. For simplicity, we assume that

$$
\begin{equation*}
-1 \leq \Delta T \leq 0 \tag{65}
\end{equation*}
$$

## 2.) The scale unit performed a right shift operation.

The squared vector length is divided by 4 and the two LSBs are discarded. After this operation, the range of $T^{2}$ is given by

$$
\begin{equation*}
10000000 H \leq T^{2} \leq 2 F F F 4000 H \tag{66}
\end{equation*}
$$

and therefore, the truncation error of $T^{2}$ can be neglected. So it can be said that

$$
\begin{equation*}
T^{2}=\left(\frac{q_{x}}{2}\right)^{2}+\left(\frac{q_{y}}{2}\right)^{2}+\left(\frac{q_{z}}{2}\right)^{2} \tag{67}
\end{equation*}
$$

On the other hand,

$$
\begin{equation*}
\stackrel{\rightharpoonup}{T}=\left\{t_{x} ; t_{y} ; t_{z}\right\}=\left\{\left\lfloor\frac{q_{x}}{2}\right\rfloor ;\left\lfloor\frac{q_{y}}{2}\right\rfloor ;\left\lfloor\frac{q_{z}}{2}\right\rfloor\right\} \tag{68}
\end{equation*}
$$

Taking the truncation error of the square root unit into account, the resulting error $\Delta T$ is then given by

$$
\begin{equation*}
-1 \leq \Delta T \leq 0.5 \times \sqrt{3} . \tag{69}
\end{equation*}
$$

For a given $\Delta T$, the resulting vector length $M$ is given by:

$$
\begin{equation*}
M=\sqrt{\left(\frac{t_{x}}{T+\Delta T}\right)^{2}+\left(\frac{t_{y}}{T+\Delta T}\right)^{2}+\left(\frac{t_{z}}{T+\Delta T}\right)^{2}}=\sqrt{\frac{t_{x}^{2}+t_{y}^{2}+t_{z}^{2}}{(T+\Delta T)^{2}}}=\frac{T}{T+\Delta T} \approx 1-\frac{\Delta T}{T} \tag{70}
\end{equation*}
$$

For the moment we assume that the divide units operate at infinite precision. Then the error vector $\vec{M}_{S}$ is given by:

$$
\begin{equation*}
\vec{M}_{S}=s \times \vec{M} \quad \text { where } \quad-\sqrt{3} \times 2^{-15} \leq s \leq 2^{-14} \tag{71}
\end{equation*}
$$

The maximum truncation error of the divide units is $-2^{-15}$ for each component. This produces an additional error vector $\vec{M}_{T}$, given by:

$$
\begin{equation*}
\vec{M}_{T}=\left\{-2^{-15} \ldots 0 ;-2^{-15} \ldots 0 ;-2^{-15} \ldots 0\right\} \tag{72}
\end{equation*}
$$

The error vector $\Delta \vec{N}$ is then defined by:

$$
\begin{equation*}
\Delta \stackrel{\rightharpoonup}{N}=\vec{M}_{D}+\vec{M}_{S}+\vec{M}_{T} \tag{73}
\end{equation*}
$$

We assume further that $\vec{M}_{D} \perp \vec{M}_{S}$.
The error vector of maximum magnitude is finally given by:

$$
\begin{equation*}
\Delta \vec{N}=\left\{ \pm 2^{-14} ;-3 \times 2^{-15} ;-3 \times 2^{-15}\right\} \quad \text { and } \quad \Delta N=1.43 \times 10^{-4} \tag{74}
\end{equation*}
$$

or any permutation of the components. The sign units at the outputs again operate precision conserving.

## 15 Design Complexity

The total number of gates needed for the functional units is approximately 70.000 . Assuming a $50 \%$ array utilization, which should be achievable in consideration of the regular structure of the chip, a 140.000 gates master is needed.

## 16 Conclusion

We presented a single-chip VLSI solution to one of the essential tasks in computer graphics, the normalization of vectors. This approach is superior over other hardware solutions such as look-up tables or micro-programmed ALUs, because it achieves maximum speed at minimum costs. Advances in VLSI technology can directly be exploited to increase clock frequency and to place multiple vector normalizers along with additional functional units onto a single chip, so that a complete Phong shader with a generation rate of 100 M pixel/s will be feasible as a single-chip device in the near future.

## 17 Acknowledgments

This work is supervised by Prof. Strasser and is part of the advanced graphics accelerator project at WSI/GRIS, University of Tuebingen, supported partially by the CEC's ESPRIT programme. Claus Dreischer, who is currently implementing the design, gave valuable suggestions and provided the gate counts. Thanks to Andreas Schilling for many helpful discussions.

## 18 References

[BuiT75] Phong Bui-Tuong, "Illumination for Computer-Generated Pictures", CACM, Vol. 18, No. 6, June 1975, pages 311-317
[GKHK86] S. Gottwald, H. Küstner, M. Hellwich and H. Kästner (Edts.), "Handbuch der Mathematik", Buch und Zeit Verlagsgesellschaft, D-5000 Köln, 1986, pages 4445
[Hoff82] Rolf Hoffmann, "Rechenwerke und Mikroprogrammierung", Oldenbourg Verlag, D-8000 München, 1982, pages 85-96
[Knit93] Günter Knittel, "VERVE - Voxel Engine for Real-time Visualization and Examination", presented at the Eurographics Conference 93, Barcelona, September 6-10, 1993


[^0]:    ${ }^{*} I_{A}$ : ambient light, $I_{L}$ : light coming from the light source, $\mathrm{k}_{\mathrm{a}}, \mathrm{k}_{\mathrm{d}}, \mathrm{k}_{\mathrm{s}}$ : ambient, diff reflection coefficients, C : color of the object, n : specular reflection exponent

    Eelíographics Digital Library

