Floating Point Arithmetic
Floating point system
Floating point arithmetic
IEEE flaoting point standard

Floating point arithmetic
Loss of associativity in floating point system
Let x be a nonzero floating point number such that
1+x=1. A simple example of loss of associativity is given by
the sum x+1-1. Clearly (x+1)+ (-1) = 0 which is not the
same as x + ( 1 + (-1) ) =x.
Another example
Let m be the smallest floating point number and M be the largest, then
( 10-2 x m ) x M and 10-2
x ( m x M ) are not the same, since (10-2 x m )
underflows to zero.
In a computer system where the number representation is normalized 4-digit
mantissa and 2-digit exponent, consider the following calculations, assuming
rounding is used in the 4-digit arithmetic registers,
- p+q+r+s= 20
- p+q+s+r= 26
- q+r+p+s= 30
where p= 75640, q= 4, r= 6, s= -75620. Explain the results.
Answer:
The floating point representations of the numbers are:
p= 0.7564 x 105, q= 0.4000 x 101, r= 0.6000
x 101, s= -0.7562 x 105.
Under the usual rule of adding from left to right:
p+q+r+s= 20
| p+q: |
| 0.7564 x 105 |
0.7564 x 105 |
| +) 0.4000 x 101 |
+) 0.0000 x 105
( matching exponent) |
|
-------------------- |
|
0.7564 x 105 |
|
| p+q+r: |
| 0.7564 x 105 |
0.7564 x 105 |
| +) 0.6000 x 101 |
+) 0.0000 x 105
( matching exponent) |
|
-------------------- |
|
0.7564 x 105 |
|
| p+q+r+s: |
| 0.7564 x 105 |
0.7564 x 105 |
| +)x 105 |
+) -0.7562 x 105 |
|
----------------------- |
|
0.0002 x 105
= 0.2000 x 102 |
|
After normalizing and adding spurious zeros, get final
result with only 1 reliable digit.
p+q+s+r= 26
| p+q: |
| 0.7564 x 105 |
0.7564 x 105 |
| +) 0.4000 x 101 |
+) 0.0000 x 105
( matching exponent) |
|
-------------------- |
|
0.7564 x 105 |
|
| p+q+s: |
| 0.7564 x 105 |
0.7564 x 105 |
| +)x 105 |
+) -0.7562 x 105 |
|
----------------------- |
|
0.0002 x 105
= 0.2000 x 102 |
|
| p+q+s+r: |
| 0.2000 x 105 |
0.2000 x 102 |
| +) 0.6000 x 101 |
+) 0.060 x 102
(matching exponent) |
|
-------------------- |
|
0.2600 x 102 |
|
Again get final result with only 1 (not 2) reliable digit.
q+r+p+s= 30
| q+r: |
| 0.4000 x 101 |
0.4000 x 101 |
| +) 0.6000 x 101 |
+) 0.6000 x 101 |
|
-------------------- |
|
1.000 x 101 =
0.1000 x 102 |
|
| q+r+p: |
| 0.1000 x 102 |
0.0001 x 105
( matching exponent) |
| +) 0.7564 x 105 |
+) 0.7564 x 105 |
|
-------------------- |
|
0.7565 x 105 |
|
| q+r+p+s: |
| 0.7565 x 105 |
0.7565 x 105 |
| +)x 105 |
+) -0.7562 x 105 |
|
----------------------- |
|
0.0003 x 105=
0.3000 x 102 |
|
After normalizing and adding spurious zeros, again get final result with
only 1 reliable digit.
{\bf More on floating point System}
5-bit representation
$x = \sigma \overline{x} 2^e$ with $\frac{1}{2} \leq \overline{x} <
1.$
i.e. $ \overline{x}= 0.1**...*$ in binary
\begin{verbatim}
sign exponent mantissa
position 1 2-3 4-5
# bits 1 2 3 (including hidden bit)
\end{verbatim}
Consider only positive numbers here : (sign= 1)
\begin{verbatim}
exponent : 00 01 10 11
value : * -1 0 1 ==> range of exponent [-1,1]
if exponent is 00, the number is 0
exponent mantissa number in binary number in decimal
00 any 0 0
01 00 0.100 x 2^-1= 0.01 0.25
01 01 0.101 x 2^-1= 0.0101 0.3125
01 10 0.110 x 2^-1= 0.0110 0.375
01 11 0.111 x 2^-1= 0.0111 0.4375
10 00 0.100 x 2^0= 0.1 0.5
10 01 0.101 x 2^0= 0.101 0.625
10 10 0.110 x 2^0= 0.110 0.75
10 11 0.111 x 2^0= 0.111 0.875
11 00 0.100 x 2^1= 1.0 1
11 01 0.101 x 2^1= 1.01 1.25
11 10 0.110 x 2^1= 1.10 1.5
11 11 0.111 x 2^1= 1.11 1.75
\end{verbatim}
Other possibilities
(i) Inclusion of other `numbers':
\begin{verbatim}
00 00 0 0
00 01 + infinity
00 10 - infinity
00 11 NaN
\end{verbatim}
(ii) More bits for exponent (e.g. 3-bit exponent, 2-bit mantissa)
\begin{verbatim}
exponent : 000 001 010 011 100 101 110 111
value : * -3 -2 -1 0 1 2 3
==> range of exponent [-3, 3]
exponent mantissa number in binary number in decimal
00 any 0 0
001 0 0.10 x 2^-3= 0.0001 0.0625
001 1 0.11 x 2^-3= 0.00011 0.
010 0 0.10 x 2^-2= 0.001 0.125
010 1 0.11 x 2^-2= 0.0011 0.1875
011 0 0.10 x 2^-1= 0.01 0.25
011 1 0.11 x 2^-1= 0.011 0.375
100 0 0.10 x 2^0= 0.1 0.5
100 1 0.10 x 2^0= 0.11 0.625
101 0 0.10 x 2^1= 1. 1
101 1 0.11 x 2^1= 1.1 1.5
110 0 0.10 x 2^2= 10 2
110 1 0.11 x 2^2= 11 3
111 0 0.10 x 2^3= 100 4
111 1 0.11 x 2^3= 110 6
\end{verbatim}
Worst possible error in chopping: \begin{verbatim}|------------x|\end{verbatim}
Worst possible error in rounding: \begin{verbatim}|------x------|\end{verbatim}
% | x- fl(x) | < 2^{-n}
\begin{enumerate}
\item The machine epsilon $\neq$ smallest positive floating point number
smallest positive floating point number is given by
mantissa bits all zero
exponent bits give most negative exponent
e.g. vax: $0.100..0 x 2^{-127}$
IEEE: $1.000..0 x 2^{-126}$
Machine epsilon: smallest number s.t. $1+ \epsilon \neq 1 $
The distance between 1 and the next larger floating point number is
2* (machine epsilon).
\noindent [Remark: use of chopping/rounding affects the value of the machine epsilon
because of the addition operation in the definition]
\begin{itemize}
\item rounding: $2^{1-n}$ (more generally $\beta^{1-n}$)
\item chopping: $2^{-n}$ ( $\frac{1}{2} \beta^{1-n}$)
\end{itemize}
where $n$ is the no. of binary digits in the mantissa (including any
hidden bit)
e.g. (rounding) vax and IEEE: $n=24$, so machine epsilon= $2^{-23}$
Note: this is much LARGER then the smallest positive floating point number.
\item The `largest integer' $M$ is represented exactly in the floating point system;
-- All integers in $[0,M]$ are exactly representable in the floating point
system.
Mantissa bits: 11...1 (k-1 binary digits) [excluding the hidden bit]
exponent bits => gives a value $2^k$ : this is ok if k lies within the
range of the exponent
e.g. vax: exponent range [-127,127], k= 24;
IEEE: exponent range [-126,127], k= 24.
vax type: $0.1\quad 11...1 \times 2^k $
\{ k binary digits (including hidden bit) behind 'binary' decimal point\}
IEEE type: $1.11...1 \times 2^{k-1}$
\{ k-1 binary digits (including hidden bit) behind 'binary' decimal point\}
\{if $2^k$ were used above, cannot guarantee all numbers below M are
representable\}
\[ \mbox{value}= (1/2+ 1/2^2+ \cdots + 1/2^k) 2^k = \frac{1}{2}
{1- 1/2^k}{1- 1/2}= 2^k - 1 \]
If exponent $2^{k+1}$ is also representable in the exponent part (usually
ok), we have $M=2^k$: set mantissa bits to 00...0 .
Now all numbers between 0 and $2^k$ are representable exactly, however,
$2^{k}+ 1$ cannot be exactly represented since we run out of binary bits.
\end{enumerate}
Tools for handling floating point error:
\begin{itemize}
\item forward error analysis- tedious
\item backward error analysis- difficult but worthwhile
\item interval analysis
\item multiple precision calculation
\item error free computation
\end{itemize}
|