6. Floating Point

February 19, 2021

Error in math comes from representation and arithmetic. Floating points is like binary scientific notation.

Definition (Normal floating point number). A normal floating point number has the form

\[(-1)^s \times (1 + 0.b_1\dots b_p)_2 \times 2^E.\]

The IEEE standard for 64-bit has 1 bit for the $s$, 11 bits for $E$, and $52$ for the fraction. $E \in [-1022, 1023]$.

A 32-bit number has 8 bits for the exponent and 23 for the fraction.


If a number if too large, then we have overflow.

If a number is too small, then the default behavior is to return a subnormal.

If a number is nonsensical $0/0$, then NaN is returned.

Round-off Errors

To represent floating points, define $fl(x)$ such that $fl(x)$ is the floating representation of $x\in \R$. Define representation error as such:

Relative representation error


This is actually quite small. Let $\epsilon$ be machine epsilon.

Corollary. $fl(x) = x(1 + \delta)$ and $ \delta < \epsilon$.

Proof. This is the maximum relative representation error of a number.

\[\epsilon = \frac{1}{2}(\text{nextfloat}(1.0) - 1.0)\\ = \frac{1}{2} (1 + 2^{-52} - 1) \approx 10^{-16}.\]

Roundoff error in arithmetic

Let $fl(x+y)$ be the closet floating point representation of $x+y$. Assuming no overflow or underflow, basic operations have relative error less than or equal to epsilon.

\[\hat z = fl(x + y). \\\hat z = z(1 + \delta)\]

Catastrophic Cancellation

If $\hat x = x(1 + \delta)$ and $\hat y = y(1 + \delta_2$ are floating point approximations to $x$ and $y$ that are very close, then $fl(\hat x - \hat y)$ may be a poor approximation to $x-y$ due to cancellation.

Example. Consider the quadratic formula where $a = 1$. The problem occurs when the smallest root is close to machine precision.

\[\begin{gather} x = \frac{-b\pm \sqrt{b^2-4c}}{2}\\ x = \frac{-b \pm \sqrt{b^2 - 4c}(1 + \delta_1)}{2}(1 + \delta_2)\\ = -b -b \delta_2 + \sqrt{b^2-4c}(1 +\delta_1 + \delta_2 + \delta_1\delta_2)\\ b \approx \frac{1}{\sqrt{\epsilon}}, c \approx 1\\ \end{gather}\]

Error Analysis

Suppose we want to compute the sum of $n$ numbers. The error is bounded by $n \epsilon$.

\[\|x - \tilde x\|/ \|x\| \leq n \epsilon.\]

From backwards stability, $\forall x$, $\tilde f(x) = f(\tilde x)$. $|x - \hat x| / |x| = O(\epsilon)$.