§ A hacker's guide to numerical analysis


Life may toss us ill-conditioned problems, but it is too short
to settle for unstable algorithms. - D.P. O'Leary

§ Measures of error


If xx is a number and x^\hat x is its approximation, then the are two notions of error:
  1. absolute errror: xx^|x - \hat x|.
  2. relative error: xx^/x|x - \hat x|/|x|.

Since the relative error is invariant under scaling (xαx)(x \mapsto \alpha x), we will mostly be interested in relative error.

§ Significant digits


The significant digits in a number are the first nonzero digit and all succeeding digits. Thus 1.7320 has five significant digits. 0.0491 has only three significant digits. It is not transparent to me why this definition is sensible .

§ Correct Significant digits --- a first stab


We can naively define x^\hat x agrees to xx upto pp significant digits if x^\hat x and xx round to the same number upto pp significant digits. This definition is seriously problematic. Consider the numbers:

Here, yy has correct one and three significant digits relative to xx, but incorrect 2 significant digits, since the truncation at x2x_2 and y2y_2 do not agree even to the first significant digit.

§ Correct Significant digits --- the correct definition


We say that x^\hat x agress to xx upto pp significant digits if xx^|x - \hat x| is less than half a unit in the pth significant digit of xx.

§ Accuracy v/s precision



Accuracy is not limited by precision : By using fixed precision arithmetic, we can emulate arbitrary precision arithmetic. The problem is that often this emulation is too expensive to be useful.

§ Backward, Forward errors


Let y=f(x)y = f(x), where f:RRf: \mathbb R \rightarrow \mathbb R. Let us compute y^\hat y as an approximation to yy, in an arithmetic of precision uu. How do we measure the quality of y^\hat y?
  1. In many cases, we maybe happy with an y^\hat y such that the relative error between yy and y^\hat y is equal to uu: we did the best we can with the precision that was given. This is the forward error .
  2. An alternative question we can ask is, for what δx\delta x do we have that y^=f(x+δx)\hat y = f(x + \delta x). That is, how far away from the input do we need to stray, to get a matching output? There maybe many such δx\delta x, so we ask for minδx\min |\delta x|. We can divide this error by xx as a normalization factor. This is the backward error .


There are two reasons we prefer backward error.
  1. It converts error analysis into "data analysis". The data itself tends to be uncertain. If the error given by the backward analysis is smaller than the data uncertainty, then we can write off our error as being too small. Since for all we know, we have 'fixed' the uncertain data with our small error.
  2. It reduces the question of error analysis into perturbation theory, which is very well understood for a large class of problems.

§ Backward stable


A method for computing y=f(x)y = f(x) is called backward stable if it produces a y^\hat y with small backward error. That is, we need a small δx\delta x such that y^=f(x+δx)\hat y = f(x + \delta x).

§ Mixed forward-backward error


We assume that addition and subtraction are backward stable, where uu is the number of significant digits to which our arithmetic operations can be performed:
x±y=x(1+Δ)±y(1+Δ)Δu x \pm y = x(1 + \Delta) \pm y(1 + \Delta) \forall |\Delta| \leq u

Another type of error we can consider is that of the form:
y^+δy=f(x+Δx) \hat y + \delta y = f(x + \Delta x)

That is, for a small perturbation in the output (δy)(\delta y), we can get a backward error of δx\delta x. This is called as mixed forward backward error .

We can say that an algorithm with mixed-forward-backward-error is stable iff:
y^+δy=f(x+Δx)Δy/y^<ϵΔx/x^<ηϵ,η are small \begin{aligned} &\hat y + \delta y = f(x + \Delta x) \\ &|\Delta y|/|\hat y| < \epsilon \\ &|\Delta x|/|\hat x| < \eta \\ &\text{$\epsilon, \eta$ are small} \end{aligned}

This definition of stability is useful when rounding errors are the dominant form of errors.

§ Conditioning


Relationship between forward and backward error is govered by conditioning : the sensitivity of solutions to perturbations of data. Let us have an approximate solution y^=f(x+δx)\hat y = f(x + \delta x). Then:
y^y=f(x+δx)f(x)=f(x)δx+O((δx)2)(y^y)/y=(xf(x)/f(x))(Δx/x)+O((Δx)2)(y^y)/y=c(x)(Δx/x)+O((Δx)2)c(x)xf(x)/f(x) \begin{aligned} &\hat y - y = f(x + \delta x) - f(x) = f'(x) \delta x + O((\delta x)^2) \\ &(\hat y - y)/y = (x f'(x)/f(x)) (\Delta x/x) + O((\Delta x)^2) \\ &(\hat y - y)/y = c(x) (\Delta x/x) + O((\Delta x)^2)\\ &c(x) \equiv |x f'(x)/f(x)| \end{aligned}

The quantity c(x)c(x) measures the scaling factor to go from the relative change in output to the relative change in input. Note that this is a property of the function ff, not any particular algorithm.

§ Example: logx\log x


If f(x)=logxf(x) = \log x, then c(x)=(x(logx))/logx=1/logxc(x) = |(x (\log x)') / \log x| = |1/\log x|. This quantity is very large for x1x \simeq 1. So, a small change in xx can produce a drastic change in logx\log x around 11.

§ Rule of thumb


We now gain access to the useful rule:
forward errorcondition number×backward error \text{forward error} \lesssim \text{condition number} \times \text{backward error}


§ Forward stable


If a method produces answers with forward errors of similar magnitude to those produced by a backward stable method, then it is called forward stable. Backward stability implies forward stability, but not vice-versa (TODO: why?)

§ Cancellation


Consider the following program:
#include <cmath>
#include <stdio.h>

int main() {
    double x = 12e-9;
    double c = cos(x);
    double one_sub_c = 1 - c;
    double denom = x*x;
    double yhat = one_sub_c / denom;

    printf("x:         %20.16f\n"
           "cx:        %20.16f\n"
           "one_sub_c: %20.16f\n"
           "denom:     %20.16f\n"
           "yhat:      %20.16f\n",
            x, c, one_sub_c, denom, yhat);
}

which produces the output:
x:           0.0000000120000000
cx:          0.9999999999999999
one_sub_c:   0.0000000000000001
denom:       0.0000000000000001
yhat:        0.7709882115452477

This is clearly wrong , because we know that (1cosx)/x2)1/2(1-\cos x)/x^2) \leq 1/2. The reason for this terrible result is that:
In general:
x1+ϵerror of order ϵy1x=ϵvalue of order ϵ \begin{aligned} &x \equiv 1 + \epsilon \text{error of order $\epsilon$} \\ &y \equiv 1 - x = \epsilon \text{value of order $\epsilon$} \\ \end{aligned}

That is, subtracting values close to each other (in this case, 11 and xx) converts error order of magnitude into value order of magnitude . Alternatively, it brings earlier errors into promience as values.

§ Analysis of subtraction


We can consider the subtraction:
x=ab;x^=a^b^a^=a(1+Δa)b^=b(1+Δb)xx^x=aΔabΔbab=aΔabΔbab=aΔa+bΔbabmax(Δa,Δb)(a+b)ab \begin{aligned} &x = a - b; \hat x = \hat a - \hat b \\ &\hat a = a(1 + \Delta a) \\ &\hat b = b(1 + \Delta b) \\ &\left| \frac{x - \hat x}{x} \right| \\ &= \left| \frac{-a\Delta a - b\Delta b}{a - b} \right| \\ &= \frac{|-a\Delta a - b\Delta b|}{|a - b|} \\ &= \frac{|a\Delta a + b\Delta b|}{|a - b|} \\ &\leq \frac{\max(|\Delta a|, |\Delta b|) (|a| + |b|)}{|a - b|} \end{aligned}

This quantity will be large when aba+b|a - b| \ll |a| + |b|: that is, when there is heavy cancellation in the subtraction to compute xx.

§ Underflow


#include <cmath>
#include <stdio.h>

int main() {
    double x = 1000;
    for(int i = 0; i < 60; ++i) {
        x = sqrt(x);
    }
    for(int i = 0; i < 60; ++i) {
        x = x*x;
    }
    printf("x: %10.20f\n", x);
}

This produces the output:
./sqrt-pow-1-12
...
x: 1.00000000000000000000

That is, even though the function is an identity function, the answer collapses to 1. What is happening?

§ Computing (ex1)/x(e^x - 1)/x

One way to evaluate this function is as follows:
double f(double x) { return x == 0 ? 1 : (pow(M_E, x) - 1) / x; }

This can suffer from catastrophic cancellation in the numerator. When xx is close to 00, exe^x is close to 1, and ex1e^x - 1 will magnify the error in exe^x.
double f(double x) {
   const double y = pow(M_E, x);
   return y == 1 ? 1 : (y - 1) / log(y);
}

This algorithm seems crazy, but there's insight in it. We can show that the errors cancel! The idea is that neither (y1)(y - 1) nor logy\log y are particularly good, the errors accumulated in them almost completely cancel out, leaving out a good value:
assume y^=11=y^ex(1+δ)log1=log(ex)+log(1+δ)x=log(1+δ)x=δ+O(δ2) \text{assume $\hat y = 1$} \\ 1 = \hat y \equiv e^x(1 + \delta) \\ \log 1 = \log (e^x ) + \log(1 + \delta) \\ x = -\log(1 + \delta) \\ x = -\delta + O(\delta^2)

If y^1\hat y \neq 1:
f^=(y^1)/logy^=(1+ϵ3)(y^1)(1+ϵ+1)/(logy^(1+ϵ2)) \hat f = (\hat y - 1)/\log{\hat y} = (1+\epsilon_3)(\hat y - 1)(1 + \epsilon+1)/(\log \hat y(1 + \epsilon_2))

§ IEEE floating point fun: +0 and -0 for complex analysis


Rather than think of +0 and -0 as distinct numerical values, think of
their sign bit as an auxiliary variable that conveys one bit of information
(or misinformation) about any numerical variable that takes on 0 as its
value.

We have two types of zeroes, +0 and -0 in IEEE-754. These are used in some cases. The most famous is that 1/+0=+1/+0 = +\infty, while 1/0=1/-0 = -\infty. Here, we proceed to discuss some complex-analytic considerations.
Therefore. implementers of compilers
and run-time libraries bear a heavy burden of attention to detail
if applications programmers are to realize the full benefit of the
IEEE style of complex arithmetic. That benefit deserves Some
discussion here if only to reassure implementers that their
assiduity will be appreciated.

1+0i=+0+i10i=+0i \sqrt{-1 + 0 i} = +0 + i \\ \sqrt{-1 - 0 i} = +0 - i \\
These will ensure that z=(z)\sqrt{z*} = (\sqrt{z})*:
copysign(1,+0)=+1copysign(1,0)=1 \texttt{copysign}(1, +0) = +1 \\ \texttt{copysign}(1, -0) = -1 \\

These will ensure that copysign(x,1/x)=x\texttt{copysign}(x, 1/x) = x when x=±x = \pm \infty.
An example is provided where the two limits:
f(x+i0)=limy0f(x+iy)f(x+i0)=limy0f(x+iy) \begin{aligned} &f(x + i0) = \lim_{y \rightarrow 0-} f(x + i y) \\ &f(x + i0) = \lim_{y \rightarrow 0-} f(x + i y) \\ \end{aligned}

§ Complex-analytic considerations


The principal branch of a complex function is a way to select one branch of a complex-function, which tends to be multi-valued. A classical example is the argument function, where arg(reiθ=θ\arg(r e^{i \theta} = \theta. However, this is ambiguous: we can map θθ+2π\theta \mapsto \theta + 2 \pi and still have the same complex number. So, we need to fix some standard. We usually pick the "branch" where 0θ<2π0 \leq \theta < 2 \pi. In general, we need to carefully handle what happens to the function at the discontinuity.
What deserves to be undermined is blind faith in the power of Algebra. We
should not believe that the equivalence class of expressions that all
describe the same complex analytic function can be recognized by algebraic
means alone, not even if relatively uncomplicated expressions are the only
ones considered.

§ References