§ Comparison of forward and reverse mode AD
Quite a lot of ink has been spilt on this topic. My favourite reference
is the one by Rufflewind.
However, none of these references have a good stock of worked examples
illustrating the difference. So here, I catalogue the explicit computations
that contrast forward mode AD with reverse mode AD.
In general, in forward mode AD, we fix how much the inputs wiggle with
respect to a parameter t. We figure out how much the output wiggles
with respect to t. If $output = f(input_1, input_2, \dots, input_n)$, then
$$\frac{\partial\, output}{\partial t} = \sum_i \frac{\partial f}{\partial input_i}\,\frac{\partial input_i}{\partial t}.$$
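Concretely, forward mode is often implemented with dual numbers: every value
carries its wiggle with respect to t alongside it. Here is a minimal Python
sketch (the `Dual` class and the example function are my own illustration,
not any particular library's API):

```python
class Dual:
    """A value together with its wiggle ∂val/∂t w.r.t. a parameter t."""
    def __init__(self, val, dot):
        self.val = val  # primal value
        self.dot = dot  # ∂val/∂t

    def __add__(self, other):
        # Sum rule: the wiggles add.
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: ∂(xy)/∂t = y·∂x/∂t + x·∂y/∂t.
        return Dual(self.val * other.val,
                    other.val * self.dot + self.val * other.dot)

# Derivative of f(x, y) = x*y + x at (3, 4) w.r.t. x: seed t = x,
# i.e. ∂x/∂t = 1, ∂y/∂t = 0.
x, y = Dual(3.0, 1.0), Dual(4.0, 0.0)
f = x * y + x
print(f.val, f.dot)  # 15.0 5.0, since ∂f/∂x = y + 1 = 5
```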
In reverse mode AD, we fix how much the parameter t wiggles with
respect to the output. We figure out how much the parameter t
wiggles with respect to the inputs.
If $output_i = f_i(input, \dots)$, then
$$\frac{\partial t}{\partial input} = \sum_i \frac{\partial t}{\partial output_i}\,\frac{\partial f_i}{\partial input}.$$
This is a much messier expression, since we need to accumulate the data
over all outputs.
Essentially, deriving output from input is easy, since how to compute an output
from an input is documented in one place. Deriving input from output is
annoying, since many outputs can depend on a single input.
The upshot is that if we have few "root outputs" (like a loss function),
we need to run AD only once with respect to this output, and we get the
wiggles of all the inputs at the same time, since we compute the wiggles
from output to input.
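Here is a matching minimal reverse-mode sketch in Python (again my own
illustration; `Var` and `backprop` are hypothetical names, not a real
library's API). The `+=` in each rule is exactly the accumulation over
outputs from the equation above, and one backward pass from the root output
fills in the gradient of every input:

```python
class Var:
    """A node that accumulates ∂t/∂self over every output it feeds into."""
    def __init__(self, val):
        self.val = val
        self.grad = 0.0               # accumulated sensitivity
        self._backward = lambda: None
        self._parents = []

    def __add__(self, other):
        out = Var(self.val + other.val)
        out._parents = [self, other]
        def _backward():
            self.grad += out.grad     # ∂(x+y)/∂x = 1
            other.grad += out.grad    # ∂(x+y)/∂y = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Var(self.val * other.val)
        out._parents = [self, other]
        def _backward():
            self.grad += other.val * out.grad  # ∂(xy)/∂x = y
            other.grad += self.val * out.grad  # ∂(xy)/∂y = x
        out._backward = _backward
        return out

def backprop(root):
    """Push sensitivities from the root output back to all inputs."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p in v._parents:
                visit(p)
            order.append(v)
    visit(root)
    root.grad = 1.0                   # seed t = root, so ∂t/∂root = 1
    for v in reversed(order):
        v._backward()

x, y = Var(3.0), Var(4.0)
z = x * y + x
backprop(z)
print(x.grad, y.grad)  # 5.0 3.0: both gradients from a single pass
```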
The first example, z = max(x, y),
captures the essential difference
between the two approaches succinctly. Study this, and everything else will make
sense.
§ Maximum: z = max(x, y)
$$
\begin{aligned}
z &= \max(x, y) \\
\frac{\partial x}{\partial t} &= ? \\
\frac{\partial y}{\partial t} &= ? \\
\frac{\partial z}{\partial t} &=
  \begin{cases}
    \frac{\partial x}{\partial t} & \text{if } x > y \\
    \frac{\partial y}{\partial t} & \text{otherwise}
  \end{cases}
\end{aligned}
$$
We can compute $\partial z/\partial x$ by setting t = x,
that is, $\partial x/\partial t = 1, \partial y/\partial t = 0$.
Similarly, we can compute $\partial z/\partial y$ by setting t = y,
that is, $\partial x/\partial t = 0, \partial y/\partial t = 1$.
If we want both gradients $\partial z/\partial x$ and $\partial z/\partial y$,
we have to run the above equations twice, once with each initialization.
In our equations, we are saying that we know how sensitive
the inputs x,y are to a given parameter t. We are deriving how sensitive
the output z is to the parameter t as a composition of x,y. If
x>y, then we know that z is as sensitive to t as x is.
$$
\begin{aligned}
z &= \max(x, y) \\
\frac{\partial t}{\partial z} &= ? \\
\frac{\partial t}{\partial x} &=
  \begin{cases}
    \frac{\partial t}{\partial z} & \text{if } x > y \\
    0 & \text{otherwise}
  \end{cases} \\
\frac{\partial t}{\partial y} &=
  \begin{cases}
    \frac{\partial t}{\partial z} & \text{if } y > x \\
    0 & \text{otherwise}
  \end{cases}
\end{aligned}
$$
We can compute $\partial z/\partial x, \partial z/\partial y$
in one shot by setting t = z. That is, $\partial t/\partial z = 1$.
In our equations, we are saying that we know how sensitive
the parameter t is to a given output z. We are trying to see
how sensitive t is to the inputs x, y. If x is active (i.e., x > y),
then t is indeed sensitive to x, and $\partial t/\partial x = 1$.
Otherwise, it is not sensitive, and $\partial t/\partial x = 0$.
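As a Python sketch (the function names are mine), the two rules look like
this. Note that the forward version must be seeded twice to recover both
gradients, while the reverse version returns both in one call:

```python
def max_forward(x, y, dx, dy):
    # Given ∂x/∂t and ∂y/∂t, return (z, ∂z/∂t).
    return (max(x, y), dx if x > y else dy)

def max_reverse(x, y, dz):
    # Given ∂t/∂z, return (∂t/∂x, ∂t/∂y): only the active input is sensitive.
    return (dz, 0.0) if x > y else (0.0, dz)

print(max_forward(3.0, 2.0, 1.0, 0.0))  # seed t = x: (3.0, 1.0)
print(max_forward(3.0, 2.0, 0.0, 1.0))  # seed t = y: (3.0, 0.0)
print(max_reverse(3.0, 2.0, 1.0))       # seed t = z: (1.0, 0.0) in one shot
```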
§ sin: z = sin(x)
$$
\begin{aligned}
z &= \sin(x) \\
\frac{\partial x}{\partial t} &= ? \\
\frac{\partial z}{\partial t} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} = \cos(x)\,\frac{\partial x}{\partial t}
\end{aligned}
$$
We can compute $\partial z/\partial x$ by setting t = x,
that is, by setting $\partial x/\partial t = 1$.
$$
\begin{aligned}
z &= \sin(x) \\
\frac{\partial t}{\partial z} &= ? \\
\frac{\partial t}{\partial x} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial x} = \frac{\partial t}{\partial z}\cos(x)
\end{aligned}
$$
We can compute $\partial z/\partial x$ by setting t = z,
that is, by setting $\partial t/\partial z = 1$.
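A Python sketch of both rules (function names mine):

```python
import math

def sin_forward(x, dx):
    # Given ∂x/∂t, return (z, ∂z/∂t) = (sin x, cos(x)·∂x/∂t).
    return math.sin(x), math.cos(x) * dx

def sin_reverse(x, dz):
    # Given ∂t/∂z, return ∂t/∂x = ∂t/∂z · cos(x).
    return dz * math.cos(x)

print(sin_forward(0.5, 1.0)[1])  # seed t = x
print(sin_reverse(0.5, 1.0))     # seed t = z; both print cos(0.5) ≈ 0.87758
```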
§ addition: z = x + y
$$
\begin{aligned}
z &= x + y \\
\frac{\partial x}{\partial t} &= ? \\
\frac{\partial y}{\partial t} &= ? \\
\frac{\partial z}{\partial t} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t} = 1 \cdot \frac{\partial x}{\partial t} + 1 \cdot \frac{\partial y}{\partial t} = \frac{\partial x}{\partial t} + \frac{\partial y}{\partial t}
\end{aligned}
$$
$$
\begin{aligned}
z &= x + y \\
\frac{\partial t}{\partial z} &= ? \\
\frac{\partial t}{\partial x} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial x} = \frac{\partial t}{\partial z}\cdot 1 = \frac{\partial t}{\partial z} \\
\frac{\partial t}{\partial y} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial y} = \frac{\partial t}{\partial z}\cdot 1 = \frac{\partial t}{\partial z}
\end{aligned}
$$
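A Python sketch (function names mine); note that addition's derivative rules
never need the primal values x and y:

```python
def add_forward(x, y, dx, dy):
    # Given ∂x/∂t and ∂y/∂t, return (z, ∂z/∂t) = (x + y, ∂x/∂t + ∂y/∂t).
    return x + y, dx + dy

def add_reverse(dz):
    # The sensitivity ∂t/∂z flows through to both inputs unchanged.
    return dz, dz
```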
§ multiplication: z = xy
$$
\begin{aligned}
z &= xy \\
\frac{\partial x}{\partial t} &= ? \\
\frac{\partial y}{\partial t} &= ? \\
\frac{\partial z}{\partial t} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t} = y\,\frac{\partial x}{\partial t} + x\,\frac{\partial y}{\partial t}
\end{aligned}
$$
$$
\begin{aligned}
z &= xy \\
\frac{\partial t}{\partial z} &= ? \\
\frac{\partial t}{\partial x} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial x} = \frac{\partial t}{\partial z}\cdot y \\
\frac{\partial t}{\partial y} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial y} = \frac{\partial t}{\partial z}\cdot x
\end{aligned}
$$
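A Python sketch (function names mine). Unlike addition, the reverse rule for
multiplication needs the primal values, which is why reverse-mode
implementations must remember intermediate values from the forward pass:

```python
def mul_forward(x, y, dx, dy):
    # Given ∂x/∂t and ∂y/∂t, return (z, ∂z/∂t) = (xy, y·∂x/∂t + x·∂y/∂t).
    return x * y, y * dx + x * dy

def mul_reverse(x, y, dz):
    # (∂t/∂x, ∂t/∂y) = (∂t/∂z · y, ∂t/∂z · x): each input's sensitivity
    # depends on the *other* input's value.
    return dz * y, dz * x
```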
§ subtraction: z = x - y
$$
\begin{aligned}
z &= x - y \\
\frac{\partial x}{\partial t} &= ? \\
\frac{\partial y}{\partial t} &= ? \\
\frac{\partial z}{\partial t} &= \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t} = 1 \cdot \frac{\partial x}{\partial t} - 1 \cdot \frac{\partial y}{\partial t} = \frac{\partial x}{\partial t} - \frac{\partial y}{\partial t}
\end{aligned}
$$
$$
\begin{aligned}
z &= x - y \\
\frac{\partial t}{\partial z} &= ? \\
\frac{\partial t}{\partial x} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial x} = \frac{\partial t}{\partial z}\cdot 1 = \frac{\partial t}{\partial z} \\
\frac{\partial t}{\partial y} &= \frac{\partial t}{\partial z}\frac{\partial z}{\partial y} = \frac{\partial t}{\partial z}\cdot(-1) = -\frac{\partial t}{\partial z}
\end{aligned}
$$
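A Python sketch (function names mine); subtraction is just addition with a
sign flip on y's rule:

```python
def sub_forward(x, y, dx, dy):
    # Given ∂x/∂t and ∂y/∂t, return (z, ∂z/∂t) = (x − y, ∂x/∂t − ∂y/∂t).
    return x - y, dx - dy

def sub_reverse(dz):
    # ∂t/∂x = ∂t/∂z, while ∂t/∂y = −∂t/∂z.
    return dz, -dz
```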