§ Differentiating through sampling from a random normal distribution
- Credits to Edward Eriksson for teaching me this.
- The key idea is that a sample from a normal distribution with mean μ and standard deviation σ can be written as a deterministic function of a sample from the standard normal distribution, so the randomness itself can be treated as a fixed input.
- Taking μ = 0 for simplicity and applying an objective f to the sample: y = f(σz) where z ∼ N(0,1).
- Then, treating z as a constant, the chain rule gives dy/dσ = f′(σz)⋅z.
- That is, we hold z "constant" and use this derivative to minimize over σ by gradient descent.
- My belief in this remains open until I can read a textbook, but I have it on good authority that this is correct; a finite-difference sanity check (sketched after the demo below) supports it.
- How does this relate to the VAE optimisation? It's the same trick (the reparameterization trick), where we claim that sample(N(0,1)) can be held constant during backprop, as if the internal structure of the sample function did not matter. Amazing. A PyTorch sketch of this appears at the end.
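Below is a minimal numpy demo of the derivative above: each iteration samples a fresh z, computes the pathwise gradient f′(σz)⋅z, and takes a small gradient descent step on σ.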
import numpy as np

sigma = 1.0

def f(x): return np.sin(x + 0.1)       # toy objective applied to the sample
def fprime(x): return np.cos(x + 0.1)  # its derivative

for i in range(1000):
    # Reparameterize: a sample from N(0, sigma^2) is sigma * z with z ~ N(0, 1).
    z = np.random.normal(0, 1)
    sz = sigma * z
    fx = f(sz)
    gradfx = fprime(sz)
    # Chain rule with z held constant: d f(sigma * z) / d sigma = f'(sigma * z) * z.
    dsigma = gradfx * z
    print("z = %5.2f | f = %6.2f | df = %6.2f | sigma = %6.2f | dsigma = %6.2f" %
          (z, fx, gradfx, sigma, dsigma))
    # Stochastic gradient descent step on sigma.
    sigma = sigma - 0.01 * dsigma
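To back up the "treat z as constant" claim numerically, here is a sketch of my own sanity check (not from the original derivation): it compares the average pathwise gradient E[f′(σz)⋅z] against a central finite difference of the sample average of f(σz) with respect to σ, reusing the same z draws so the two estimates are directly comparable.

import numpy as np

rng = np.random.default_rng(0)

def f(x): return np.sin(x + 0.1)
def fprime(x): return np.cos(x + 0.1)

sigma, eps, n = 1.0, 1e-4, 1_000_000
z = rng.standard_normal(n)

# Pathwise (reparameterized) estimate of d/dsigma E[f(sigma * z)].
pathwise = np.mean(fprime(sigma * z) * z)

# Central finite difference of the sample mean of f(sigma * z), same z draws.
fd = (np.mean(f((sigma + eps) * z)) - np.mean(f((sigma - eps) * z))) / (2 * eps)

print("pathwise estimate: %.6f | finite difference: %.6f" % (pathwise, fd))

Because both estimates use the same z, they agree to within the O(eps^2) error of the finite difference, which is exactly what the chain-rule argument predicts.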
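And here is a sketch of how the same trick shows up in a VAE-style setting, assuming PyTorch is available (the names mu, sigma, z and the toy loss are my own illustration, not from the original note): autograd happily differentiates through mu + sigma * z, because the random draw z enters the graph as a plain constant.

import torch

mu = torch.tensor(0.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

z = torch.randn(())          # drawn outside the graph; a constant to autograd
sample = mu + sigma * z      # reparameterized sample from N(mu, sigma^2)

loss = torch.sin(sample + 0.1)   # same toy objective as the demo above
loss.backward()

# Autograd reproduces the hand-derived gradients:
#   d loss / d mu    = f'(mu + sigma * z)
#   d loss / d sigma = f'(mu + sigma * z) * z
print("d/dmu    =", mu.grad.item())
print("d/dsigma =", sigma.grad.item(),
      " f'(sample)*z =", (torch.cos(sample + 0.1) * z).item())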