Gaussian Process limits of Bayesian Neural Networks

Bayesian Machine Learning
Gaussian Processes
Bayesian Neural Networks

Neal (2012) showed that Bayesian Neural Networks (BNNs) with a single hidden layer and independent Gaussian priors on the weights, \(w_i\sim\mathcal{N}(0,\sigma^2)\), converge to Gaussian Processes (GPs) as the width of the hidden layer tends to infinity.

Original results from R. Neal

For now, we define our neural network with a single output as follows: \[ \begin{aligned} f(x)&=b+\sum_{j=1}^H v_j h_j(x) \\ h_j(x)&=\tanh \left(a_j+\sum_{i=1}^I u_{i j} x_i\right) \end{aligned} \] where \(H\) is the number of hidden units and \(I\) is the input dimension. (Neal allows for multiple outputs as well; we'll stick to a single output here.)

Presume independent Gaussian priors over the weights and biases: \[ \begin{aligned} v_j &\sim \mathcal{N}\left(0, \sigma_v^2\right) \\ b &\sim \mathcal{N}\left(0, \sigma_b^2\right) \\ a_j &\sim \mathcal{N}\left(0, \sigma_a^2\right) \\ u_{i j} &\sim \mathcal{N}\left(0, \sigma_u^2\right) \end{aligned} \]
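To make the setup concrete, here is a minimal NumPy sketch that draws functions from this prior for a one-dimensional input. The width \(H\), the prior scales, and the function name are illustrative choices of mine, not taken from Neal's text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not prescribed by the text): width H and unit prior scales.
H, sigma_b, sigma_v, sigma_a, sigma_u = 50, 1.0, 1.0, 1.0, 1.0


def sample_prior_function(x, rng):
    """Draw one weight configuration from the prior and evaluate f on the grid x."""
    b = rng.normal(0.0, sigma_b)
    v = rng.normal(0.0, sigma_v, size=H)
    a = rng.normal(0.0, sigma_a, size=H)
    u = rng.normal(0.0, sigma_u, size=H)   # single input dimension (I = 1)
    h = np.tanh(a + np.outer(x, u))        # hidden activations, shape (len(x), H)
    return b + h @ v                       # f(x) = b + sum_j v_j h_j(x)


x = np.linspace(-3.0, 3.0, 200)
prior_draws = np.stack([sample_prior_function(x, rng) for _ in range(5)])
```

Each row of `prior_draws` is one random function drawn from the prior; the derivations below characterize the distribution of such draws.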

For a given input \(x\), the prior mean of \(f(x)\) is

\[ \begin{aligned} \mathbb { E }[f(x)] & =\mathbb{E}\left[b+\sum_{j=1}^H v_j h_j(x)\right] \\ & =\underbrace{\mathbb{E}[b]}_{=0}+\sum_{j=1}^H \mathbb{E}\left[v_j h_j(x)\right] \\ & =\sum_{j=1}^H \underbrace{\mathbb{E}\left[v_j\right]}_{=0} \mathbb{E}\left[h_j(x)\right]\left(\text{ by independence of priors }\right) \\ & =0 \end{aligned} \]

Since the means of \(f(x)\) and the weight priors are zero, we can calculate variances via second moments: \[ \begin{aligned} \operatorname{Var}(f(x)) & =\operatorname{Var}\left(b+\sum_{j=1}^H v_j h_j(x)\right) \\ & =\underbrace{\operatorname{Var}(b)}_{=\sigma_b^2}+\sum_{j=1}^H \operatorname{Var}\left(v_j h_j(x)\right) \\ & =\sigma_b^2+\sum_{j=1}^H \mathbb{E}\left[v_j^2\right] \underbrace{\mathbb{E}\left[h_j(x)^2\right]}_{:=V(x)} \\ & =\sigma_b^2+\sum_{j=1}^H \sigma_v^2 V(x) \\ & =\sigma_b^2+H \sigma_v^2 V(x) \end{aligned} \]

Note that \(\mathbb{E}\left[h_j(x)^2\right]\) is finite because \(\tanh\) is bounded, so \(V(x)\leq 1\).
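As a sanity check, the identity \(\operatorname{Var}(f(x))=\sigma_b^2+H\sigma_v^2 V(x)\) can be verified by Monte Carlo. The sketch below reuses the illustrative hyperparameters from the earlier snippet; the helper name and sample sizes are mine:

```python
import numpy as np

H, sigma_b, sigma_v, sigma_a, sigma_u = 50, 1.0, 1.0, 1.0, 1.0


def variance_check(x0, n_samples=100_000, seed=1):
    """Monte Carlo estimate of Var(f(x0)) vs. sigma_b^2 + H * sigma_v^2 * V(x0)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, sigma_a, size=(n_samples, H))
    u = rng.normal(0.0, sigma_u, size=(n_samples, H))
    v = rng.normal(0.0, sigma_v, size=(n_samples, H))
    b = rng.normal(0.0, sigma_b, size=n_samples)
    h = np.tanh(a + u * x0)                # hidden activations, one row per prior draw
    f = b + np.sum(v * h, axis=1)          # f(x0) under each prior draw
    V_x0 = np.mean(h**2)                   # Monte Carlo estimate of V(x0) = E[h_j(x0)^2]
    return f.var(), sigma_b**2 + H * sigma_v**2 * V_x0


print(variance_check(0.5))                 # the two numbers should agree closely
```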

Using similar reasoning, we can derive the covariance for two distinct inputs, \(x_1\), \(x_2\):

\[ \begin{aligned} \operatorname{Cov}\left(f\left(x_1\right), f\left(x_2\right)\right)&= \mathbb{E}\left[f\left(x_1\right) f\left(x_2\right)\right]-\underbrace{\mathbb{E}\left[f\left(x_1\right)\right]}_{=0} \underbrace{\mathbb{E}\left[f\left(x_2\right)\right]}_{=0} \\ &=\mathbb{E}\left[\left(b+\sum_{j=1}^H v_j h_j\left(x_1\right)\right) \cdot\left(b+\sum_{i=1}^H v_i h_i\left(x_2\right)\right)\right] \\ &=\mathbb{E}\left[b^2\right]+\mathbb{E}\left[b \sum_{j=1}^H v_j h_j\left(x_1\right)\right]+\mathbb{E}\left[b \sum_{i=1}^H v_i h_i\left(x_2\right)\right] \\ &\quad+\mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right) \sum_{i=1}^H v_i h_i\left(x_2\right)\right]\\ & =\sigma_b^2+\underbrace{\mathbb{E}\left[b\right]}_{=0} \mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right)\right]+\underbrace{\mathbb{E}\left[b\right]}_{=0} \mathbb{E}\left[\sum_{i=1}^{H} v_{i} h_i\left(x_2\right)\right] \\ & \quad+\mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right) \sum_{i=1}^H v_i h_i\left(x_2\right)\right] \\ & =\sigma_b^2+\mathbb{E}\left[\sum_{j=1}^H v_j^2 h_j\left(x_1\right) h_j\left(x_2\right)\right] +\underbrace{\mathbb{E}\left[\sum_{i \neq j} v_j v_i h_j\left(x_1\right) h_i\left(x_2\right)\right]}_{=0 \text{ (independence)}} \\ & =\sigma_b^2+\sum_{j=1}^H \mathbb{E}\left[v_j^2\right] \mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right] \text { (independence) } \\ & =\sigma_b^2+\sum_{j=1}^H \sigma_v^2 \underbrace{\mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right]}_{:=C\left(x_1, x_2\right)\ \forall j}\\ &=\sigma_b^2+H \sigma_v^2 C\left(x_1, x_2\right), \end{aligned} \]

where \(\mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right]=C\left(x_1, x_2\right)\) is the same for all \(j\), because the hidden-unit parameters \((a_j, u_{ij})\) are i.i.d. across units and therefore the \(h_j\) are identically distributed.
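The covariance identity can be checked by Monte Carlo in the same way; the helper below and its settings are again my own illustrative choices:

```python
import numpy as np

H, sigma_b, sigma_v, sigma_a, sigma_u = 50, 1.0, 1.0, 1.0, 1.0


def covariance_check(x1, x2, n_samples=100_000, seed=2):
    """Monte Carlo estimate of Cov(f(x1), f(x2)) vs. sigma_b^2 + H * sigma_v^2 * C(x1, x2)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, sigma_a, size=(n_samples, H))
    u = rng.normal(0.0, sigma_u, size=(n_samples, H))
    v = rng.normal(0.0, sigma_v, size=(n_samples, H))
    b = rng.normal(0.0, sigma_b, size=n_samples)
    h1, h2 = np.tanh(a + u * x1), np.tanh(a + u * x2)
    f1, f2 = b + np.sum(v * h1, axis=1), b + np.sum(v * h2, axis=1)
    C = np.mean(h1 * h2)                   # Monte Carlo estimate of C(x1, x2)
    empirical_cov = np.mean((f1 - f1.mean()) * (f2 - f2.mean()))
    return empirical_cov, sigma_b**2 + H * sigma_v**2 * C


print(covariance_check(0.5, 1.0))          # the two numbers should agree closely
```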

By scaling the output-weight prior with the network width, \(\sigma_v=\frac{\omega_v}{\sqrt{H}}\), the sum over hidden units converges to a Gaussian distribution in the limit \(H\rightarrow\infty\) by the Central Limit Theorem. Applying the same argument to any finite collection of inputs shows that their joint prior becomes Gaussian as well, i.e. \(f\) converges to a Gaussian Process with

\[\operatorname{Var}(f(x))=\sigma_b^2+H \sigma_v^2 V(x)\rightarrow \sigma_b^2+\omega_v^2 V(x)\]

\[\operatorname{Cov}(f(x_1),f(x_2))=\sigma_b^2+H \sigma_v^2 C\left(x_1, x_2\right)\rightarrow \sigma_b^2+ \omega_v^2 C\left(x_1, x_2\right)\]
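A small experiment illustrates the limit: under the scaled prior \(\sigma_v=\omega_v/\sqrt{H}\), the variance of \(f(x)\) stabilises and the marginal distribution becomes increasingly Gaussian (excess kurtosis tends to zero) as \(H\) grows. The widths, sample size and \(\omega_v\) below are illustrative choices:

```python
import numpy as np

sigma_b, sigma_a, sigma_u, omega_v = 1.0, 1.0, 1.0, 1.0


def scaled_prior_samples(x0, H, n_samples=10_000, seed=3):
    """Samples of f(x0) under the width-scaled prior sigma_v = omega_v / sqrt(H)."""
    rng = np.random.default_rng(seed)
    sigma_v = omega_v / np.sqrt(H)
    a = rng.normal(0.0, sigma_a, size=(n_samples, H))
    u = rng.normal(0.0, sigma_u, size=(n_samples, H))
    v = rng.normal(0.0, sigma_v, size=(n_samples, H))
    b = rng.normal(0.0, sigma_b, size=n_samples)
    return b + np.sum(v * np.tanh(a + u * x0), axis=1)


for width in (1, 10, 100, 1000):
    f = scaled_prior_samples(0.5, width)
    excess_kurtosis = np.mean((f - f.mean())**4) / f.var()**2 - 3.0
    # variance stays near sigma_b^2 + omega_v^2 V(x0); excess kurtosis approaches 0
    print(width, f.var(), excess_kurtosis)
```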

Notice that we can steer the covariance function of the limiting GP through our choice of the hidden activation \(h(\cdot)\). We can write

\[ \begin{aligned} C\left(x_1, x_2\right)&= \mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right] \\ &= \mathbb{E}\left[\frac { 1 } { 2 } \left\{h_j\left(x_1\right) h_j\left(x_2\right)+h_j\left(x_1\right) h_j\left(x_2\right)+h_j\left(x_1\right)^2-h_j\left(x_1\right)^2\right.\right. \\ & \left.\left. \quad+h_j\left(x_2\right)^2-h_j\left(x_2\right)^2\right\}\right] \\ &= \mathbb{E}\left[\frac{1}{2}\left\{h_j\left(x_1\right)^2+h_j\left(x_2\right)^2-\left(h_j\left(x_1\right)-h_j\left(x_2\right)\right)^2\right\}\right] \\ &= \frac{1}{2}\Big\{\underbrace{\mathbb{E}\left[h_j\left(x_1\right)^2\right]}_{=V\left(x_1\right)}+\underbrace{\mathbb{E}\left[h_j\left(x_2\right)^2\right]}_{=V\left(x_2\right)}-\underbrace{\mathbb{E}\left[\left(h_j\left(x_1\right)-h_j\left(x_2\right)\right)^2\right]}_{:=D\left(x_1, x_2\right)}\Big\} \\ &= \frac{1}{2}\left\{V\left(x_1\right)+V\left(x_2\right)-D\left(x_1, x_2\right)\right\} \end{aligned} \]

If \(x_1\) and \(x_2\) are reasonably close to each other, we have \(V(x_1)\approx V(x_2):=V\) and

\[C\left(x_1, x_2\right)\approx V-\frac{1}{2}D(x_1,x_2)\]

Then the behavior of the GP for nearby inputs is determined mostly by \(D(x_1,x_2)=\mathbb{E}\left[\left(h_j(x_1)-h_j(x_2)\right)^2\right]\), the expected squared distance between the hidden-unit activations at \(x_1\) and \(x_2\).
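The quantities \(V\), \(C\) and \(D\) have no convenient closed form for \(\tanh\) units, but they are easy to estimate by Monte Carlo over the hidden-unit prior. The helper name and settings below are mine; the last line checks the approximation \(C\approx V-\tfrac{1}{2}D\) for two nearby inputs:

```python
import numpy as np

sigma_a, sigma_u = 1.0, 1.0


def kernel_terms(x1, x2, n_samples=200_000, seed=4):
    """Monte Carlo estimates of V(x1), V(x2), C(x1, x2) and D(x1, x2) for tanh units."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, sigma_a, size=n_samples)   # one (a_j, u_j) draw per sample
    u = rng.normal(0.0, sigma_u, size=n_samples)
    h1, h2 = np.tanh(a + u * x1), np.tanh(a + u * x2)
    V1, V2 = np.mean(h1**2), np.mean(h2**2)
    C = np.mean(h1 * h2)
    D = np.mean((h1 - h2)**2)
    return V1, V2, C, D


V1, V2, C, D = kernel_terms(0.5, 0.6)
print(C, 0.5 * (V1 + V2) - 0.5 * D)    # equal up to floating-point rounding
print(V1 - 0.5 * D)                    # the approximation C ~ V - D/2 for nearby inputs
```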

References

Neal, Radford M. 2012. Bayesian Learning for Neural Networks. Vol. 118. Springer Science & Business Media.