# Gaussian Process limits of Bayesian Neural Networks

*Tags: Bayesian Machine Learning, Gaussian Processes, Bayesian Neural Networks*

Neal (2012) showed that Bayesian Neural Networks (BNNs) with independent Gaussian weight priors $$w_i\sim\mathcal{N}(0,\sigma^2)$$ and a single hidden layer converge to Gaussian Processes (GPs) in the limit of an infinitely wide hidden layer.

## Original results from R. Neal

For now, we define our neural network as follows: $$\begin{aligned} & f(x)=b+\sum_{j=1}^H v_j h_j(x) \\ & h_j(x)=\tanh \left(a_j+\sum_{i=1}^I u_{i j} x_i\right) \end{aligned}$$ where $$I$$ is the input dimension. (Neal allows for multiple outputs as well - we'll stick to one-dimensional output for now.)

Presume independent Gaussian priors over weights and biases: $$\begin{aligned} v_j &\sim \mathcal{N}\left(0, \sigma_v^2\right) \\ b &\sim \mathcal{N}\left(0, \sigma_b^2\right) \\ a_j &\sim \mathcal{N}\left(0, \sigma_a^2\right) \\ u_{i j} &\sim \mathcal{N}\left(0, \sigma_u^2\right) \end{aligned}$$
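
To make this concrete, here is a minimal NumPy sketch that draws one function from this prior; the helper name `sample_f` and the default hyperparameter values are illustrative assumptions, not notation from Neal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_f(x, H, sigma_v=1.0, sigma_b=1.0, sigma_a=1.0, sigma_u=1.0):
    """Draw one function from the prior and evaluate it at inputs x of shape (n, I)."""
    n, I = x.shape
    u = rng.normal(0.0, sigma_u, size=(I, H))  # input-to-hidden weights u_ij
    a = rng.normal(0.0, sigma_a, size=H)       # hidden biases a_j
    v = rng.normal(0.0, sigma_v, size=H)       # hidden-to-output weights v_j
    b = rng.normal(0.0, sigma_b)               # output bias b
    h = np.tanh(a + x @ u)                     # hidden activations h_j(x), shape (n, H)
    return b + h @ v                           # f(x), shape (n,)
```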

For a given $$x$$, we get

$$\begin{aligned} \mathbb{E}[f(x)] & =\mathbb{E}\left[b+\sum_{j=1}^H v_j h_j(x)\right] \\ & =\underbrace{\mathbb{E}[b]}_{=0}+\sum_{j=1}^H \mathbb{E}\left[v_j h_j(x)\right] \\ & =\sum_{j=1}^H \underbrace{\mathbb{E}\left[v_j\right]}_{=0} \mathbb{E}\left[h_j(x)\right] \quad (\text{by independence of priors}) \\ & =0 \end{aligned}$$

Since the means of $$f(x)$$ and the weight priors are zero, we can calculate variances via second moments: $$\begin{aligned} \operatorname{Var}(f(x)) & =\operatorname{Var}\left(b+\sum_{j=1}^H v_j h_j(x)\right) \\ & =\underbrace{\operatorname{Var}(b)}_{=\sigma_b^2}+\sum_{j=1}^H \operatorname{Var}\left(v_j h_j(x)\right) \\ & =\sigma_b^2+\sum_{j=1}^H \mathbb{E}\left[v_j^2\right] \underbrace{\mathbb{E}\left[h_j(x)^2\right]}_{:=V(x)} \\ & =\sigma_b^2+\sum_{j=1}^H \sigma_v^2 V(x) \\ & =\sigma_b^2+H \sigma_v^2 V(x) \end{aligned}$$

Note that $$\mathbb{E}\left[h_j(x)^2\right]$$ is finite because $$\tanh$$ is bounded, so $$V(x)\leq 1$$.
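
As a sanity check, the following Monte Carlo sketch (continuing the snippet above, with arbitrary hyperparameter values) compares the empirical prior variance against $$\sigma_b^2+H\sigma_v^2 V(x)$$:

```python
# Empirical variance of f(x) over many prior draws vs. the formula
# sigma_b^2 + H * sigma_v^2 * V(x), with V(x) itself estimated by simulation.
x = np.array([[0.5]])  # a single one-dimensional input
H, sigma_v, sigma_b = 50, 0.3, 1.0
draws = np.array([sample_f(x, H, sigma_v=sigma_v, sigma_b=sigma_b)[0]
                  for _ in range(20_000)])
# V(x) = E[tanh(a + u x)^2] with a, u standard normal (sigma_a = sigma_u = 1).
h = np.tanh(rng.normal(size=20_000) + rng.normal(size=20_000) * x[0, 0])
print(draws.var(), sigma_b**2 + H * sigma_v**2 * np.mean(h**2))  # roughly equal
```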

Using similar reasoning, we can derive the covariance for two distinct inputs, $$x_1$$, $$x_2$$:

$$\begin{aligned} \operatorname{Cov}\left(f\left(x_1\right), f\left(x_2\right)\right)&= \mathbb{E}\left[f\left(x_1\right) f\left(x_2\right)\right]-\underbrace{\mathbb{E}\left[f\left(x_1\right)\right]}_{=0} \underbrace{\mathbb{E}\left[f\left(x_2\right)\right]}_{=0} \\ &=\mathbb{E}\left[\left(b+\sum_{j=1}^H v_j h_j\left(x_1\right)\right) \cdot\left(b+\sum_{k=1}^H v_k h_k\left(x_2\right)\right)\right] \\ &=\mathbb{E}\left[b^2\right]+\mathbb{E}\left[b \sum_{j=1}^H v_j h_j\left(x_1\right)\right]+\mathbb{E}\left[b \sum_{k=1}^H v_k h_k\left(x_2\right)\right] \\ &\quad+\mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right) \sum_{k=1}^H v_k h_k\left(x_2\right)\right]\\ & =\sigma_b^2+\underbrace{\mathbb{E}\left[b\right]}_{=0} \mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right)\right]+\underbrace{\mathbb{E}\left[b\right]}_{=0} \mathbb{E}\left[\sum_{k=1}^{H} v_{k} h_k\left(x_2\right)\right] \\ & \quad+\mathbb{E}\left[\sum_{j=1}^H v_j h_j\left(x_1\right) \sum_{k=1}^H v_k h_k\left(x_2\right)\right] \\ & =\sigma_b^2+\mathbb{E}\left[\sum_{j=1}^H v_j^2 h_j\left(x_1\right) h_j\left(x_2\right)\right] +\underbrace{\mathbb{E}\left[\sum_{j \neq k} v_j v_k h_j\left(x_1\right) h_k\left(x_2\right)\right]}_{=0 \text{ (independence)}} \\ & =\sigma_b^2+\sum_{j=1}^H \mathbb{E}\left[v_j^2\right] \mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right] \quad (\text{independence}) \\ & =\sigma_b^2+\sum_{j=1}^H \sigma_v^2 \underbrace{\mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right]}_{:=C\left(x_1, x_2\right)\ \forall j}\\ &=\sigma_b^2+H \sigma_v^2 C\left(x_1, x_2\right), \end{aligned}$$

where $$\mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right]=C\left(x_1, x_2\right)$$ is the same for all $$j$$ because the input-to-hidden weights and biases $$(a_j, u_{ij})$$ are i.i.d. across hidden units.
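
The covariance formula admits the same kind of check (again a sketch, reusing the variables from the previous snippet):

```python
# Empirical covariance of (f(x1), f(x2)) over prior draws vs. the formula
# sigma_b^2 + H * sigma_v^2 * C(x1, x2).
x12 = np.array([[0.2], [0.7]])  # two scalar inputs x1 and x2
F = np.stack([sample_f(x12, H, sigma_v=sigma_v, sigma_b=sigma_b)
              for _ in range(20_000)])  # shape (20000, 2)
a = rng.normal(size=20_000)
u = rng.normal(size=20_000)
C_hat = np.mean(np.tanh(a + u * 0.2) * np.tanh(a + u * 0.7))  # C(x1, x2)
print(np.cov(F.T)[0, 1], sigma_b**2 + H * sigma_v**2 * C_hat)  # roughly equal
```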

Scaling the output weights as $$\sigma_v=\frac{\omega_v}{\sqrt{H}}$$ (so that $$H\sigma_v^2=\omega_v^2$$), the Central Limit Theorem yields a proper Gaussian distribution in the limit $$H\rightarrow\infty$$, where

$$\operatorname{Var}(f(x))=\sigma_b^2+H \sigma_v^2 V(x)\rightarrow \sigma_b^2+\omega_v^2 V(x)$$

$$\operatorname{Cov}(f(x_1),f(x_2))=\sigma_b^2+H \sigma_v^2 C\left(x_1, x_2\right)\rightarrow \sigma_b^2+\omega_v^2 C\left(x_1, x_2\right)$$
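
The effect of this scaling is easy to see numerically: fixing $$\omega_v$$ and letting $$H$$ grow, the prior variance of $$f(x)$$ stabilizes instead of growing linearly in $$H$$ (sketch, continuing from above):

```python
# With sigma_v = omega_v / sqrt(H), Var(f(x)) converges to
# sigma_b^2 + omega_v^2 * V(x) as H grows, and the marginal
# distribution of f(x) approaches a Gaussian by the CLT.
omega_v = 1.0
for H in (1, 10, 100, 1000):
    draws = np.array([sample_f(x, H, sigma_v=omega_v / np.sqrt(H))[0]
                      for _ in range(5_000)])
    print(H, draws.var())  # stabilizes around sigma_b^2 + omega_v^2 * V(x)
```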

Notice that we can steer the (co-)variance of the corresponding GP by our choice of $$h(\cdot)$$. We can write

$$\begin{aligned} C\left(x_1, x_2\right)&= \mathbb{E}\left[h_j\left(x_1\right) h_j\left(x_2\right)\right] \\ &= \mathbb{E}\left[\frac{1}{2}\left\{h_j\left(x_1\right)^2+h_j\left(x_2\right)^2-\left(h_j\left(x_1\right)-h_j\left(x_2\right)\right)^2\right\}\right] \\ &= \frac{1}{2}\Big\{\underbrace{\mathbb{E}\left[h_j\left(x_1\right)^2\right]}_{=V\left(x_1\right)}+\underbrace{\mathbb{E}\left[h_j\left(x_2\right)^2\right]}_{=V\left(x_2\right)}-\underbrace{\mathbb{E}\left[\left(h_j\left(x_1\right)-h_j\left(x_2\right)\right)^2\right]}_{:=D\left(x_1, x_2\right)}\Big\} \\ &= \frac{1}{2}\left\{V\left(x_1\right)+V\left(x_2\right)-D\left(x_1, x_2\right)\right\}, \end{aligned}$$ where the second line uses the identity $$2ab=a^2+b^2-(a-b)^2$$.

If $$x_1$$ and $$x_2$$ are reasonably close to each other, we have $$V(x_1)\approx V(x_2)=:V$$ and

$$C\left(x_1, x_2\right)\approx V-\frac{1}{2}D(x_1,x_2)$$

Then the covariance between nearby observations is governed mainly by $$D(x_1,x_2)=\mathbb{E}\left[\left(h_j(x_1)-h_j(x_2)\right)^2\right]$$, the expected squared distance between the hidden-unit activations at the two inputs.
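
Because the decomposition holds pointwise via $$2ab=a^2+b^2-(a-b)^2$$, a numerical check of $$C(x_1,x_2)=\frac{1}{2}\left\{V(x_1)+V(x_2)-D(x_1,x_2)\right\}$$ agrees up to floating-point error (sketch, continuing from above):

```python
# Verify C(x1, x2) = (V(x1) + V(x2) - D(x1, x2)) / 2 for tanh hidden units.
a = rng.normal(size=200_000)
u = rng.normal(size=200_000)
h1, h2 = np.tanh(a + u * 0.20), np.tanh(a + u * 0.25)
C = np.mean(h1 * h2)
V1, V2, D = np.mean(h1**2), np.mean(h2**2), np.mean((h1 - h2)**2)
print(C, 0.5 * (V1 + V2 - D))  # identical up to floating-point rounding
```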

## References

Neal, Radford M. 2012. *Bayesian Learning for Neural Networks*. Vol. 118. Springer Science & Business Media.