Bayesian Machine Learning and Julia are a match made in heaven
Bayesian inference with less pain and even less code.

Introduction
As I argued in an earlier article, Bayesian Machine Learning can be quite powerful. Building actual Bayesian models in Python, however, is sometimes a bit of a hassle. Most solutions that you will find online are either relatively complex or require learning yet another domain-specific language. The latter could easily constrain your expressiveness when you need a highly customized solution.
Doing Bayesian Machine Learning in Julia, on the other hand, allows you to mitigate both these issues. In fact, you just need a few lines of raw Julia code to build, for example, a Bayesian Neural Network for regression. Julia's Flux and Turing packages will then handle the heavy workload under the hood.
Hence today, I want to show you how to implement and train a Bayesian Neural Network in less than 30 lines of Julia. Before showing you the code, let us briefly recall the main theoretical aspects:
Bayesian Machine Learning in three steps
As always, we want to find a posterior distribution via Bayes' law:
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}
As the data term in the denominator is a constant, we can simplify the above:
p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)
To avoid confusion, let us use the following standard wording:
- p(D | θ) - the likelihood function
- p(θ) - the prior distribution
- p(θ | D) - the posterior distribution
- p(D) - the evidence, i.e. the constant data term
For Bayesian Neural Network regression, we further specify the likelihood function:
p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} \mathcal{N}\!\left(y_i \mid NN_{W}(x_i),\, \sigma^2\right)
This denotes a product of independent normal distributions with means defined by the outputs of a Neural Network. The variance of the Normal distribution is chosen to be a constant.
The corresponding prior distribution could look as follows:
p(\theta) = \left( \prod_{k=1}^{K} \mathcal{N}(w_k \mid 0, 1) \right) \cdot \mathrm{Gamma}(\sigma \mid 1, 1)
The priors for the K network weights are independent standard normal distributions. For the square root of the variance (a.k.a. the standard deviation), we use a standard Gamma distribution. So, from a theory perspective, we are all set up and ready to go.
Ideally, we now want to implement the Bayesian Neural Network in the following steps:
- Define the likelihood function
- Define the prior distribution
- Train the model
Having these three steps separate from each other in the code will help us to:
- Maintain readability - besides the corresponding functions being smaller, a potential reader can more easily tell the likelihood apart from the prior.
- Keep the code testable at a granular level - Likelihood and prior distribution are clearly separate concerns. Thus, we should also be able to test them individually.
With this in mind, let us start building the model in Julia.
Defining the likelihood function
The Flux library provides everything we need to build and work with Neural Networks. It has Dense to build feedforward layers and Chain to combine the layers into a network. In fact, they work just like the Layer and Network structs that we defined in an earlier article.
Our Likelihood struct therefore consists of the Neural Network, network, and the standard deviation, sigma. In the feedforward pass, we use the network's output and sigma to define the conditional mean and standard deviation of the Gaussian likelihood:
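A minimal sketch of such a struct could look as follows - the names Likelihood, network and sigma follow the text above, while the callable-struct pattern and the remaining details are just one possible implementation choice:

```julia
using Flux, Distributions

# Likelihood = Neural Network for the conditional mean + constant noise scale sigma
struct Likelihood
    network   # e.g. a Flux Chain
    sigma     # standard deviation of the Gaussian noise
end

# Feedforward pass: one Normal distribution per network output,
# each with the network output as mean and sigma as standard deviation
(l::Likelihood)(x) = Normal.(vec(l.network(x)), l.sigma)
```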
The dot in Normal.(...) lets us define one Normal distribution per network output, each with standard deviation sigma. We could combine this with logpdf(...) from the Distributions library in order to train the model with maximum likelihood gradient descent. To perform Bayesian Machine Learning, however, we need to add a few more elements.
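Building on the sketch above, such a maximum-likelihood objective could look roughly like this (the helper name loglik is made up for illustration):

```julia
# Sum of independent Gaussian log-densities, one per observation
loglik(l::Likelihood, x, y) = sum(logpdf.(l(x), vec(y)))
```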
This leads us to the central function of this article, namely Flux.destructure(). From the documentation:
destructure(m)
Flatten a model's parameters into a single weight vector.
julia> m = Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
julia> θ, re = destructure(m);
julia> θ
67-element Vector{Float32}:
-0.1407104
...
The second return value re allows you to reconstruct the original network after making modifications to the weight vector (for example, with a hypernetwork).
julia> re(θ .* 2)
Chain(Dense(10, 5, σ), Dense(5, 2), softmax)
In summary, destructure(...) takes an instantiated model struct and returns a tuple with two elements:
- The model parameters, concatenated into a single vector
- A reconstructor function that takes a parameter vector like the one in 1. as input and returns the model with those parameters
The latter is important as we can feed an arbitrary parameter vector to the reconstructor. As long as its length is valid, it returns the corresponding model with the given parameter configuration. In code:
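A rough sketch of these helpers - only Flux.destructure is an actual library function, the names unpack and rebuild are made up for illustration:

```julia
using Flux

# Flatten the network inside the Likelihood into a single parameter vector
# and keep the reconstructor around for later
function unpack(l::Likelihood)
    weights, reconstructor = Flux.destructure(l.network)
    return weights, reconstructor
end

# Rebuild a Likelihood from a raw weight vector and a standard deviation -
# weights and sigma can now be supplied separately
rebuild(reconstructor, weights, sigma) = Likelihood(reconstructor(weights), sigma)
```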
The last function will allow us to provide weights and standard deviation parameters separately to the reconstructor. This is a necessary step in order for Turing to handle the Bayesian inference part.
From here, we are ready to move to the prior distribution.
Defining the prior distribution
This part is very short - we only need to define the prior distributions for the weight vector and the standard deviation scalar:
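A minimal sketch, assuming weights is the flattened weight vector returned by destructure above; an MvNormal with identity covariance is equivalent to K independent standard normals:

```julia
using Distributions, LinearAlgebra

n_weights    = length(weights)                 # K, the number of network weights
weight_prior = MvNormal(zeros(n_weights), I)   # K independent standard normal weight priors
sigma_prior  = Gamma(1.0, 1.0)                 # standard Gamma prior on the noise standard deviation
```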
Having defined both likelihood and prior, we can take samples from the prior predictive distribution,
p(y^\ast \mid x^\ast) = \int p(y^\ast \mid x^\ast, \theta)\, p(\theta)\, d\theta
While this might look complicated as a formula, we are basically just drawing Monte Carlo samples:
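A possible sketch, reusing the reconstructor and the priors from above (the function name and the number of draws are arbitrary choices):

```julia
# One draw from the prior predictive at inputs x (a features × N matrix):
# sample parameters from the prior, then sample targets from the likelihood
function sample_prior_predictive(x, reconstructor, weight_prior, sigma_prior)
    weights = rand(weight_prior)                  # draw a weight vector from the prior
    sigma   = rand(sigma_prior)                   # draw a noise scale from the prior
    lik     = Likelihood(reconstructor(weights), sigma)
    return rand.(lik(x))                          # one Normal sample per input
end

# e.g. 100 Monte Carlo draws from the prior predictive
prior_draws = [sample_prior_predictive(x, reconstructor, weight_prior, sigma_prior) for _ in 1:100]
```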
The prior predictive distribution itself includes the noise from sigma. Prior predictive draws from the network alone, i.e. the prior predictive mean, yield nice and smooth samples:

Now, we can actually train the model.
Training the Bayesian Neural Network
For this example, we'll be using synthetic data, sampled from

The latter factor denotes a uniform density over (-2,2).
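Purely for illustration, a data set of this kind could be generated as sketched below; the uniform inputs over (-2, 2) follow the text, whereas the sine-shaped mean function and the noise level are placeholder assumptions:

```julia
using Distributions, Random

Random.seed!(42)

N = 100
x = rand(Uniform(-2, 2), N)              # uniform inputs over (-2, 2)
y = sin.(2 .* x) .+ 0.1 .* randn(N)      # PLACEHOLDER mean function and noise level
X = reshape(x, 1, :)                     # 1×N input matrix, as expected by the Flux network
```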

In order to use Turing, we need to define a model as explained in its documentation. Applied to our example, we get the following:
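A sketch of what such a Turing model could look like, given the prior and likelihood defined earlier; the model name bayesian_nn and the exact argument list are assumptions:

```julia
using Turing, Distributions, LinearAlgebra

@model function bayesian_nn(X, y, reconstructor, n_weights)
    # Priors: independent standard normal weights, standard Gamma noise scale
    weights ~ MvNormal(zeros(n_weights), I)
    sigma   ~ Gamma(1.0, 1.0)

    # Rebuild the network from the sampled weight vector
    net = reconstructor(weights)
    mu  = vec(net(X))

    # Gaussian likelihood with constant noise standard deviation sigma
    y ~ MvNormal(mu, sigma^2 * I)
end
```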
As our model is still relatively small, Hamiltonian Monte Carlo (HMC) is a suitable choice. In fact, HMC is generally considered the gold standard algorithm for Bayesian Machine Learning. Unfortunately, it becomes quite inefficient in high dimensions.
Nevertheless, we now use HMC via Turing and collect the resulting draws from the MCMC posterior:
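For example - the step size, the number of leapfrog steps and the number of samples below are illustrative guesses rather than tuned values:

```julia
# Condition the model on the training data and sample with HMC
model = bayesian_nn(X, y, reconstructor, n_weights)
chain = sample(model, HMC(0.025, 4), 5_000)
```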
From here, we can visualize the full posterior predictive distribution,
p(y^\ast \mid x^\ast, \mathcal{D}) = \int p(y^\ast \mid x^\ast, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
This is done in a similar fashion as for the prior predictive distribution (star-variables denote new inputs outside the training set). The only difference is that we now use the samples from the MCMC posterior distribution.
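One way to sketch this, assuming the chain from above and a grid of new inputs (all helper names are assumptions, and the column ordering returned by MCMCChains is assumed to match the flattened weight vector):

```julia
using Distributions, MCMCChains

x_star = reshape(collect(-3:0.1:3), 1, :)        # grid of new inputs, also outside (-2, 2)

weight_draws = Array(group(chain, :weights))     # n_samples × n_weights matrix
sigma_draws  = Array(group(chain, :sigma))       # n_samples × 1 matrix

# One posterior predictive sample per MCMC draw and per new input
posterior_preds = map(1:size(weight_draws, 1)) do i
    net = reconstructor(weight_draws[i, :])
    rand.(Normal.(vec(net(x_star)), sigma_draws[i]))
end
```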

Using the above example, it is easy to try out other prior distributions.
Plug-and-play with different prior distributions
As another big advantage, Turing can use almost all distributions from the Distributions library as a prior. This also allows us to try out some exotic weight priors, say a Semicircle distribution with radius 0.5. All we have to do is replace the Gaussian prior:
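A sketch of the modified model - only the weight prior line changes; filldist creates n_weights independent Semicircle(0.5) distributions:

```julia
using Turing, Distributions, LinearAlgebra

@model function bayesian_nn_semicircle(X, y, reconstructor, n_weights)
    # Independent Semicircle(0.5) priors on each weight instead of standard normals
    weights ~ filldist(Semicircle(0.5), n_weights)
    sigma   ~ Gamma(1.0, 1.0)

    net = reconstructor(weights)
    y ~ MvNormal(vec(net(X)), sigma^2 * I)
end
```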
With the same setup as before, we get the following posterior predictive distribution:

[Figure: posterior predictive distribution with Semicircle(0.5) prior weight distributions]

The possibilities obviously don't stop here. Another fruitful adjustment could be the introduction of hyperprior distributions, e.g. over the standard deviation of the weight priors.
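One possible variant, sketched under the same assumptions as before, is to sample the standard deviation of the weight priors itself (the Gamma hyperprior below is just an illustrative choice):

```julia
using Turing, Distributions, LinearAlgebra

@model function bayesian_nn_hyperprior(X, y, reconstructor, n_weights)
    # Hyperprior over the standard deviation of the weight priors
    tau     ~ Gamma(1.0, 1.0)
    weights ~ MvNormal(zeros(n_weights), tau^2 * I)
    sigma   ~ Gamma(1.0, 1.0)

    net = reconstructor(weights)
    y ~ MvNormal(vec(net(X)), sigma^2 * I)
end
```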
Using a model different from a Neural Network is also quite simple. You only need to adjust the Likelihood struct and the corresponding functions.
Conclusion
This article was a brief introduction to Bayesian Machine Learning with Julia. As a matter of fact, Julia is not just fast but can also make coding much easier and more efficient.
While Julia is still a young language with some caveats around deploying programs to production, it is definitely an awesome language for research and prototyping. The seamless interoperability between different libraries, in particular, can considerably shorten iteration cycles both inside and outside of academia.
At this point, I am also quite hopeful that we will, at some point in the near future, see more Julia being used in industry-grade production code.
References
[1] Bezanson, Jeff, et al. Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145, 2012.
[2] Ge, Hong, Kai Xu, and Zoubin Ghahramani. Turing: A language for flexible probabilistic inference. International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2018, pp. 1682-1690.