Weisser Zwerg Logo

Weisser Zwerg

Model Comparison via Bayes Factor

Published by Weisser Zwerg Blog on

Especially for unsupervised learning use-cases model comparison via bayes factor helps you select the best model.

Rational

If you have a use-case that involves supervised training then methods like cross-validation help you to decide which model works best for your data. But if you have to deal with unsupervised learning use-cases (like density estimation or similar) then a technique like comparing the bayes factor for different models may help you to select the “best” model.

While the technique is in principle simple seeing a few examples in code may help you to get started.

Notation / Level of Detail

In my opinion notation is important. My very personal, subjective feeling is that notation of probability theory is lacking[1]. Often I only understand what the math equations want to tell me when reading the surrounding text. Good math notation would not require the surrounding text, but only the math equations.

That is why I decided to be very explicit and “slow” in my explanation of the topic below. Seeing code also may help a lot.

Code

You can find the associated code on github or you browse directly the model-comparison-via-bayes-factor.ipynb jupyter notebook.

Coin flip example

The first time I learned about model comparison and the bayes factor was when I read Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan chapter 10 “Model Comparison and Hierarchical Modeling”. Therefore, the example I’d like to start with is the example used in the book in chapter 10.2 “Two Factories of Coins” (p. 268). You can find the full details of the example in the book, but just quickly: imagine that there are two different factories of coins, the first one producing tail-biased coins and the second one producing head-biased coins. We model the tail-biased coin factory as having a tail-bias distribution θbeta(θω1(κ2)+1,(1ω1)(κ2)+1)\theta\sim\mathrm{beta}(\theta\,|\,\omega_1(\kappa-2)+1,(1-\omega_1)(\kappa-2)+1) with ω1=0.25,κ=12\omega_1=0.25, \kappa=12 and then sampling from a bernoulli-distribution with p=θp=\theta. Similarly we model the head-biased coin factory via θbeta(θω2(κ2)+1,(1ω2)(κ2)+1)\theta\sim\mathrm{beta}(\theta\,|\,\omega_2(\kappa-2)+1,(1-\omega_2)(\kappa-2)+1) with ω2=0.75,κ=12\omega_2=0.75, \kappa=12. Don’t worry if you’re not fluent in math, we do the calculations below in numpy. Important is that the book solution gives us values to compare against.

The full model is therefore:

pbeta(a1/2,b1/2)Xbernoulli(p)\begin{array}{lcl} p&\sim&\mathrm{beta}(a_{1/2},b_{1/2})\\ X&\sim&\mathrm{bernoulli}(p) \end{array}

where

a1/2=ω1/2(κ2)+1b1/2=(1ω1/2)(κ2)+1\begin{array}{lcl} a_{1/2}&=&\omega_{1/2}(\kappa-2)+1\\ b_{1/2}&=&(1-\omega_{1/2})(\kappa-2)+1 \end{array}

Model comparison then works by comparing the value of the likelihood function for the given data after integrating out all model parameters. The likelihood function is the model probability distribution function seen as a function of the parameters θ\theta (be aware that θ\theta stands for a group/set of parameters) and not seen as a function of the data.

Let’s do this slowly. Let’s first construct the probability distribution:

Pr(X=H/T,p=θ)=Pr(X=H/Tp=θ)Pr(θ)=bernoulli(X=H/Tp)beta(a,b)\mathrm{Pr}(X=H/T,p=\theta)=\mathrm{Pr}(X=H/T\,|\,p=\theta)\mathrm{Pr}(\theta)=\mathrm{bernoulli}(X=H/T\,|\,p)\cdot\mathrm{beta}(a,b)

The first probability distribution Pr\mathrm{Pr} on the very left is a joint probability distribution of the data XX (which can take on the values HH (head) or TT (tail)) and the model parameters (here only one parameter pp). As you do not know pp for sure you have to split that joint probability distribution in a part that gives you the probability of the data under the assumption that you know the parameter(s) (this is a conditional probability distribution written as Pr(......)\mathrm{Pr}(...\,|\,...)) and the so called prior distribution for the parameter(s), in our case Pr(θ)\mathrm{Pr}(\theta). In our case the parameter pp has to be in the interval between 0 and 1: p[0,1]p\in[0,1]. A probability distribution that ensures that the sampled values are continuous and fall into the interval between 0 and 1 is the beta\mathrm{beta} distribution. The beta\mathrm{beta} distribution is for beta(a=1,b=1)\mathrm{beta}(a=1,b=1) flat, e.g. beta(a=1,b=1)\mathrm{beta}(a=1,b=1) might be an adequate choice for the prior if you’d say “I have no clue, whatsoever, what pp might be” (a flat distribution would give every value of pp the same prior probability). In our case, the example says, we have some clue and expect for example for the tail-biased factory pp to be distributed around 0.250.25 with a breadth given by the concentration parameter κ\kappa.

The above is the probability distribution for one data-point. Now in our example, as given by the book, we have 9 data-points and it says that we have 6 heads and 3 tails, e.g. [T,T,T,H,H,H,H,H,H][T,T,T,H,H,H,H,H,H] (or some permutation thereof). In addition we assume that each data-point is independent from the other data-points. This is called an i.i.d. (independent and identically distributed) random variable. For i.i.d. random variables the overall probability distribution is given by taking the prior and multiplying it with the conditional probability distribution once for each data-point. In math this is:

Pr(X1=H/T,X2=H/T,X3=H/T,X4=H/T,X5=H/T,X6=H/T,X7=H/T,X8=H/T,X9=H/T,p=θ)=\mathrm{Pr}(X_1=H/T,X_2=H/T,X_3=H/T,X_4=H/T,X_5=H/T,X_6=H/T,X_7=H/T,X_8=H/T,X_9=H/T,p=\theta)=

Pr(θ)i=19Pr(Xi=H/Tp=θ)\mathrm{Pr}(\theta)\prod_{i=1}^9\mathrm{Pr}(X_i=H/T\,|\,p=\theta)

Model comparison now uses the idea to compare the probability of the data under the model Pr(DM)\mathrm{Pr}(D\,|\,M). This boils down to integrating out the parameters that belong to the model to obtain a single probability value for the model pMp_M:

pM=L(D,θ)dθ=Pr(D,θ)dθp_M=\int L(D,\theta)\mathrm{d}\theta=\int\mathrm{Pr}(D,\theta)\mathrm{d}\theta

Let’s see this in action via python code:

# Flip coin 9 times and get 6 heads
samples = np.array([0,0,0,1,1,1,1,1,1])

def fn(a, b):
return lambda p: stats.bernoulli(p).pmf(samples).prod() * stats.beta(a,b).pdf(p)

scipy.integrate.quad(fn(a,b),0.0,1.0)

As the scipy.stats probability distribution functions (like bernoulli or beta) are functions of the data and not of the parameters we have to make the detour of first creating a function fn that returns a function of the parameter p. If we call fn(a,b) it returns a function that takes a single argument p and returns the probability of the model for exactly that value of p. scipy.integrate.quad then does the job of integrating that function between the boundaries p[0.0,1.0]p\in[0.0,1.0].

If we then look at concrete values of pMp_M we get the following:

def ok2ab(omega, kappa):
return omega*(kappa-2)+1,(1-omega)*(kappa-2)+1

a1,b1 = ok2ab(omega1,kappa1)
a2,b2 = ok2ab(omega2,kappa2)

scipy.integrate.quad(fn(a1,b1),0.0,1.0)
> (0.0004993438720702594, 2.322913347907592e-09)

scipy.integrate.quad(fn(a2,b2),0.0,1.0)
> (0.002338556142956198, 2.1998184481120156e-11)

The ok2ab function is just a helper to convert from an ω,κ\omega,\kappa parametrization of the beta()\mathrm{beta}() function to the a,ba, b parametrization that the scipy.stats.beta implementation of the beta()\mathrm{beta}() function requires. The value we get for the tail-biased model is pMT=0.000499p_{M_T}=0.000499 and the value we get for the head-biased model is pMH=0.002339p_{M_H}=0.002339, which match exactly the values as given in Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan p.271.

The bayes-factor is now simply the relation between the two: BF=pMH/pMT=4.68\mathrm{BF}=p_{M_H}/p_{M_T}=4.68. The interpretation of this value differs in Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan p.268 from the one given in wikipedia. In the book they say that a factor bigger than 3.03.0 or less than 1/31/3 should be taken as “substantial” evidence for or against the model. In wikipedia the “substantial” classification starts only at 5.05.0. But in all real-world scenarios I’ve seen so far this factor is so huge that these “tiny” differences do not matter.

Just to mention it: for this simple example model there is an analytical solution to calculating pMp_M and in the notebook this analytical solution is implemented via the pD(z,N,a,b) function (pD stands for “probability of the data”).

PyMC3

The above simple example was hopefully instructive enough so that we can now move quicker. Let’s see how you would do model comparison via the bayes-factor method in a PyMC3 model. I’ve simply adapted the example as given in the PyMC3 documentation: Bayes Factors and Marginal Likelihood (be aware that the interface seems to have changed between PyMC3 3.7 and PyMC3 3.8 and I used the newer interface).

priors = ((a1, b1), (a2, b2))
models = []
traces = []
for alpha, beta in priors:
with pm.Model() as model:
a = pm.Beta('a', alpha, beta)
yl = pm.Bernoulli('yl', a, observed=samples)
trace = pm.sample_smc(4000, n_steps=100, random_seed=42)
models.append(model)
traces.append(trace)
models[0].marginal_likelihood
> 0.0004979784247939404
models[1].marginal_likelihood
> 0.002364542043803754
BF_smc = models[1].marginal_likelihood / models[0].marginal_likelihood
BF_smc
> 4.75

The Sequential Monte Carlo sampler as implemented in PyMC3 gives you a model property .marginal_likelihood. This property is the probability of the data given the model (what we called Pr(DM)\mathrm{Pr}(D\,|\,M) above). This is only an estimate, but quite useful and close enough for most of my purposes.

The so called bayesian information criterion (BIC) is an approximation of the Pr(DM)\mathrm{Pr}(D\,|\,M). There are other approximations like the Akaike information criterion (AIC) that are similar.

PyMC3 implements in its model comparison functionality the widely-applicable information criterion (WAIC). I did not read the details yet (this is on my TODO list), but for model-comparison purposes these approximations may be useful, too.

PyMC3 log-likelihood at point-estimate

In case that you’d like to compare the log likelihood of models at a point estimate (e.g. at the point estimate of each free random variable’s mean value) you can do that via the model’s .logp() method. As PyMC3 is transforming the free variables to facilitate the sampling process you have to take care of the transformations yourself.

Let’s assume the following sample data and a model of a simple Gauß distribution:

samples2 = stats.norm(loc=-3, scale=0.5).rvs(size=1000, random_state=np.random.RandomState(42))

with pm.Model() as mcmc_model:
mcmc_model_m = pm.Uniform('mean', lower=-10., upper=10.) # , transform=None
mcmc_model_s = pm.Uniform('s', lower=0.1, upper=5.0) # , transform=None
mcmc_model_gs = pm.Normal('gs', mu=mcmc_model_m, sigma=mcmc_model_s, observed=samples2)

with mcmc_model:
mcmc_model_trace = pm.sample(2000, tune=1000, init='adapt_diag')

We can calculate the model log-likelihood at the point-estimate of the underlying parameters μ=3.0\mu=-3.0 and σ=0.5\sigma=0.5 (assuming a flat improper prior) via:

stats.norm(loc=-3, scale=0.5).logpdf(samples2).sum()
> -704.9307117016448

With PyMC3 you could do it like:

mcmc_model_gs.logp({'mean_interval__': mcmc_model_m_d.transformation.forward(-3.0).eval(), 's_interval__': mcmc_model.deterministics[1].transformation.forward(0.5).eval()})
> array(-704.9307117)

Or if you’d like to use the proper prior:

mcmc_model.logp({'mean_interval__': mcmc_model_m_d.transformation.forward(-3.0).eval(), 's_interval__': mcmc_model.deterministics[1].transformation.forward(0.5).eval()})
> array(-709.00200049)

As said above already: as PyMC3 is transforming the free variables to facilitate the sampling process you have to take care of the transformations yourself.

If you’d like to take the point-estimate at the estimated values of the parameters (e.g. their means), you could to it as follows:

mcmc_model_gs.logp(dict([[varobj.name, mcmc_model_trace[varobj.name].mean()] for varobj in mcmc_model.free_RVs]))
> array(-704.28913595)

All the code is in the notebook so go there to see all the details. I just showed you the key pieces.

More advanced use-cases

In even more advanced use-cases, like in generative deep models, the probability distributions are often only given “implicitly” as explained in Deep Learning. You may be able to construct an MCMC (Markov chain Monte Carlo) chain to draw samples from such a deep network like for example shown in chapter 20.11 “Drawing samples from Autoencoders” in Deep Learning. Once you can draw samples and at least can approximate the probability distribution you can use the method as given in chapter 18.7 “Estimating the Partition Function” in Deep Learning to get an estimate of the model log-likelihood via importance sampling.


  1. To be fair: the notation is not lacking, but it is so verbose that most people abbreviate it and then it becomes lacking. I have no proposal for an improvement, though. ↩︎

Have you written a response to this? Let me know the URL via telegraph.

No mentions yet.