Iterative linearised INLA method

The INLA method for linear predictors

The INLA method is used to compute fast approximative posterior distribution for Bayesian generalised additive models. The hierarchical structure of such a model with latent Gaussian components u, covariance parameters θ, and measured response variables y, can be written as $$ \begin{aligned} \boldsymbol{\theta} &\sim p(\boldsymbol{\theta}) \\ \boldsymbol{u}|\boldsymbol{\theta} &\sim \mathcal{N}\!\left(\boldsymbol{\mu}_u, \boldsymbol{Q}(\boldsymbol{\theta})^{-1}\right) \\ \boldsymbol{\eta}(\boldsymbol{u}) &= \boldsymbol{A}\boldsymbol{u} \\ \boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta} & \sim p(\boldsymbol{y}|\boldsymbol{\eta}(\boldsymbol{u}),\boldsymbol{\theta}) \end{aligned} $$ where typically each linear predictor element, ηi(u), is linked to a location parameter of the distribution for observation yi, for each i, via a (non-linear) link function g−1(⋅). In the R-INLA implementation, the observations are assumed to be conditionally independent, given η and θ.

Approximate INLA for non-linear predictors

The premise for the inlabru method for non-linear predictors is to build on the existing implementation, and only add a linearisation step. The properties of the resulting approximation will depend on the nature of the non-linearity.

Let $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})$ be a non-linear predictor, i.e. a deterministic function of u, $$ \widetilde{\boldsymbol{\eta}} (\boldsymbol{u}) = \textsf{fcn} (\boldsymbol{u}), $$ and let $\overline{\boldsymbol{\eta}}(\boldsymbol{u})$ be the 1st order Taylor approximation at u0, $$ \overline{\boldsymbol{\eta}}(\boldsymbol{u}) = \widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0) + \boldsymbol{B}(\boldsymbol{u} - \boldsymbol{u}_0) = \left[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_0) - \boldsymbol{B}\boldsymbol{u}_0\right] + \boldsymbol{B}\boldsymbol{u} , $$ where B is the derivative matrix for the non-linear predictor, evaluated at u0. Hence, we define $$ \begin{aligned} \boldsymbol{y} | \boldsymbol{u}, {\boldsymbol{\theta}} &\overset{d}{=} \boldsymbol{y} | \widetilde{\boldsymbol{\eta}}(\boldsymbol{u}), {\boldsymbol{\theta}} \\ &\sim p (\boldsymbol{y} | g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})], {\boldsymbol{\theta}})\\ \end{aligned} $$ The non-linear observation model $p(\boldsymbol{y}|g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta})$ is approximated by replacing the non-linear predictor with its linearisation, so that the linearised model is defined by

$$ \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) = p(\boldsymbol{y}|\overline{\boldsymbol{\eta}}(\boldsymbol{u}),\boldsymbol{\theta}) = p(\boldsymbol{y}|g^{-1}[\overline{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta}) \approx p(\boldsymbol{y}|g^{-1}[\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})],\boldsymbol{\theta}) = p(\boldsymbol{y}|\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}),\boldsymbol{\theta}) = \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) $$ The non-linear model posterior is factorised as (θ, u|y) = (θ|y)(u|y, θ), and the linear model approximation is factorised as $$ \overline{p}(\boldsymbol{\theta},\boldsymbol{u}|\boldsymbol{y}) = \overline{p}(\boldsymbol{\theta}|\boldsymbol{y})\overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta}). $$

Fixed point iteration

The remaining step of the approximation is how to choose the linearisation point u*. For a given linearisation point v, INLA will compute the posterior mode for θ, $$ \widehat{\boldsymbol{\theta}}_{\boldsymbol{v}} = \mathop{\mathrm{arg\,max}}_{\boldsymbol{\theta}} \overline{p}_\boldsymbol{v} ( {\boldsymbol{\theta}} | \boldsymbol{y} ), $$ and the joint conditional posterior mode for u, $$ \widehat{\boldsymbol{u}}_{\boldsymbol{v}} = \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} \overline{p}_\boldsymbol{v} ( \boldsymbol{u} | \boldsymbol{y}, \widehat{\boldsymbol{\theta}}_{\boldsymbol{v}} ) . $$

Define the Bayesian estimation functional1 $$ f(\overline{p}_{\boldsymbol{v}}) = (\widehat{\boldsymbol{\theta}}_{\boldsymbol{v}},\widehat{\boldsymbol{u}}_{\boldsymbol{v}}) $$ and let $f(p)=(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{u}})$ denote the corresponding posterior modes for the true posterior distribution, $$ \begin{aligned} \widehat{{\boldsymbol{\theta}}} &= \mathop{\mathrm{arg\,max}}_{\boldsymbol{\theta}} p ( {\boldsymbol{\theta}} | \boldsymbol{y} ), \\ \widehat{\boldsymbol{u}} &= \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} p (\boldsymbol{u} | \boldsymbol{y}, \widehat{{\boldsymbol{\theta}}}). \end{aligned} $$

The fixed point $(\boldsymbol{\theta}_*,\boldsymbol{u}_*)=f(\overline{p}_{\boldsymbol{u}_*})$ should ideally be close to $(\widehat{\boldsymbol{\theta}},\widehat{\boldsymbol{u}})$, i.e. close to the true marginal/conditional posterior mode. We can achieve this for the conditional latent mode, so that $\boldsymbol{u}_*=\mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} p (\boldsymbol{u} | \boldsymbol{y}, \widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_*})$.

We therefore seek the latent vector u* that generates the fixed point of the functional, so that $(\boldsymbol{\theta}_*,\boldsymbol{u}_*)=f(\overline{p}_{\boldsymbol{u}_*})$.

One key to the fixed point iteration is that the observation model is linked to u only through the non-linear predictor $\widetilde{\boldsymbol{\eta}}(\boldsymbol{u})$, since this leads to a simplified line search method below.

  1. Let u0 be an initial linearisation point for the latent variables obtained from the initial INLA call. Iterate the following steps for k = 0, 1, 2, ...
  2. Compute the predictor linearisation at u0.
  3. Compute the linearised INLA posterior $\overline{p}_{\boldsymbol{u}_0}(\boldsymbol{\theta}|\boldsymbol{y})$.
  4. Let $(\boldsymbol{\theta}_1,\boldsymbol{u}_1)=(\widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_0},\widehat{\boldsymbol{u}}_{\boldsymbol{u}_0})=f(\overline{p}_{\boldsymbol{u}_0})$ be the initial candidate for new linearisation point.
  5. Let vα = (1 − α)u1 + αu0, and find the value α minimises $\|\widetilde{\eta}(\boldsymbol{v}_\alpha)-\overline{\eta}(\boldsymbol{u}_1)\|$.
  6. Set the new linearisation point u0 equal to vα and repeat from step 1, unless the iteration has converged to a given tolerance.

A potential improvement of step 4 might be to also take into account the prior distribution for u as a minimisation penalty, to avoid moving further than would be indicated by a full likelihood optimisation.

Posterior non-linearity checks

Whereas the inlabru optimisation method leads to an estimate where $\| \widetilde{\boldsymbol{\eta}} (\boldsymbol{u}_*) - \overline{\boldsymbol{\eta}}(\boldsymbol{u}_*)\|=0$ for a specific u*, the overall posterior approximation accuracy depends on the degree of nonlinearity in the vicinity of u*. There are two main options for evaluating this nonlinearity, using sampling from the approximate posterior distribution. The first option is $$ \begin{aligned} \sum_i \frac{E_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y})}\left[ |\overline{\boldsymbol{\eta}}_i(\boldsymbol{u})-\widetilde{\boldsymbol{\eta}}_i(\boldsymbol{u})|^2\right]}{\mathrm{Var}_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y})}(\overline{\boldsymbol{\eta}}_i(\boldsymbol{u}))} , \end{aligned} $$ which is the posterior expectation of the component-wise variance-normalised squared deviation between the non-linear and linearised predictor. Note that the normalising variance includes the variability induced by the posterior uncertainty for θ, whereas the ∥ ⋅ ∥V norm used for the line search used only the posterior mode. Another option is $$ E_{(\boldsymbol{u},\boldsymbol{\theta})\sim \overline{p}(\boldsymbol{u},\boldsymbol{\theta}|\boldsymbol{y})} \left[\ln \frac{\overline{p}(\boldsymbol{u} |\boldsymbol{y},{\boldsymbol{\theta}})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},{\boldsymbol{\theta}})}\right] = E_{\boldsymbol{\theta}\sim \overline{p}(\boldsymbol{\theta}|\boldsymbol{y})} \left\{ E_{\boldsymbol{u}\sim \overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})} \left[\ln \frac{\overline{p}(\boldsymbol{u} |\boldsymbol{y},{\boldsymbol{\theta}})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},{\boldsymbol{\theta}})}\right] \right\} $$ which is the Kullback–Leibler divergence for the conditional posterior densities, $\mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right)$, integrated over the approximate posterior distribution for θ. Implementing this would require access to the likelihood and prior distribution details. The next section explores this in more detail.

Accuracy

We wish to assess how accurate the approximation is. Thus, we compare (u|y, θ) and $\overline{p}(\boldsymbol{u} |\boldsymbol{y},\boldsymbol{\theta})$. With Bayes’ theorem, $$ \begin{aligned} p(\boldsymbol{u}|\boldsymbol{y},{\boldsymbol{\theta}}) &= \frac{p(\boldsymbol{u},\boldsymbol{y}|{\boldsymbol{\theta}})}{p(\boldsymbol{y}|{\boldsymbol{\theta}})} \\ &= \frac{p(\boldsymbol{y}|\boldsymbol{u},{\boldsymbol{\theta}}) p(\boldsymbol{u}|{\boldsymbol{\theta}})}{p(\boldsymbol{y}|{\boldsymbol{\theta}})}, \end{aligned} $$ where p(u|θ) is a Gaussian density and p(y|θ) is a scaling factor that doesn’t depend on u. We can therefore focus on the behaviour of ln p(y|θ, u) for the exact and linearised observation models.

Recall that the observation likelihood only depends on u through η. Using a Taylor expansion with respect to η and $\boldsymbol{\eta}^*=\widetilde{\boldsymbol{\eta}}(\boldsymbol{u}_*)$, $$ \begin{aligned} \ln p(\boldsymbol{y}|\boldsymbol{\eta},\boldsymbol{\theta}) &= \ln p (\boldsymbol{y}|{\boldsymbol{\theta}},\boldsymbol{\eta}^*)) \\ &\qquad + \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p (\boldsymbol{y} | {\boldsymbol{\theta}}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot (\eta_i - \eta^*_i) \\ &\qquad + \frac{1}{2}\sum_{i,j} \left.\frac{\partial^2}{\partial\eta_i\partial\eta_j} \ln p (\boldsymbol{y} | {\boldsymbol{\theta}}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot (\eta_i - \eta^*_i) (\eta_j - \eta^*_j) + \mathcal{O}(\|\boldsymbol{\eta}-\boldsymbol{\eta}^*\|^3), \end{aligned} $$ Similarly, for each component of $\widetilde{\boldsymbol{\eta}}$, $$ \begin{aligned} \widetilde{\eta}_i(\boldsymbol{u}) &= \eta^*_i +\left[\left.\nabla_{u}\widetilde{\eta}_i(\boldsymbol{u})\right|_{\boldsymbol{u}_*}\right]^\top (\boldsymbol{u} - \boldsymbol{u}_*) \\&\quad +\frac{1}{2}(\boldsymbol{u} - \boldsymbol{u}_*)^\top\left[\left.\nabla_{u}\nabla_{u}^\top\widetilde{\eta}_i(\boldsymbol{u})\right|_{\boldsymbol{u}_*}\right] (\boldsymbol{u} - \boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}^*\|^3) \\&= \eta_i^* + b_i(\boldsymbol{u}) + h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \\&= \overline{\eta}_i(\boldsymbol{u}) + h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \end{aligned} $$

where uu is the Hessian with respect to u, bi are linear in u, and hi are quadratic in u. Combining the two expansions and taking the difference between the full and linearised log-likelihoods, we get $$ \begin{aligned} \ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) &= \sum_i \left.\frac{\partial}{\partial\eta_i} \ln p (\boldsymbol{y} | {\boldsymbol{\theta}}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*}\cdot h_i(\boldsymbol{u}) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \end{aligned} $$ Note that the log-likelihood Hessian difference contribution only involves third order u terms and higher, so the expression above includes all terms up to second order.

Let $$ g_i^*=\left.\frac{\partial}{\partial\eta_i} \ln p (\boldsymbol{y} | {\boldsymbol{\theta}}, \boldsymbol{\eta}) \right|_{\boldsymbol{\eta}^*} $$ and Hi* = ∇uuη̃i(u)|u*. and form the sum of their products, G = ∑igi*Hi*. Then $$ \begin{aligned} \ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) &= \frac{1}{2} \sum_i g_i^* (\boldsymbol{u}-\boldsymbol{u}_*)^\top \boldsymbol{H}_i^* (\boldsymbol{u}-\boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3) \\&= \frac{1}{2} (\boldsymbol{u}-\boldsymbol{u}_*)^\top \boldsymbol{G} (\boldsymbol{u}-\boldsymbol{u}_*) + \mathcal{O}(\|\boldsymbol{u}-\boldsymbol{u}_*\|^3). \end{aligned} $$ With $\boldsymbol{m}=\mathsf{E}_\overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})$ and $\boldsymbol{Q}^{-1}=\mathsf{Cov}_\overline{p}(\boldsymbol{u},\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})$, we obtain $$ \begin{aligned} \mathsf{E}_{\overline{p}}\left[ \nabla_{\boldsymbol{u}} \left\{\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right\}\right] &\approx \boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*) , \\ \mathsf{E}_{\overline{p}}\left[ \nabla_{\boldsymbol{u}}\nabla_{\boldsymbol{u}}^\top \left\{\ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right\}\right] &\approx \boldsymbol{G} , \\ \mathsf{E}_{\overline{p}}\left[ \ln \widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta}) - \ln \overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})\right] &\approx \frac{1}{2} \mathop{\mathrm{tr}}(\boldsymbol{G}\boldsymbol{Q}^{-1}) + \frac{1}{2} (\boldsymbol{m}-\boldsymbol{u}_*)\boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*)^\top . \end{aligned} $$ For each θ configuration in the INLA output, we can extract both m and the sparse precision matrix Q for the Gaussian approximation. The non-sparsity structure of G is contained in the non-sparsity of Q, which allows the use of Takahashi recursion (inla.qinv(Q)) to compute the corresponding Q−1 values needed to evaluate the trace tr (GQ−1). Thus, to implement a numerical approximation of this error analysis only needs special access to the log-likelihood derivatives gi*, as Hi* can in principle be evaluated numerically.

For a given θ, $$ \begin{aligned} \mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right) &= E_{\overline{p}}\left[\ln\frac{\overline{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{u}|\boldsymbol{y},\boldsymbol{\theta})}\right] \\&= E_{\overline{p}}\left[ \ln\frac{\overline{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{y}|\boldsymbol{u},\boldsymbol{\theta})} \right] - \ln\frac{\overline{p}(\boldsymbol{y}|\boldsymbol{\theta})}{\widetilde{p}(\boldsymbol{y}|\boldsymbol{\theta})} . \end{aligned} $$ The first term was approximated above. The second term can also be approximated using the derived quantities (to be continued…).

Summary: The form of the observation likelihood discrepancy shows that, given a linearised posterior N(m, Q−1), a Gaussian approximation to the nonlinear model posterior, $\mathsf{N}(\widetilde{\boldsymbol{m}},\widetilde{\boldsymbol{Q}}^{-1})$, can be obtained from $\widetilde{\boldsymbol{Q}}=\boldsymbol{Q}-\boldsymbol{G}$ and $\widetilde{\boldsymbol{Q}}\widetilde{\boldsymbol{m}}=\boldsymbol{Q}\boldsymbol{m}-\boldsymbol{G}\boldsymbol{u}_*$. The K-L divergence becomes $$ \begin{aligned} \mathsf{KL}\left(\overline{p}\,\middle\|\,\widetilde{p}\right) &\approx \frac{1}{2} \left[ \ln\det(\boldsymbol{Q})- \ln\det(\boldsymbol{Q}-\boldsymbol{G}) -\mathop{\mathrm{tr}}\left(\boldsymbol{G}\boldsymbol{Q}^{-1}\right) + (\boldsymbol{m}-\boldsymbol{u}_*)^\top\boldsymbol{G}(\boldsymbol{Q}-\boldsymbol{G})^{-1}\boldsymbol{G}(\boldsymbol{m}-\boldsymbol{u}_*) \right] . \end{aligned} $$ When the INLA posterior has mean m = u*, e.g. for models with additive Gaussian observation noise, and $\boldsymbol{\theta}=\widehat{\boldsymbol{\theta}}_{\boldsymbol{u}_*}$, the last term vanishes.

Note: by implementing the K-L divergence accuracy metric, a by-product would be improved posterior estimates based on $\widetilde{\boldsymbol{m}}$ and $\widetilde{\boldsymbol{Q}}$.

Well-posedness and initialisation

On a side note, one might be concerned about initialisation at, or convergence to, a saddle point. Although it is not implemented in inlabru, we want to talk about the technicality how we define the initial linearisation point u0.

Generally speaking, any values of u0 work except the case that the gradient evaluated at u0 is 0 because the linearisation point will never move away if the prior mean is also 0. In general, this tends to be a saddle point problem. In some cases the problem can be handled by changing the predictor parameterisation or just changing the initialisation point using the bru_initial option. However, for true saddle point problems, it indicates that the predictor parameterisation may lead to a multimodal posterior distribution or is ill-posed in some other way. This is a more fundamental problem that cannot be fixed by changing the initialisation point.

In these examples, where β and u are latent Gaussian components, the predictors 1, 3, and 4 would typically be safe, but predictor 2 is fundamentally non-identifiable. $$ \begin{aligned} \boldsymbol{\eta}_1 &= \boldsymbol{u}, \\ \boldsymbol{\eta}_2 &= \beta \boldsymbol{u}, \\ \boldsymbol{\eta}_3 &= e^\beta \boldsymbol{u}, \\ \boldsymbol{\eta}_4 &= F_\beta^{-1} ( \Phi(z_\beta)) \boldsymbol{u}, \quad z_{\beta} \sim \mathsf{N}(0,1) . \end{aligned} $$ Note that for η3 and η4, the partial derivatives with respect to β are zero for u = 0. However, the first inlabru iteration will give a non-zero estimate of u, so that subsequent iteration will involve both β and u.


  1. Potential other choices for f(⋅) include the posterior expectation $\overline{E}(\boldsymbol{u}|\boldsymbol{y})$ and the marginal conditional modes, $$ \left\{\mathop{\mathrm{arg\,max}}_{u_i} \overline{p}_{\boldsymbol{v}}(u_i|\boldsymbol{y}),\,i=1,\dots,n\right\}, $$ which was used in inlabru up to version 2.1.15, which caused problems for some nonlinear models. From version 2.2.3, the joint conditional mode is used, $$ \mathop{\mathrm{arg\,max}}_{\boldsymbol{u}} \overline{p}_{\boldsymbol{v}}(\boldsymbol{u}|\boldsymbol{y},\widehat{\boldsymbol{\theta}}_{\boldsymbol{v}}), $$ where $\widehat{\boldsymbol{\theta}}=\mathop{\mathrm{arg\,max}}_{\boldsymbol{\theta}} \overline{p}_{\boldsymbol{v}}(\boldsymbol{\theta}|\boldsymbol{y})$.↩︎