Simulation-Based Inference: A Practical Introduction

1. Introduction and Motivation

Simulation-based inference (SBI), also known as likelihood-free inference or implicit likelihood inference, is a modern approach to Bayesian inference that sidesteps the need for tractable likelihoods by learning statistical relationships directly from simulations.

Bayes theorem states:

\[p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}\]

where \(p(\theta)\) is the prior, \(p(x | \theta)\) is the likelihood, and \(p(x) = \int p(x | \theta) p(\theta) d\theta\) is the evidence.

In many scientific domains we often have a simulator that can generate data, \(x \sim p(x | \theta)\), and a prior over our model parameters, \(\theta \sim p(\theta)\), but no tractable likelihood function, \(p(x | \theta)\). In SED fitting, this simulator corresponds to codes that can generate realistic astrophysical spectra given some model parameters. However, in most cases these codes do not provide a flexible likelihood function, \(p({\rm spectrum} | \theta)\), since this would require integrating over the complex physics of the simulator, which is intractable.

SBI leverages these simulations to learn the statistical relationship between parameters and data using machine learning methods. Instead of deriving likelihoods analytically, SBI takes many simulations: \((θ_i, x_i)\) where \(\theta_i \sim p(\theta)\), \(x_i \sim p(x|\theta_i)\), and uses neural networks to learn probability distributions from these samples. These learned distributions can be used to perform inference.

2. Mathematical Framework

2.1 The Joint Distribution

We start with the joint distribution over parameters and data:

\[p(\theta, x) = p(x | \theta) p(\theta)\]

From simulation, we can sample from this joint distribution:

  • Sample \(\theta \sim p(\theta)\)

  • Sample \(x \sim p(x|\theta)\) using the simulator

This gives us a dataset: \(\mathcal{D} = \{(\theta_i, x_i)\}_{i=1}^N\)

There are then three flavours of SBI that can leverage these simulations for posterior inference.

2.2 Neural Posterior Estimation (NPE)

Neural Posterior Estimation (NPE) targets the posterior directly, and learns \(p(\theta \| x)\) directly by training a conditional density estimator \(q_\phi(\theta | x)\) to minimize:

\[\mathcal{L}_{\text{NPE}}(\phi) = -\mathbb{E}_{(\theta, x) \sim p(\theta, x)}[\log q_\phi(\theta | x)]\]

This is the negative log-likelihood of the training data under our neural density estimator, \(q_\phi^*(\theta | x) \approx p(\theta | x)\), where \(\phi^*\) are the optimal parameters.

2.3 Neural Likelihood Estimation (NLE)

Neural Likelihood Estimation (NLE) targets the likelihood, and learns \(p(x \| \theta)\), where \(q_\phi(x | \theta)\) is a conditional density estimator trained to minimize:

\[\mathcal{L}_{\text{NLE}}(\phi) = -\mathbb{E}_{(\theta, x) \sim p(\theta, x)}[\log q_\phi(x | \theta)]\]

Once we have \(q_\phi(x | \theta) \approx p(x | \theta)\), we can use it in MCMC or SMC to sample from \(p(\theta | x_o) \propto q_\phi(x_o | \theta) p(\theta)\).

2.4 Neural Ratio Estimation (NRE)

Neural Ratio Estimation (NRE) aims to approximate the likelihood-to-evidence ratio, \(r(x,\theta) = \frac{p(x\|\theta)}{p(x)}\), using a binary classification trick. If we define

\[r(x, \theta) = \frac{p(x | \theta)}{p(x)} = \frac{p(x, \theta)}{p(\theta)p(x)}\]

We then train a binary classifier \(d_\phi(x, \theta)\) to distinguish between:

  • Positive class: \((x, \theta)\) drawn from joint \(p(x, \theta) = p(x|\theta)p(\theta)\)

  • Negative class: \((x, \theta)\) drawn from marginals \(p(x)p(\theta)\)

The optimal classifier satisfies:

\[d_\phi^*(x, \theta) = \frac{p(x, \theta)}{p(x, \theta) + p(x)p(\theta)} = \frac{r(x, \theta)}{1 + r(x, \theta)}\]

Therefore:

\[r(x, \theta) = \frac{d_\phi^*(x, \theta)}{1 - d_\phi^*(x, \theta)}\]

And we can compute:

\[p(\theta | x_o) \propto r(x_o, \theta) p(\theta) = \frac{d_\phi(x_o, \theta)}{1 - d_\phi(x_o, \theta)} p(\theta)\]

3. Neural Density Estimators

The key to SBI is using flexible neural networks to represent probability distributions. Galaxy SED fitting posteriors can be challenging: strong degeneracies exist between age and metallicity, between dust and redshift, and multimodal posteriors can occur when multiple SFH solutions fit equally well to the data. This is why the choice of density estimator matters. We list here some of the most common architectures.

3.1 Mixture Density Networks (MDN)

A neural network outputs the parameters of a mixture distribution:

\[q_\phi(\theta | x) = \sum_{k=1}^K \pi_k(x) \mathcal{N}(\theta | \mu_k(x), \Sigma_k(x))\]

where a neural network \(f_\phi(x)\) outputs \(\{\pi_k, \mu_k, \Sigma_k\}\) for each mixture component. MDNs are simple, fast, and can represent multimodal distributions, but suffer from limited flexibility, and can fail for highly dimensional problems.

3.2 Normalizing Flows

A normalizing flow transforms a simple base distribution through a series of invertible transformations:

\[\theta = T_\phi(z, x), \quad z \sim p(z)\]

The density is given by the change of variables formula:

\[q_\phi(\theta | x) = p(z) \left| \det \frac{\partial T_\phi^{-1}(\theta, x)}{\partial \theta} \right|\]

A number of common flow architectures exist, including Masked Autoregressive Flows (MAF), Neural Spline Flows (NSF), and Continuous Normalizing Flows (CNF) that leverage neural ODEs. Normalizing flows provide exact density evaluation and sampling, and are very flexible, but can be relatively computationall expensive, and more sensitive to architecture choices.

3.3 Autoregressive Models

Decompose the joint density using the chain rule:

\[q_\phi(\theta | x) = \prod_{i=1}^d q_\phi(\theta_i | \theta_{<i}, x)\]

Each conditional is modeled by a neural network.

4. Sequential vs. Amortized Inference

In amortized inference, you train once on many simulations, then apply to any observation.

\[(\theta_i, x_i) \sim p(\theta, x) \text{ for } i=1,\ldots,N\]

Train \(q_\phi(\theta | x)\) on this dataset. Then for any new observation \(x_o\), directly evaluate \(q_\phi(\theta | x_o)\). This leads to very rapid inference on new data, but may be inaccurate far from the original training distribution.

In sequential inference, you iteratively focus simulations near observed data over multiple rounds. In the first round, you sample from your prior, and train an initial approximation \(q_\phi^{(1)}\). In subsequenbt rounds, \(t\), you sample \(\theta \sim q_\phi^{(t-1)}(\cdot | x_o)\) (focus near posterior), simulate \(x \sim p(x|\theta)\), then retrain to get \(q_\phi^{(t)}\).

Sequential inference is much more simulation-efficient, and often provides better accuracy, but new simulations must be generated for each new observation.

5. Comparison with Traditional Methods

SBI has a number of key advantages over traditional methods. It works with implicit likelihoods; there is no need to derive complex probability densities. Amortization means that you can train once, and apply to many observations cheaply (for NPE). It is scalable, since modern neural networks can handle high-dimensional data. And it does not require hand-crafted distance metrics (unlike ABC).

However, SBI also has a number of challenges. Clear diagnostics are required to evaluate the accuracy of the learnt relationships, as well as convergence. SBI also still requires many simulations for training, and is susceptible to extrapolation errors outside the training data distribution. Finally, stochastic simulators require special handling for variance.

We summarise these pros and cons compared to traditional approaches below.

Method

Requires Likelihood

Flexibility

Simulation Efficiency

MCMC

Yes

High

N/A

Nested Sampling

Yes

High

Medium

Grid-based SED fitting

Yes*

Low

N/A

ABC

No

Medium

Low (many rejections)

NPE

No

High

High (amortized)

NLE + MCMC

No

High

Medium

NRE

No

High

Medium