Alternatives: Prior distributions for GP hyperparameters

Overview

In the fully Bayesian approach to emulating a complex simulator, a Gaussian process (GP) is formulated to represent prior knowledge of the simulator. The GP specification is conditional on some hyperparameters, as discussed in the alternatives pages for the emulator prior mean function (AltMeanFunction) and correlation function (AltCorrelationFunction), and in the discussion page on the GP covariance function (DiscCovarianceFunction). Specifically, in the core problem that is the subject of the core threads using Gaussian process methods (ThreadCoreGP) or Bayes linear methods (ThreadCoreBL), a vector \(\beta\) defines the detailed form of the mean function, a scalar \(\sigma^2\) quantifies the uncertainty or variability of the simulator around the prior mean function, and \(\delta\) is a vector of hyperparameters defining details of the correlation function. Threads that deal with variations on the basic core problem may introduce further hyperparameters.

In particular, in the multi-output variant (see the thread on the analysis of a simulator with multiple outputs using Gaussian process methods (ThreadVariantMultipleOutputs)) there is a range of possible parameterisations which require different (more general) prior specifications; the univariate case is typically recovered as the simplification to scalar settings. For example, if an input-output separable multi-output emulator is used, then the prior specification will be over a matrix \(\beta\), a matrix \(\Sigma\) and a set of correlation function hyperparameters \(\delta\). Priors for this case are discussed in the alternatives page on prior distributions for multivariate GP hyperparameters (AltMultivariateGPPriors).

A fully Bayesian analysis requires hyperparameters to be given prior distributions. We consider here alternative ways to specify prior distributions for the hyperparameters of the core problem. Prior distributions for other hyperparameters are addressed in the relevant variant thread. Hyperparameters may be handled differently in the Bayes Linear approach - see ThreadCoreBL.

Choosing the Alternatives

The prior distributions should be chosen to represent whatever prior knowledge the analyst has about the hyperparameters. However, the prior distributions will be updated with the information from a set of training runs, and if there is substantial information in the training data about one or more of the hyperparameters then the prior information about those hyperparameters may be essentially irrelevant.

In general we require a joint distribution \(\pi(\beta,\sigma^2,\delta)\) for all the hyperparameters. Where required, we will denote the marginal distribution of \(\delta\) by \(\pi_{\delta}(\delta)\), and similarly for marginal distributions of other groups of hyperparameters.

The Nature of the Alternatives

Priors for \(\sigma^2\)

In most applications, there will be plenty of information about \(\sigma^2\) in the training data. We typically have at least 100 training runs, and 100 observations would usually be considered adequate to estimate a variance. Unless there is strong prior information available regarding this hyperparameter, it would be acceptable to use the conventional weak prior specification

\[\pi_{\sigma^2}(\sigma^2) \propto \sigma^{-2}\]

independently of the other hyperparameters.

In situations where the training data are more sparse, which may arise for instance when the simulator is computationally demanding, prior information about \(\sigma^2\) may make an important contribution to the analysis. Genuine prior information about \(\sigma^2\) in the form of a proper prior distribution should be specified by a process of elicitation - see references at the end of this page. See also the discussion of conjugate prior distributions below.

Priors for \(\beta\)

Again, we would expect to find that in most applications there is enough evidence in the training data to identify \(\beta\) well, particularly when the mean function is specified in the linear form, so that the elements of \(\beta\) are regression parameters. Then it is acceptable to use the conventional weak prior specification

\[\pi_{\beta}(\beta) \propto 1\]

independently of the other hyperparameters.
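To make the two weak specifications concrete, the sketch below (not part of the toolkit itself) evaluates the log of the joint improper weak prior \(\pi(\beta,\sigma^2) \propto \sigma^{-2}\), as it might appear as one term in a log-posterior computation; the function name is ours.

```python
import numpy as np

def log_weak_prior(beta, sigma2):
    """Log of the improper joint weak prior pi(beta, sigma^2) proportional
    to sigma^(-2).

    The flat prior on beta contributes only a constant (taken as 0 here),
    so beta does not affect the value; sigma^2 contributes -log(sigma^2).
    Returns -inf outside the support sigma^2 > 0.
    """
    beta = np.asarray(beta, dtype=float)  # accepted for interface symmetry
    if sigma2 <= 0:
        return -np.inf
    return -np.log(sigma2)

# The prior density ratio between sigma^2 = 1 and sigma^2 = 4 is 4,
# for any beta, reflecting the preference for smaller variances.
ratio = np.exp(log_weak_prior([1.0], 1.0) - log_weak_prior([1.0], 4.0))
```

Being improper, this prior does not integrate to one; it is usable only because the training data render the posterior proper.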

If there is a wish to express genuine prior information about \(\beta\) in the form of a proper prior distribution, then this should be specified by a process of elicitation - see references at the end of this page. See also the discussion of conjugate prior distributions below.

Conjugate priors for \(\beta\) and \(\sigma^2\)

When substantive prior information exists and is to be specified for either \(\beta\) or \(\sigma^2\), then it is convenient to use conjugate prior distributions if feasible.

If prior information is to be specified for \(\sigma^2\) alone (with the weak prior specification adopted for \(\beta\)), the conjugate prior family is the inverse gamma family. This can be elicited using the SHELF package referred to at the end of this page.
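As an illustration of working with an inverse gamma prior for \(\sigma^2\), the following sketch uses `scipy.stats.invgamma`; the elicited values (a best guess of 2.0 for \(\sigma^2\) and shape \(a=3\)) are hypothetical, standing in for judgements that would in practice come from a formal elicitation.

```python
from scipy.stats import invgamma

# Hypothetical elicited judgements: prior mean 2.0 for sigma^2, and
# shape a = 3 reflecting moderate confidence (larger a -> tighter prior).
a = 3.0
mean_sigma2 = 2.0

# The inverse gamma with shape a and scale b has mean b / (a - 1) for a > 1,
# so choose b to match the elicited prior mean.
b = mean_sigma2 * (a - 1.0)

prior = invgamma(a, scale=b)

print(prior.mean())  # 2.0
```

The right skew of the inverse gamma means the prior median sits below the prior mean, which is worth checking against the expert's stated central value during elicitation.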

If \(\beta\) is a vector of regression parameters in a linear form of mean function, and prior information is to be specified about both \(\beta\) and \(\sigma^2\), then the conjugate prior family is the normal inverse gamma family. Specifying such a distribution is a complex business - see the reference to Oakley (2002) at the end of this page.
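For concreteness, one common parameterisation of the normal inverse gamma family (the symbols \(m\), \(V\), \(a\) and \(b\) are introduced here for illustration and are not part of the notation above) takes

\[\beta \,|\, \sigma^2 \sim N(m, \sigma^2 V), \qquad \sigma^2 \sim \mbox{Inv-Gamma}(a, b),\]

so that the joint density is

\[\pi(\beta,\sigma^2) \propto (\sigma^2)^{-(a + q/2 + 1)} \exp\left\{-\frac{1}{\sigma^2}\left(b + \frac{1}{2}(\beta - m)^{\rm T} V^{-1}(\beta - m)\right)\right\},\]

where \(q\) is the dimension of \(\beta\). The coupling of the prior variance of \(\beta\) to \(\sigma^2\) through the factor \(\sigma^2 V\) is what makes the family conjugate, and is also what makes careful elicitation of \(m\), \(V\), \(a\) and \(b\) a complex business.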

Although these conjugate prior specifications make subsequent updating using the training data as simple as in the case of weak priors, the details are not given in the MUCM toolkit because it is expected that weak priors for \(\beta\) and \(\sigma^2\) will generally be used. If needed, information on using conjugate priors in building an emulator can be found in the Oakley (2002) reference at the end of this page. (The case of an inverse gamma prior for \(\sigma^2\) combined with a weak prior for \(\beta\) is a special case of the general normal inverse gamma prior.)

If prior information is to be specified for \(\beta\) alone, the conjugate prior family is the normal family, but for full conjugacy the variance of \(\beta\) should be proportional to \(\sigma^2\) in the same way as is found in the normal inverse gamma family. This seems unrealistic when weak prior information is to be specified for \(\sigma^2\), and so we do not discuss this conjugate option further.

Priors for \(\delta\)

The situation with \(\delta\) is quite different, in that the information in the training data will often fail to identify these hyperparameters to sufficient accuracy. It can therefore be difficult to estimate their values effectively, and inappropriate estimates can lead to poor emulation. Also, it is known that in the case of a Gaussian or an exponential power correlation function (see the alternatives page on emulator prior correlation function (AltCorrelationFunction)) conventional weak prior distributions may lead to improper posterior distributions. See the alternatives page on estimators of correlation hyperparameters (AltEstimateDelta).

Accordingly, prior information about \(\delta\) can be very useful, or even essential, in obtaining a valid emulator. Fortunately, genuine prior information about \(\delta\) often exists.

Consider the case of a Gaussian correlation function, where the elements of \(\delta\) are correlation lengths which define the smoothness of the simulator output as each input is varied. The experience of the users and developers of the simulator may suggest how stable the output should be in response to varying individual inputs. In particular, it may be possible to specify a range for each input such that it is not expected the output will “wiggle” (relative to the mean function). For other forms of correlation function, there may again be genuine prior information about the smoothness and/or regularity that is controlled by various elements of \(\delta\). We suggest that even quite crudely elicited distributions for these parameters will be better than adopting any default priors.
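One crude but serviceable recipe for turning such a judgement into a prior, sketched below under our own assumptions (it is not a toolkit prescription), is to treat an elicited plausible range for a correlation length \(\delta_i\) as the 5th and 95th percentiles of a lognormal distribution, which respects the positivity of \(\delta_i\).

```python
import numpy as np
from scipy.stats import lognorm, norm

# Hypothetical elicitation for one input, scaled to [0, 1]: the expert
# judges the output varies smoothly over changes of roughly 0.2 to 0.5
# of the input range.  Interpret these as the 5th and 95th percentiles
# of a lognormal prior on the correlation length delta_i.
lo, hi = 0.2, 0.5
z = norm.ppf(0.95)  # standard normal 95th percentile

# Solve for the lognormal parameters: log(delta_i) ~ N(mu, sigma^2)
mu = 0.5 * (np.log(lo) + np.log(hi))
sigma = (np.log(hi) - np.log(lo)) / (2.0 * z)

prior = lognorm(s=sigma, scale=np.exp(mu))

# The elicited percentiles are reproduced by construction.
print(prior.ppf(0.05), prior.ppf(0.95))  # 0.2, 0.5 (to rounding)
```

Even a rough prior of this kind bounds \(\delta_i\) away from the implausibly small and implausibly large values at which estimation from the training data alone is known to misbehave.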