Procedure: Build Gaussian process emulator of derivatives

Description and Background

The preparation for building a Gaussian process (GP) emulator of derivatives involves defining the prior mean and covariance functions, identifying prior distributions for hyperparameters, creating a design for the training sample, and then running the adjoint, or the simulator, at the input configurations specified in the design. This is described in the generic thread on methods to emulate derivatives (ThreadGenericEmulateDerivatives). The procedure here takes those various ingredients and creates the GP emulator.

Additional notation for this page

Derivative information requires notation beyond that specified in the Toolkit notation page (MetaNotation). As in the procedure page on building a GP emulator with derivative information (ProcBuildWithDerivsGP), we use the following additional notation:

  • The tilde symbol (\(\tilde{}\)) placed over a letter denotes derivative information and function output combined.
  • We introduce an extra argument to denote a derivative. We define \(\tilde{f}(x,d)\) to be the derivative of \(f(x)\) with respect to input \(d\), so that \(d \in\{0,1,\ldots,p\}\). When \(d=0\) we have \(\tilde{f}(x,0)=f(x)\), and for simplicity we adopt the shorter notation \(f(x)\) rather than \(\tilde{f}(x,0)\).
  • An input is denoted by a superscript on \(x\), while a subscript on \(x\) refers to the point in the input space. For example, \(x_i^{(k)}\) refers to input \(k\) at point \(i\).

Inputs

  • GP prior mean function \(m(\cdot)\), differentiable and depending on hyperparameters \(\beta\)
  • GP prior correlation function \(c(\cdot,\cdot)\), twice differentiable and depending on hyperparameters \(\delta\)
  • Prior distribution \(\pi(\cdot,\cdot,\cdot)\) for \(\beta\), \(\sigma^2\) and \(\delta\), where \(\sigma^2\) is the process variance hyperparameter
  • Design, \(\tilde{D} = \{(x_k,d_k)\}\), where \(k = 1,\ldots,\tilde{n}\) and \(d_k \in \{0,1,\ldots,p\}\). Here \(x_k\) is a location in the input space and \(d_k\) determines whether at \(x_k\) we require function output or a first derivative w.r.t one of the inputs (see the encoding sketch after this list). The points \(x_k\) are not necessarily distinct, as we may require both a derivative and the function output at \(x_k\), or derivatives w.r.t several inputs at \(x_k\). If we do not have any derivative information, \(d_k=0, \forall k\), and the resulting design is as in the core thread ThreadCoreGP.
  • Output vector \(\tilde{f}(\tilde{D})\) of length \(\tilde{n}\), with elements \(\tilde{f}(x_k,d_k)\). If we are not including derivatives in the training data, \(d_k=0, \forall k\), and the output vector reduces to \(f(D)\) as in ThreadCoreGP.
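As a concrete illustration of this encoding, here is a minimal sketch in Python, assuming \(p=2\) inputs; the array names `X`, `dvec` and `f_tilde` are ours for illustration, not Toolkit notation.

```python
import numpy as np

# Design D~ = {(x_k, d_k)}, k = 1, ..., n~. A point x_k may appear several
# times when we require more than one piece of information there.
# d_k = 0 requests function output; d_k = j requests df/dx^(j).
X = np.array([[0.2, 0.7],   # (x_1, 0): function output f(x_1)
              [0.2, 0.7],   # (x_1, 1): derivative w.r.t input 1 at x_1
              [0.2, 0.7],   # (x_1, 2): derivative w.r.t input 2 at x_1
              [0.9, 0.4]])  # (x_2, 0): function output f(x_2)
dvec = np.array([0, 1, 2, 0])
# f_tilde holds the corresponding simulator/adjoint outputs, length n~ = 4.
```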

Outputs

A GP-based emulator in one of the forms discussed in the discussion page DiscGPBasedEmulator.

In the case of general prior mean and correlation functions and general prior distribution:

  • A GP posterior conditional distribution with mean function \(\tilde{m}^*(\cdot)\) and covariance function \(\tilde{v}^*(\cdot,\cdot)\) conditional on \(\theta=\{\beta,\sigma^2,\delta\}\).
  • A posterior representation for \(\theta\)

In the case of linear mean function, general correlation function, weak prior information on \(\beta,\sigma^2\) and general prior distribution for \(\delta\):

  • A t process posterior conditional distribution with mean function \(\tilde{m}^*(\cdot)\), covariance function \(\tilde{v}^*(\cdot,\cdot)\) and degrees of freedom \(b^*\) conditional on \(\delta\)
  • A posterior representation for \(\delta\)

As explained in DiscGPBasedEmulator, the “posterior representation” for the hyperparameters is formally the posterior distribution for those hyperparameters, but for computational purposes this distribution is represented by a sample of hyperparameter values. In either case, the outputs define the emulator and allow all necessary computations for tasks such as prediction of the partial derivatives of the simulator output w.r.t the inputs, uncertainty analysis or sensitivity analysis.

Procedure

General case

We define the following arrays (following the conventions set out in MetaNotation where possible).

\(\tilde{e}=\tilde{f}(\tilde{D})-\tilde{m}(\tilde{D})\), an \(\tilde{n}\times 1\) vector, where \(\tilde{m}(\tilde{D})\) has elements \(\tilde{m}(x_k,d_k)=\frac{\partial}{\partial x^{(d_k)}}m(x_k)\) for \(d_k\ne0\) and \(m(x_k)\) for \(d_k=0\).

\(\tilde{A}=\tilde{c}(\tilde{D},\tilde{D}),\) an \(\tilde{n}\times \tilde{n}\) matrix, where \(\tilde{c}(\cdot,\cdot)\) includes the covariances involving derivatives. The exact form of \(\tilde{c}(\cdot,\cdot)\) depends on where derivatives are included. The general expression is \(\tilde{c}\{(x_i,d_i),(x_j,d_j)\} = {\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_j)\}\), and we can break it down into three cases:

  • Case 1 is for when \(d_i=d_j=0\) and as such represents the correlation between function output at two points. This is the same as in ThreadCoreGP and is given by:

    \[{\rm Corr}\{\tilde{f}(x_i,0),\tilde{f}(x_j,0)\} = c(x_i,x_j).\]
  • Case 2 is for when \(d_i\ne 0\) and \(d_j=0\) and as such represents the correlation between a derivative and function output. This is obtained by differentiating \(c(\cdot,\cdot)\) w.r.t input \(d_i\):

    \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,0)\} = \frac{\partial c(x_i,x_j)}{\partial x_i^{(d_i)}}, {\rm for}\; d_i\ne 0.\]
  • Case 3 is for when \(d_i\ne 0\) and \(d_j\ne 0\) and as such represents the covariance between two derivatives. This is obtained by differentiating \(c(\cdot,\cdot)\) twice: once w.r.t input \(d_i\) and once w.r.t input \(d_j\):

    \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_j)\} = \frac{\partial^2 c(x_i,x_j)}{\partial x_i^{(d_i)} \partial x_j^{(d_j)}}, {\rm for}\; d_i,d_j\ne0.\]
    • Case 3a. If \(d_i,d_j\ne 0\) and \(d_i=d_j\) we have a special version of Case 3 which gives:

      \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_i)\} = \frac{\partial^2 c(x_i,x_j)}{\partial x_i^{(d_i)} \partial x_j^{(d_i)}}, {\rm for}\; d_i\ne0.\]

\(\tilde{t}(x,d)=\tilde{c}\{\tilde{D},(x,d)\}\), an \(\tilde{n}\times 1\) vector function of \(x\). We have \(d\ne0\) as here we want to emulate derivatives. To emulate function output, \(d=0\); this is covered in ThreadCoreGP, or in ThreadVariantWithDerivatives if we have derivatives in the training data.
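To make the three cases concrete, the sketch below evaluates \(\tilde{c}\{(x_i,d_i),(x_j,d_j)\}\) for one common choice, the Gaussian form \(c(x_i,x_j)=\exp\{-\sum_k (x_i^{(k)}-x_j^{(k)})^2/\delta_k^2\}\), whose derivatives are available in closed form. The page itself places no such restriction on \(c(\cdot,\cdot)\), and the function names are illustrative.

```python
import numpy as np

def c_tilde(xi, di, xj, dj, delta):
    """Correlation between f~(xi, di) and f~(xj, dj), assuming the Gaussian
    correlation c(xi, xj) = exp(-sum_k ((xi_k - xj_k) / delta_k)^2).
    d = 0 denotes function output; d = 1, ..., p denotes d/dx^(d)."""
    diff = xi - xj
    c = np.exp(-np.sum((diff / delta) ** 2))
    if di == 0 and dj == 0:          # Case 1: output vs output
        return c
    if di != 0 and dj == 0:          # Case 2: derivative vs output
        return -2.0 * diff[di - 1] / delta[di - 1] ** 2 * c
    if di == 0 and dj != 0:          # Case 2 transposed: output vs derivative
        return 2.0 * diff[dj - 1] / delta[dj - 1] ** 2 * c
    # Case 3: derivative vs derivative; Case 3a (di == dj) adds an extra term.
    term = -4.0 * diff[di - 1] * diff[dj - 1] / (delta[di - 1] * delta[dj - 1]) ** 2
    if di == dj:
        term += 2.0 / delta[di - 1] ** 2
    return term * c

def A_tilde(X, dvec, delta):
    """n~ x n~ matrix A~ = c~(D~, D~)."""
    n = len(dvec)
    return np.array([[c_tilde(X[i], dvec[i], X[j], dvec[j], delta)
                      for j in range(n)] for i in range(n)])
```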

Then, conditional on \(\theta\) and the training sample, the output vector \(\tilde{f}(x,d)\) is a multivariate GP with posterior mean function

\[\tilde{m}^*(x,d) = \tilde{m}(x,d) + \tilde{t}(x,d)^{\rm T} \tilde{A}^{-1} \tilde{e}\]

and posterior covariance function

\[\tilde{v}^*\{(x_i,d_i),(x_j,d_j)\} = \sigma^2 \{\tilde{c}\{(x_i,d_i),(x_j,d_j)\}-\tilde{t}(x_i,d_i)^{\rm T} \tilde{A}^{-1} \tilde{t}(x_j,d_j) \}.\]

This is the first part of the emulator as discussed in DiscGPBasedEmulator. The emulator is completed by a second part formally comprising the posterior distribution of \(\theta\), which has density given by

\[\pi^*(\beta,\sigma^2,\delta) \propto \pi(\beta,\sigma^2,\delta) \times (\sigma^2)^{-\tilde{n}/2}|\tilde{A}|^{-1/2} \times \exp\{-\tilde{e}^{\rm T}\tilde{A}^{-1}\tilde{e}/(2\sigma^2)\}.\]
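Reusing `c_tilde` and `A_tilde` from the sketch above, the posterior mean and variance at a derivative input \((x,d)\), conditional on \(\theta\), might be computed along these lines; `m_train` (the vector \(\tilde{m}(\tilde{D})\)) and `m_point` (the scalar \(\tilde{m}(x,d)\)) are assumed to be supplied by the chosen mean function, and all names are ours.

```python
import numpy as np

def t_tilde(x, d, X, dvec, delta):
    """n~ x 1 vector t~(x, d) = c~{D~, (x, d)}."""
    return np.array([c_tilde(X[k], dvec[k], x, d, delta)
                     for k in range(len(dvec))])

def posterior_general(x, d, X, dvec, f_tilde, m_train, m_point, sigma2, delta):
    """Posterior mean m~*(x, d) and variance v~*{(x, d), (x, d)},
    conditional on theta = (beta, sigma2, delta); general-case formulas."""
    A = A_tilde(X, dvec, delta)
    e = f_tilde - m_train                     # e~ = f~(D~) - m~(D~)
    t = t_tilde(x, d, X, dvec, delta)
    mean = m_point + t @ np.linalg.solve(A, e)
    var = sigma2 * (c_tilde(x, d, x, d, delta) - t @ np.linalg.solve(A, t))
    return mean, var
```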

For the output vector \(\tilde{f}(x,0)=f(x)\) see the procedure page on building a GP emulator for the core problem (ProcBuildCoreGP) or the procedure page for building a GP emulator when we have derivatives in the training data (ProcBuildWithDerivsGP).

Linear mean and weak prior case

Suppose now that the mean function has the linear form \(m(x) = h(x)^{\rm T}\beta\), where \(h(\cdot)\) is a vector of \(q\) known basis functions of the inputs and \(\beta\) is a \(q\times 1\) column vector of hyperparameters. When \(d\ne0\) we therefore have \(\tilde{m}(x,d) = \tilde{h}(x,d)^{\rm T}\beta\), where \(\tilde{h}(x,d) = \frac{\partial}{\partial x^{(d)}}h(x)\). Suppose also that the prior distribution has the form \(\pi(\beta,\sigma^2,\delta) \propto \sigma^{-2}\pi_\delta(\delta)\), i.e. that we have weak prior information on \(\beta\) and \(\sigma^2\) and an arbitrary prior distribution \(\pi_\delta(\cdot)\) for \(\delta\).

Define \(\tilde{A}\) and \(\tilde{t}(x,d)\) as in the previous case. In addition, define the \(\tilde{n} \times q\) matrix

\[\tilde{H}=[\tilde{h}(x_1,d_1),\ldots,\tilde{h} (x_{\tilde{n}},d_{\tilde{n}})]^{\rm T},\]

the vector

\[\widehat{\beta}=\left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1}\tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{f}(\tilde{D})\]

and the scalar

\[\widehat\sigma^2 = (\tilde{n}-q-2)^{-1}\tilde{f}(\tilde{D})^{\rm T}\left\{\tilde{A}^{-1} - \tilde{A}^{-1} \tilde{H}\left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1}\tilde{H}^{\rm T}\tilde{A}^{-1}\right\} \tilde{f}(\tilde{D}).\]
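A direct transcription of \(\widehat\beta\) and \(\widehat\sigma^2\) might look as follows (a sketch reusing `A_tilde` above). It exploits the fact that the bracketed quadratic form in \(\widehat\sigma^2\) equals \((\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)^{\rm T}\tilde{A}^{-1}(\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)\), which is cheaper and more stable to evaluate.

```python
import numpy as np

def gls_estimates(A, H, f_tilde):
    """beta-hat and sigma2-hat for the linear mean, weak prior case."""
    n, q = H.shape
    Ainv_H = np.linalg.solve(A, H)                        # A~^{-1} H~
    beta_hat = np.linalg.solve(H.T @ Ainv_H, Ainv_H.T @ f_tilde)
    resid = f_tilde - H @ beta_hat
    # Equal to the bracketed quadratic form in the sigma2-hat formula above.
    sigma2_hat = resid @ np.linalg.solve(A, resid) / (n - q - 2)
    return beta_hat, sigma2_hat
```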

Then, conditional on \(\delta\) and the training sample, the output vector \(\tilde{f}(x,d)\) is a t process with \(b^*=\tilde{n}-q\) degrees of freedom, posterior mean function

\[\tilde{m}^*(x,d) = \tilde{h}(x,d)^{\rm T}\widehat\beta + \tilde{t}(x,d)^{\rm T} \tilde{A}^{-1} (\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)\]

and posterior covariance function

\[\begin{split}\tilde{v}^*\{(x_i,d_i),(x_j,d_j)\} = & \widehat\sigma^2\{\tilde{c}\{(x_i,d_i),(x_j,d_j)\} - \tilde{t}(x_i,d_i)^{\rm T} \tilde{A}^{-1} \tilde{t}(x_j,d_j) \\ & + \left( \tilde{h}(x_i,d_i)^{\rm T} - \tilde{t}(x_i,d_i)^{\rm T} \tilde{A}^{-1}\tilde{H} \right) \left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1} \left( \tilde{h}(x_j,d_j)^{\rm T} - \tilde{t}(x_j,d_j)^{\rm T} \tilde{A}^{-1}\tilde{H} \right)^{\rm T} \}.\end{split}\]

This is the first part of the emulator as discussed in DiscGPBasedEmulator. The emulator is formally completed by a second part comprising the posterior distribution of \(\delta\), which has density given by

\[\pi_\delta^*(\delta) \propto \pi_\delta(\delta) \times (\widehat\sigma^2)^{-(\tilde{n}-q)/2}|\tilde{A}|^{-1/2}| \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}|^{-1/2}.\]

In order to derive the sample representation of this posterior distribution for the second part of the emulator, three approaches can be considered.

  1. Exact computations require a sample from the posterior distribution of \(\delta\). This can be obtained by MCMC.
  2. A common approximation is simply to fix \(\delta\) at a single value estimated from the posterior distribution. The usual choice is the posterior mode, the value of \(\delta\) at which \(\pi_\delta^*(\delta)\) is maximised (as sketched after this list). See the alternatives page AltEstimateDelta for a discussion of alternative estimators.
  3. An intermediate approach first approximates the posterior distribution by a multivariate lognormal distribution and then uses a sample from this distribution; this is described in the procedure page ProcApproxDeltaPosterior.

Each of these approaches results in a set of values (or just a single value in the case of the second approach) of \(\delta\), which allow the emulator predictions and other required inferences to be computed.
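As an illustration of approach 2, the sketch below minimises the negative logarithm of \(\pi_\delta^*(\delta)\), assuming a flat prior \(\pi_\delta\) and optimising over \(\log\delta\) so the correlation lengths stay positive; it reuses `A_tilde` and `gls_estimates` from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post_delta(log_delta, X, dvec, f_tilde, H):
    """-log pi*_delta(delta), up to an additive constant, with flat pi_delta."""
    delta = np.exp(log_delta)
    n, q = H.shape
    A = A_tilde(X, dvec, delta)
    _, logdet_A = np.linalg.slogdet(A)
    K = H.T @ np.linalg.solve(A, H)                 # H~^T A~^{-1} H~
    _, logdet_K = np.linalg.slogdet(K)
    _, sigma2_hat = gls_estimates(A, H, f_tilde)
    return 0.5 * ((n - q) * np.log(sigma2_hat) + logdet_A + logdet_K)

# Posterior mode (a single-value representation of delta), p = number of inputs:
# res = minimize(neg_log_post_delta, x0=np.zeros(p), args=(X, dvec, f_tilde, H))
# delta_hat = np.exp(res.x)
```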

Although it represents an approximation that ignores the uncertainty in \(\delta\), approach 2 has been widely used. It has often been suggested that, although uncertainty in these correlation hyperparameters can be substantial, taking proper account of that uncertainty through approach 1 does not lead to appreciable differences in the resulting emulator. On the other hand, this may be true only if a good single estimate for \(\delta\) is used; such an estimate is not necessarily easy to find, and the posterior mode may sometimes be a poor choice. Approach 3 has not been used much, but can be recommended when there is concern about using just a single \(\delta\) estimate. It is simpler than the full MCMC approach 1, but should capture the uncertainty in \(\delta\) well.

Additional Comments

We can use this procedure to emulate derivatives whether or not we have derivatives in the training data. The quantities \(\tilde{A}\), \(\tilde{H}\), \(\tilde{f}(\tilde{D})\), \(\tilde{m}(\tilde{D})\), and therefore \(\tilde{e}\), above are taken from ProcBuildWithDerivsGP, as they allow for derivatives in the training data in addition to function output. When we build an emulator from function output only, \(d_k=0\) for all the training data and these quantities reduce to the corresponding quantities without the tilde symbol (\(\tilde{}\)), as defined in ProcBuildCoreGP. Then to emulate derivatives in the general case, conditional on \(\theta\) and the training sample, the output vector \(\tilde{f}(x,d)\) is a multivariate GP with posterior mean function

\[\tilde{m}^*(x,d) = \tilde{m}(x,d) + \tilde{t}(x,d)^{\rm T} A^{-1} e\]

and posterior covariance function

\[\tilde{v}^*\{(x_i,d_i),(x_j,d_j)\} = \sigma^2 \{\tilde{c}\{(x_i,d_i),(x_j,d_j)\}-\tilde{t}(x_i,d_i)^{\rm T} A^{-1} \tilde{t}(x_j,d_j) \}.\]

To emulate derivatives in the case of a linear mean and weak prior, conditional on \(\delta\) and the training sample, the output vector \(\tilde{f}(x,d)\) is a t process with \(b^*=n-q\) degrees of freedom, posterior mean function

\[\tilde{m}^*(x,d) = \tilde{h}(x,d)^{\rm T}\widehat\beta + \tilde{t}(x,d)^{\rm T} A^{-1} (f(D)-H\widehat\beta)\]

and posterior covariance function

\[\begin{split}\tilde{v}^*\{(x_i,d_i),(x_j,d_j)\} = & \widehat\sigma^2\{\tilde{c}\{(x_i,d_i),(x_j,d_j)\} - \tilde{t}(x_i,d_i)^{\rm T} A^{-1} \tilde{t}(x_j,d_j) \\ & + \left( \tilde{h}(x_i,d_i)^{\rm T} - \tilde{t}(x_i,d_i)^{\rm T} A^{-1}H \right) \left( H^{\rm T} A^{-1} H\right)^{-1} \left( \tilde{h}(x_j,d_j)^{\rm T} - \tilde{t}(x_j,d_j)^{\rm T} A^{-1}H \right)^{\rm T} \}.\end{split}\]
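Putting the pieces together, here is a sketch of the linear mean, weak prior posterior mean and variance at \((x,d)\) with \(d\ne0\), reusing the helpers from the earlier sketches; `h_point` stands for \(\tilde{h}(x,d)=\frac{\partial}{\partial x^{(d)}}h(x)\), which must be derived from the chosen basis \(h(\cdot)\). With function output only in the training data, `dvec` is all zeros and the computed matrices are exactly the untilded \(A\) and \(H\) in the formulas above.

```python
import numpy as np

def posterior_linear(x, d, X, dvec, f_tilde, H, h_point, delta):
    """Posterior mean m~*(x, d) and variance v~*{(x, d), (x, d)},
    conditional on delta; linear mean and weak prior case."""
    A = A_tilde(X, dvec, delta)
    beta_hat, sigma2_hat = gls_estimates(A, H, f_tilde)
    t = t_tilde(x, d, X, dvec, delta)
    resid = f_tilde - H @ beta_hat
    mean = h_point @ beta_hat + t @ np.linalg.solve(A, resid)
    Ainv_H = np.linalg.solve(A, H)
    r = h_point - t @ Ainv_H       # h~(x,d)^T - t~(x,d)^T A^{-1} H
    var = sigma2_hat * (c_tilde(x, d, x, d, delta)
                        - t @ np.linalg.solve(A, t)
                        + r @ np.linalg.solve(H.T @ Ainv_H, r))
    return mean, var

# Example usage with the basis h(x) = (1, x)^T, so that for d >= 1 the vector
# h~(x, d) is the d-th unit row (hypothetical names, p = number of inputs):
# H = np.column_stack([np.ones(len(X)), X]); h_point = np.eye(1 + p)[d]
# mean, var = posterior_linear(x_new, 1, X, dvec, f_tilde, H, h_point, delta)
```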