Procedure: Build Gaussian process emulator with derivative information

Description and Background

The preparation for building a Gaussian process (GP) emulator with derivative information involves defining the prior mean and covariance functions, identifying prior distributions for hyperparameters, creating a design for the training sample, and then running the adjoint (or the simulator together with some other method for obtaining derivatives) at the input configurations specified in the design. All of this is described in the variant thread on emulators with derivative information (ThreadVariantWithDerivatives). The procedure here takes those various ingredients and creates the GP emulator.

Additional notation for this page

Including derivative information requires notation beyond that specified in the Toolkit notation page (MetaNotation). We declare that notation here; it applies only to the derivatives thread variant pages.

  • The tilde symbol (\(\tilde{}\)) placed over a letter denotes derivative information and function output combined.
  • We introduce an extra argument to denote a derivative. We define \(\tilde{f}(x,d)\) to be the derivative of \(f(x)\) with respect to input \(d\), so \(d \in\{0,1,\ldots,p\}\). When \(d=0\) we have \(\tilde{f}(x,0)=f(x)\); for simplicity we then adopt the shorter notation \(f(x)\) rather than \(\tilde{f}(x,0)\).
  • An input is denoted by a superscript on \(x\), while a subscript on \(x\) refers to the point in the input space. For example, \(x_i^{(k)}\) refers to input \(k\) at point \(i\).

Inputs

  • GP prior mean function \(m(\cdot)\), differentiable and depending on hyperparameters \(\beta\)
  • GP prior correlation function \(c(\cdot,\cdot)\), twice differentiable and depending on hyperparameters \(\delta\)
  • Prior distribution \(\pi(\cdot,\cdot,\cdot)\) for \(\beta,\sigma^2\) and \(\delta\) where \(\sigma^2\) is the process variance hyperparameter
  • We require a design. In the core thread ThreadCoreGP the design, \(D\), is an ordered set of points \(D=\{x_1,x_2,\ldots,x_n\}\), where each \(x\) is a location in the input space. Here we need a design which, in addition to specifying the locations of the inputs, also determines at which points we require function output and at which points we require first derivatives. We arrange this information in the design \(\tilde{D} = \{(x_k,d_k): k=1,\ldots,\tilde{n}\}\), where \(d_k \in \{0,1,\ldots,p\}\). Here \(x_k\) is a location in the input space, and \(d_k\) determines whether at \(x_k\) we require the function output (\(d_k=0\)) or a first derivative with respect to input \(d_k\). The \(x_k\) need not be distinct, since we may require both a derivative and the function output at the same point, or derivatives with respect to several inputs at the same point.
  • The output vector \(\tilde{f}(\tilde{D})\), of length \(\tilde{n}\).
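As a concrete illustration, a design of this kind can be held simply as a list of \((x_k,d_k)\) pairs. A minimal sketch in Python; the particular locations and choices of \(d_k\) are of course hypothetical:

```python
import numpy as np

# Hypothetical design for a simulator with p = 2 inputs.
# d_k = 0 requests function output; d_k = j > 0 requests the first
# derivative with respect to input j at that location.
x1 = np.array([0.2, 0.5])
x2 = np.array([0.7, 0.1])
x3 = np.array([0.9, 0.9])

D_tilde = [
    (x1, 0), (x1, 1),   # output and derivative w.r.t. input 1 at x1
    (x2, 0),            # output only at x2
    (x3, 1), (x3, 2),   # derivatives w.r.t. both inputs at x3, no output
]
n_tilde = len(D_tilde)  # n~ = 5, although there are only 3 distinct locations
```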

Outputs

A GP-based emulator in one of the forms discussed in DiscGPBasedEmulator.

In the case of general prior mean and correlation functions and general prior distribution:

  • A GP posterior conditional distribution with mean function \(m^*(\cdot)\) and covariance function \(v^*(\cdot,\cdot)\) conditional on \(\theta=\{\beta,\sigma^2,\delta\}\). If we want to emulate the derivatives rather than the function output, see ProcBuildEmulateDerivsGP.
  • A posterior representation for \(\theta\)

In the case of linear mean function, general correlation function, weak prior information on \(\beta,\sigma^2\) and general prior distribution for \(\delta\):

  • A t process posterior conditional distribution with mean function \(m^*(\cdot)\), covariance function \(v^*(\cdot,\cdot)\) and degrees of freedom \(b^*\) conditional on \(\delta\)
  • A posterior representation for \(\delta\)

As explained in DiscGPBasedEmulator, the “posterior representation” for the hyperparameters is formally the posterior distribution for those hyperparameters, but for computational purposes this distribution is represented by a sample of hyperparameter values. In either case, the outputs define the emulator and allow all necessary computations for tasks such as prediction of the simulator output, uncertainty analysis or sensitivity analysis.

Procedure

General case

We define the following arrays (following the conventions set out in MetaNotation where possible).

\(\tilde{e}=\tilde{f}(\tilde{D})-\tilde{m}(\tilde{D})\), an \(\tilde{n}\times 1\) vector, where \(\tilde{m}(x,0)=m(x)\), and \(\tilde{m}(x,d) = \frac{\partial}{\partial x^{(d)}}m(x)\) if \(d\ne0\).

\(\tilde{A}=\tilde{c}(\tilde{D},\tilde{D}),\) an \(\tilde{n}\times \tilde{n}\) matrix, where the exact form of \(\tilde{c}(\cdot,\cdot)\) depends on where derivatives are included. The general expression is \(\tilde{c}(\cdot,\cdot) = {\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_j)\}\), which breaks down into three cases:

  • Case 1 is for \(d_i=d_j=0\) and so represents the correlation between the function values at two points. This is the same as in ThreadCoreGP and is given by:

    \[{\rm Corr}\{\tilde{f}(x_i,0),\tilde{f}(x_j,0)\} = c(x_i,x_j).\]
  • Case 2 is for \(d_i\ne 0\) and \(d_j=0\) and so represents the correlation between a derivative and a function value. This is obtained by differentiating \(c(\cdot,\cdot)\) w.r.t. input \(d_i\):

    \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,0)\} = \frac{\partial c(x_i,x_j)}{\partial x_i^{(d_i)}}, {\rm for}\; d_i\ne 0.\]
  • Case 3 is for \(d_i\ne 0\) and \(d_j\ne 0\) and so represents the correlation between two derivatives. This is obtained by differentiating \(c(\cdot,\cdot)\) twice: once w.r.t. input \(d_i\) and once w.r.t. input \(d_j\):

    \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_j)\} = \frac{\partial^2 c(x_i,x_j)}{\partial x_i^{(d_i)} \partial x_j^{(d_j)}}, {\rm for}\; d_i,d_j\ne0.\]
    • Case 3a. If \(d_i,d_j\ne 0\) and \(d_i=d_j\) we have a special version of Case 3, which gives:

      \[{\rm Corr}\{\tilde{f}(x_i,d_i),\tilde{f}(x_j,d_i)\} = \frac{\partial^2 c(x_i,x_j)}{\partial x_i^{(d_i)} \partial x_j^{(d_i)}}, {\rm for}\; d_i\ne0.\]
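The three cases above can be collected into a single function. The sketch below assumes the Gaussian (squared-exponential) correlation function \(c(x_i,x_j)=\exp\{-\sum_k (x_i^{(k)}-x_j^{(k)})^2/\delta_k^2\}\), one common choice; any other twice-differentiable correlation function would need its own derivative expressions:

```python
import numpy as np

def corr_tilde(xi, di, xj, dj, delta):
    """Corr{f~(xi, di), f~(xj, dj)} for the Gaussian correlation function
    c(xi, xj) = exp(-sum_k ((xi_k - xj_k) / delta_k)**2).
    d = 0 denotes function output; d = j > 0 denotes d/dx^(j)."""
    r = (xi - xj) / delta**2                       # (xi_k - xj_k) / delta_k^2
    c = np.exp(-np.sum(((xi - xj) / delta)**2))
    if di == 0 and dj == 0:                        # Case 1: output vs output
        return c
    if di != 0 and dj == 0:                        # Case 2: derivative vs output
        return -2.0 * r[di - 1] * c
    if di == 0 and dj != 0:                        # Case 2 transposed
        return 2.0 * r[dj - 1] * c
    # Case 3 (Case 3a when di == dj): derivative vs derivative
    kron = 2.0 / delta[di - 1]**2 if di == dj else 0.0
    return (kron - 4.0 * r[di - 1] * r[dj - 1]) * c
```

Filling \(\tilde{A}\) is then a double loop of `corr_tilde` over the \((x_k,d_k)\) pairs in \(\tilde{D}\); note the result is symmetric in its two \((x,d)\) arguments.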

\(\tilde{t}(x)=\tilde{c}\{\tilde{D},(x,0)\}\), an \(\tilde{n}\times 1\) vector function of \(x\). We have \(d=0\) as here we want to emulate function output. To emulate derivatives, \(d\ne0\) and this is covered in the generic thread on emulating derivatives (ThreadGenericEmulateDerivatives).

Then, conditional on \(\theta\) and the training sample, the output \(\tilde{f}(x,0)=f(x)\) is a GP with posterior mean function

\[m^*(x) = m(x) + \tilde{t}(x)^{\rm T} \tilde{A}^{-1} \tilde{e}\]

and posterior covariance function

\[v^*(x_i,x_j) = \sigma^2 \{c(x_i,x_j)-\tilde{t}(x_i)^{\rm T} \tilde{A}^{-1} \tilde{t}(x_j) \}.\]
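Given \(\tilde{A}\), \(\tilde{e}\) and \(\tilde{t}(x)\), these two formulas amount to a pair of linear solves. A minimal sketch (array names are illustrative, and the arrays are assumed already built):

```python
import numpy as np

def posterior_moments(A_tilde, e_tilde, t_x, t_y, c_xy, m_x, sigma2):
    """m*(x) and v*(x, y), conditional on theta.
    Linear solves are used rather than forming A~^{-1} explicitly,
    which is the numerically preferable route."""
    # m*(x) = m(x) + t~(x)^T A~^{-1} e~
    m_star = m_x + t_x @ np.linalg.solve(A_tilde, e_tilde)
    # v*(x, y) = sigma^2 { c(x, y) - t~(x)^T A~^{-1} t~(y) }
    v_star = sigma2 * (c_xy - t_x @ np.linalg.solve(A_tilde, t_y))
    return m_star, v_star
```

For many prediction points one would factorise \(\tilde{A}\) once (e.g. a Cholesky decomposition) and reuse it, rather than solving afresh per point.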

This is the first part of the emulator as discussed in DiscGPBasedEmulator. The emulator is completed by a second part formally comprising the posterior distribution of \(\theta\), which has density given by

\[\pi^*(\beta,\sigma^2,\delta) \propto \pi(\beta,\sigma^2,\delta) \times (\sigma^2)^{-\tilde{n}/2}|\tilde{A}|^{-1/2} \times \exp\{-\tilde{e}^{\rm T}\tilde{A}^{-1}\tilde{e}/(2\sigma^2)\}.\]
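For sampling or maximisation it is the log of this density that is evaluated in practice. A sketch, up to an additive constant, with `log_prior` standing in for \(\log\pi(\beta,\sigma^2,\delta)\) (both \(\tilde{e}\) and \(\tilde{A}\) depend on the hyperparameters through \(\beta\) and \(\delta\)):

```python
import numpy as np

def log_post_theta(A_tilde, e_tilde, sigma2, log_prior=0.0):
    """log pi*(beta, sigma^2, delta) up to an additive constant."""
    n = len(e_tilde)                               # n~
    _, logdet_A = np.linalg.slogdet(A_tilde)       # log |A~|
    quad = e_tilde @ np.linalg.solve(A_tilde, e_tilde)  # e~^T A~^{-1} e~
    return (log_prior
            - (n / 2) * np.log(sigma2)
            - 0.5 * logdet_A
            - quad / (2 * sigma2))
```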

For the output vector \(\tilde{f}(x,d)\) with \(d\ne0\) see the procedure page on building an emulator of derivatives (ProcBuildEmulateDerivsGP).

Linear mean and weak prior case

Suppose now that the mean function has the linear form \(m(x) = h(x)^{\rm T}\beta\), where \(h(\cdot)\) is a vector of \(q\) known basis functions of the inputs and \(\beta\) is a \(q\times 1\) column vector of hyperparameters. When \(d\ne0\) we therefore have \(\tilde{m}(x,d) = \tilde{h}(x,d)^{\rm T}\beta = \frac{\partial}{\partial x^{(d)}}h(x)^{\rm T}\beta\). Suppose also that the prior distribution has the form \(\pi(\beta,\sigma^2,\delta) \propto \sigma^{-2}\pi_\delta(\delta)\), i.e. that we have weak prior information on \(\beta\) and \(\sigma^2\) and an arbitrary prior distribution \(\pi_\delta(\cdot)\) for \(\delta\).

Define \(\tilde{A}\) and \(\tilde{t}(x)\) as in the previous case. In addition, define the \(\tilde{n} \times q\) matrix

\[\tilde{H}=[\tilde{h}(x_1,d_1),\ldots,\tilde{h}(x_{\tilde{n}},d_{\tilde{n}})]^{\rm T},\]

the vector

\[\widehat{\beta}=\left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1}\tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{f}(\tilde{D})\]

and the scalar

\[\widehat\sigma^2 = (\tilde{n}-q-2)^{-1}\tilde{f}(\tilde{D})^{\rm T}\left\{\tilde{A}^{-1} - \tilde{A}^{-1} \tilde{H}\left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1}\tilde{H}^{\rm T}\tilde{A}^{-1}\right\} \tilde{f}(\tilde{D}).\]

Then, conditional on \(\delta\) and the training sample, the output \(\tilde{f}(x,0)=f(x)\) is a t process with \(b^*=\tilde{n}-q\) degrees of freedom, posterior mean function

\[m^*(x) = h(x)^{\rm T}\widehat\beta + \tilde{t}(x)^{\rm T} \tilde{A}^{-1} (\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)\]

and posterior covariance function

\[v^*(x_i,x_j) = \widehat\sigma^2\{c(x_i,x_j) - \tilde{t}(x_i)^{\rm T} \tilde{A}^{-1} \tilde{t}(x_j) + \left( h(x_i)^{\rm T} - \tilde{t}(x_i)^{\rm T} \tilde{A}^{-1}\tilde{H} \right) \left( \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}\right)^{-1} \left( h(x_j)^{\rm T} - \tilde{t}(x_j)^{\rm T} \tilde{A}^{-1}\tilde{H} \right)^{\rm T} \}.\]
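Putting this case together, \(\widehat\beta\), \(\widehat\sigma^2\), \(m^*(x)\) and \(v^*(x_i,x_j)\) require only a few linear solves against \(\tilde{A}\). A sketch (names illustrative; \(\tilde{A}\), \(\tilde{H}\) and the \(\tilde{t}\) vectors are assumed already built):

```python
import numpy as np

def t_process_moments(A_tilde, H_tilde, f_tilde, h_x, h_y, t_x, t_y, c_xy):
    """Posterior mean m*(x) and covariance v*(x, y) of the t process,
    conditional on delta, in the linear-mean / weak-prior case."""
    n, q = H_tilde.shape
    Ainv_H = np.linalg.solve(A_tilde, H_tilde)     # A~^{-1} H~
    Ainv_f = np.linalg.solve(A_tilde, f_tilde)     # A~^{-1} f~(D~)
    K = H_tilde.T @ Ainv_H                         # H~^T A~^{-1} H~
    beta_hat = np.linalg.solve(K, H_tilde.T @ Ainv_f)
    resid = f_tilde - H_tilde @ beta_hat           # f~(D~) - H~ beta-hat
    sigma2_hat = (resid @ np.linalg.solve(A_tilde, resid)) / (n - q - 2)
    # m*(x) = h(x)^T beta-hat + t~(x)^T A~^{-1} (f~(D~) - H~ beta-hat)
    m_star = h_x @ beta_hat + t_x @ np.linalg.solve(A_tilde, resid)
    rx = h_x - t_x @ Ainv_H                        # h(x)^T - t~(x)^T A~^{-1} H~
    ry = h_y - t_y @ Ainv_H
    v_star = sigma2_hat * (c_xy
                           - t_x @ np.linalg.solve(A_tilde, t_y)
                           + rx @ np.linalg.solve(K, ry))
    return m_star, v_star
```

Here \(\widehat\sigma^2\) is computed from the residual quadratic form \((\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)^{\rm T}\tilde{A}^{-1}(\tilde{f}(\tilde{D})-\tilde{H}\widehat\beta)\), which is algebraically identical to the expression above.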

This is the first part of the emulator as discussed in DiscGPBasedEmulator. The emulator is formally completed by a second part comprising the posterior distribution of \(\delta\), which has density given by

\[\pi_\delta^*(\delta) \propto \pi_\delta(\delta) \times (\widehat\sigma^2)^{-(\tilde{n}-q)/2}|\tilde{A}|^{-1/2}| \tilde{H}^{\rm T} \tilde{A}^{-1} \tilde{H}|^{-1/2}.\]
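For a given \(\delta\) (which determines \(\tilde{A}\)), this density can be evaluated up to a constant. A sketch assuming a flat \(\pi_\delta\); maximising it over \(\delta\) gives the posterior mode used in approach 2 below:

```python
import numpy as np

def log_post_delta(A_tilde, H_tilde, f_tilde):
    """log pi*_delta(delta) up to an additive constant, for flat pi_delta,
    evaluated at the A~ implied by a given delta."""
    n, q = H_tilde.shape
    _, logdet_A = np.linalg.slogdet(A_tilde)       # log |A~|
    Ainv_H = np.linalg.solve(A_tilde, H_tilde)
    K = H_tilde.T @ Ainv_H                         # H~^T A~^{-1} H~
    _, logdet_K = np.linalg.slogdet(K)
    beta_hat = np.linalg.solve(K, Ainv_H.T @ f_tilde)
    resid = f_tilde - H_tilde @ beta_hat
    sigma2_hat = (resid @ np.linalg.solve(A_tilde, resid)) / (n - q - 2)
    return (-(n - q) / 2 * np.log(sigma2_hat)
            - 0.5 * logdet_A
            - 0.5 * logdet_K)
```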

In order to derive the sample representation of this posterior distribution for the second part of the emulator, three approaches can be considered.

  1. Exact computations require a sample from the posterior distribution of \(\delta\). This can be obtained by MCMC; a suitable reference can be found below.
  2. A common approximation is simply to fix \(\delta\) at a single value estimated from the posterior distribution. The usual choice is the posterior mode, which can be found as the value of \(\delta\) for which \(\pi^*(\delta)\) is maximised. See the alternatives page on estimators of correlation hyperparameters (AltEstimateDelta).
  3. An intermediate approach first approximates the posterior distribution by a multivariate lognormal distribution and then uses a sample from this distribution, as described in the procedure page ProcApproxDeltaPosterior.

Each of these approaches results in a set of values (or just a single value in the case of the second approach) of \(\delta\), which allow the emulator predictions and other required inferences to be computed.

Although it represents an approximation that ignores the uncertainty in \(\delta\), approach 2 has been widely used. It has often been suggested that, although uncertainty in these correlation hyperparameters can be substantial, taking proper account of that uncertainty through approach 1 does not lead to appreciable differences in the resulting emulator. On the other hand, although this may be true if a good single estimate for \(\delta\) is used, this is not necessarily easy to find, and the posterior mode may sometimes be a poor choice. Approach 3 has not been used much, but can be recommended when there is concern about using just a single \(\delta\) estimate. It is simpler than the full MCMC approach 1, but should capture the uncertainty in \(\delta\) well.

References

Morris, M. D., Mitchell, T. J. and Ylvisaker, D. (1993). Bayesian design and analysis of computer experiments: Use of derivatives in surface prediction. Technometrics, 35, 243-255.