Procedure: Empirical construction of a Bayes linear emulator for the core problem using only simulator runs

Description and Background

Ordinarily, when constructing Bayes linear emulators we rely on prior information to direct the choice of emulator structure, and we combine prior beliefs about the emulator parameters with model evaluations to generate our emulator following the methods of the procedure on building a Bayes linear emulator for the core problem (ProcBuildCoreBL). However, when the computer model is relatively inexpensive to evaluate and prior information is comparatively limited, the emulator choice may instead be made on the basis of a very large collection of simulator evaluations. In particular, this approach is used for the emulation of a fast approximate version of the simulator (see multi-scale emulation).

In this situation, we may make many runs of the simulator, allowing us to develop a preliminary view of the form of the function, and thereby to make preliminary choices of the basis function collection \(\{h_j(x)\}\) and to suggest an informed prior specification for the random quantities that determine the emulator for the simulator \(f(x)\). These choices are supported by diagnostic analysis, for example looking for systematic structure in the emulator residuals.

With such a large number of evaluations of the model, the emulator's global trend can be identified and well estimated from the data alone, without a Bayesian updating step. Were we to carry out a Bayesian analysis at this stage, our prior judgements would be dominated by the large number of model evaluations, so we typically use standard model-fitting techniques instead.

As we are assuming that we have no substantial prior information about the emulators, we would typically evaluate the computer model over a space-filling design on the input space to obtain good coverage and to learn about global variation in \(f(x)\). We assume we start this procedure with a large design (generated by methods such as those discussed in the alternatives page on training sample design for the core problem (AltCoreDesign)), together with the corresponding simulator evaluations at each of these input parameter combinations.
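For example, a simple space-filling design such as a Latin hypercube over the unit cube can be generated directly; this is a minimal numpy sketch, the function name `latin_hypercube` is our own, and any of the schemes discussed in AltCoreDesign could be substituted:

```python
import numpy as np

def latin_hypercube(n, d, rng=None):
    """Simple Latin hypercube sample of n points in the d-dimensional
    unit cube: one point per stratum [i/n, (i+1)/n) in each dimension,
    with the strata independently shuffled across dimensions."""
    rng = np.random.default_rng(rng)
    u = (np.arange(n)[:, None] + rng.random((n, d))) / n
    for j in range(d):
        rng.shuffle(u[:, j])   # shuffle each column in place
    return u

D = latin_hypercube(200, 3, rng=0)   # 200 design points over 3 inputs
```

Each column then covers every interval of width \(1/n\) exactly once, giving good one-dimensional projections as well as overall coverage.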

Inputs

  • Design \(D\) over the input space comprising the input points \(\{x_1,x_2,\dots,x_n\}\)
  • Output vector \(f(D)=(f(x_1),f(x_2),\dots,f(x_n))^T\), where \(f(x_j)\) is the simulator output corresponding to input vector \(x_j\) in \(D\)
  • A collection of potential basis functions, \(h(\cdot)\), for the prior mean function \(m(\cdot)\)

Outputs

  • A collection of appropriate basis functions for the emulator mean function
  • Expectation vector and variance/covariance matrix for regression coefficients \(\beta\)
  • Adjusted residual process and specification of covariance function hyperparameters \((\sigma^2,\delta)\)

Procedure

The general process for the empirical construction of an emulator proceeds in the following four stages:

  1. Determine the active inputs to the emulator
  2. Determine an appropriate subset of basis functions
  3. Estimate emulator regression coefficients \(\beta\)
  4. Estimate residual process hyperparameters \((\sigma^2,\delta)\)

Determine active inputs

If we have chosen to work with active inputs in the mean function, then the first step in constructing the emulator is to identify the subset of inputs, \(x_A\), which drive the majority of global variation in \(f(x)\). For this stage, we require a set of possible basis functions (see the alternatives page on basis functions for the emulator mean (AltBasisFunctions) for details). Given this set of possible regressors and the ample supply of computer evaluations, we can determine the important inputs and model effects by methods such as stepwise fitting.

There are many possible approaches for empirically determining the active variables; the process of identifying them is known as screening. Typically, these methods take the form of model selection and model search problems. A simple approach using backward stepwise regression is:

  1. Fit the emulator mean function using all possible basis functions - this is now the ‘current’ model
  2. For each input in the current model, remove all terms involving that input variable and re-fit the mean function
  3. Compare each of these sub-models with the current model using an appropriate criterion
  4. The most favourable sub-model now becomes the current model
  5. Iterate until an appropriate stopping criterion is satisfied

When the input space is very high-dimensional, a backward stepwise approach may not be viable due to the large number of possible terms in the initial mean function. In these cases, forward selection approaches are more appropriate: we begin with a simple constant as the initial mean function and add terms in active variables at each stage rather than removing them. Given a very large collection of potential inputs, it is helpful, where possible, to start the stepwise search with a sub-collection of inputs suggested by expert knowledge of the physical processes. Other approaches to screening are discussed in the topic thread on screening (ThreadTopicScreening).
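The backward elimination steps above can be sketched in a few lines of numpy. The quadratic basis per input, the use of BIC as the comparison criterion, and the toy simulator are all illustrative assumptions rather than part of the procedure:

```python
import numpy as np

def bic(y, H):
    """BIC of an ordinary least squares fit of y on basis matrix H."""
    n, p = H.shape
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    rss = np.sum((y - H @ beta) ** 2)
    return n * np.log(rss / n) + p * np.log(n)

def backward_screen(X, y):
    """Backward stepwise screening: repeatedly drop the input (all of its
    basis terms at once) whose removal most improves BIC, stopping when no
    removal improves it. Basis: constant + linear + quadratic per input."""
    active = list(range(X.shape[1]))

    def basis(cols):
        terms = [np.ones(len(X))]
        terms += [X[:, j] for j in cols] + [X[:, j] ** 2 for j in cols]
        return np.column_stack(terms)

    current = bic(y, basis(active))
    while active:
        # BIC of each sub-model with one input's terms removed
        trials = [(bic(y, basis([j for j in active if j != k])), k) for k in active]
        best, k = min(trials)
        if best >= current:          # stopping criterion: BIC no longer improves
            break
        current, active = best, [j for j in active if j != k]
    return active

# Toy simulator in which only inputs 0 and 2 are active
rng = np.random.default_rng(1)
X = rng.random((400, 5))
y = 3.0 * X[:, 0] + X[:, 2] ** 2 + 0.01 * rng.standard_normal(400)
active = backward_screen(X, y)       # recovers [0, 2] for this seed
```

A forward version would start `active` empty and add the input whose inclusion most improves the criterion at each step.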

Determine regression basis functions

The procedure for determining an appropriate collection of regression basis functions is closely related to the problem of active input identification. Again, this is a model selection problem where we now have a reduced set of possible basis functions, all of which now only involve the active inputs. We apply the same methods as above, only this time removing or adding single regression terms in order to arrive at an appropriate and parsimonious representation for the simulator output.

Estimate emulator regression coefficients

The next stage is to quantify beliefs about the emulator coefficients \(\beta\). We do so by fitting the mean function determined in the previous stages to the observed simulator runs, yielding estimates of, and associated uncertainty statements about, the coefficients \(\beta\).

There are a variety of ways in which we could fit the mean function to obtain the estimates \(\hat{\beta}\). If we lack any insight into the nature of the correlation structure of the residual process and our design points are well-separated then we could fit the regression model using ordinary least squares (OLS). Alternatively, if we have information about the correlation function and its parameter values then more appropriate estimates could be obtained by using this information and fitting by generalised least squares (GLS). We might use an iterative approach to estimate both the regression coefficients and the correlation function hyperparameters.

The value of \(\textrm{E}[\beta]\) is then taken to be the estimate \(\hat{\beta}\) from the fitting of the regression model and \(\textrm{Var}[\beta]\) is taken to be the variance of the corresponding estimates. With sufficient evaluations in an approximately orthogonal design, the estimation error here is negligible.
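As a concrete sketch, the OLS route can be written directly in numpy. The function name `ols_beliefs` and the toy basis \(\{1, x\}\) are our own illustrative choices, and the GLS variant is only indicated in a comment:

```python
import numpy as np

def ols_beliefs(H, y):
    """OLS fit of outputs y on basis matrix H. Returns the Bayes linear
    specifications E[beta] = beta_hat and Var[beta] = sigma2_hat * (H'H)^{-1},
    together with the residual mean square sigma2_hat."""
    n, p = H.shape
    HtH_inv = np.linalg.inv(H.T @ H)
    beta_hat = HtH_inv @ H.T @ y            # E[beta]
    resid = y - H @ beta_hat
    sigma2_hat = resid @ resid / (n - p)    # residual mean square
    var_beta = sigma2_hat * HtH_inv         # Var[beta]
    return beta_hat, var_beta, sigma2_hat

# If a residual correlation matrix R were known, the GLS analogue would
# replace H'H by H' R^{-1} H and H'y by H' R^{-1} y.

# Toy illustration with basis {1, x} and true coefficients (2, 5)
rng = np.random.default_rng(2)
x = rng.random(50)
H = np.column_stack([np.ones(50), x])
y = 2.0 + 5.0 * x + 0.1 * rng.standard_normal(50)
beta_hat, var_beta, sigma2_hat = ols_beliefs(H, y)
```

With a large, well-spread design the diagonal of `var_beta` is small, reflecting the point made above that the estimation error in \(\beta\) is then negligible.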

Estimate residual process hyperparameters

The final stage is to make assessments for the values of the covariance function hyperparameters \((\sigma^2,\delta)\) in our covariance specifications for the residual process \(w(x)\).

Typically an estimate for \(\sigma^2\) is obtained from fitting the regression model in the form of the residual mean square \(\hat{\sigma}^2\). Estimating correlation function hyperparameters for the emulator residuals can be a more complex task, which is discussed in the alternatives page on estimators of correlation hyperparameters (AltEstimateDelta). A common empirical approach is variogram fitting.
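The variogram route mentioned above can be sketched as follows. The binning scheme, the Gaussian-form model \(\gamma(t)=\sigma^2(1-\exp(-(t/\delta)^2))\), and the grid search over \(\delta\) are all illustrative simplifications, not the recommended estimators of AltEstimateDelta:

```python
import numpy as np

def fit_variogram(X, r, n_bins=15):
    """Estimate (sigma2, delta) from emulator residuals r at inputs X by
    fitting gamma(t) = sigma2 * (1 - exp(-(t/delta)^2)) to an empirical,
    binned semivariogram. Crude grid search; illustrative only."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(X), 1)
    dist, gamma = d[iu], 0.5 * (r[:, None] - r[None, :])[iu] ** 2
    # bin the empirical semivariogram by distance
    edges = np.linspace(0, dist.max(), n_bins + 1)
    idx = np.digitize(dist, edges[1:-1])
    t = np.array([dist[idx == b].mean() for b in range(n_bins) if (idx == b).any()])
    g = np.array([gamma[idx == b].mean() for b in range(n_bins) if (idx == b).any()])
    # grid search over delta; sigma2 by least squares given delta
    best = None
    for delta in np.linspace(0.05, 2.0, 200):
        m = 1.0 - np.exp(-(t / delta) ** 2)
        s2 = (m @ g) / (m @ m)
        sse = ((g - s2 * m) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s2, delta)
    return best[1], best[2]

# Residuals drawn from a process with sigma2 = 2, delta = 0.3 (for illustration)
rng = np.random.default_rng(3)
X = rng.random((120, 1))
d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K = 2.0 * np.exp(-(d / 0.3) ** 2) + 1e-6 * np.eye(120)
r = np.linalg.cholesky(K) @ rng.standard_normal(120)
sigma2_hat, delta_hat = fit_variogram(X, r)
```

The sill of the fitted variogram plays the role of \(\hat{\sigma}^2\) and the range parameter the role of \(\hat{\delta}\); in practice \(\hat{\sigma}^2\) would usually be taken from the regression residual mean square instead.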

Validation and post-emulation tasks

Given the emulator, we can perform similar diagnostics, validation, and post-emulation tasks as described in the thread for Bayes linear emulation for the core model (ThreadCoreBL).
