Alternatives: Optimal design criteria

Overview

Several toolkit operations call for a set of points to be specified in the simulator's space of inputs, and the choice of such a set is called a design. For instance, to create an emulator of the simulator we need a set of points, the training sample, at which the simulator is to be run to provide the data from which the emulator is built.

Optimal design seeks a design which maximises or minimises a suitable criterion relating to the efficiency of the experiment in terms of estimating parameters, prediction and so on.

In mainstream optimal design the criterion is based on some function of the variance-covariance matrix \(\sigma^2(X^TX)^{-1}\) from a regression model \(Y = X \beta + \varepsilon\). It is heavily model based, since \(X\) depends on the model. Recall that the regression is built on a vector of functions \(f(x)\), so that the mean at a particular \(x\) is \(\eta(x) = E(Y_x) = \sum_j \beta_j f_j(x)= \beta^T f(x)\). Then \(X\) is the matrix with entries \(X_{ij} = f_j(x_i)\), where \(x_1, \cdots, x_n\) is the design.
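
As a small concrete illustration (not drawn from the toolkit), the following sketch builds the design matrix \(X\) row by row from an assumed quadratic basis \(f(x) = (1, x, x^2)\) and an arbitrary five-point design, and evaluates \(\sigma^2(X^TX)^{-1}\):

```python
# Minimal sketch: design matrix X for an assumed quadratic regression on a 1-d
# design, and the variance-covariance matrix sigma^2 (X^T X)^{-1} of the least
# squares estimator. Basis, design and sigma^2 are illustrative choices.
import numpy as np

def basis(x):
    """Vector of regression functions f(x) for a single scalar input x."""
    return np.array([1.0, x, x**2])

def design_matrix(design):
    """Rows are f(x_i) for each design point x_i."""
    return np.vstack([basis(x) for x in design])

design = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # n = 5 points in [-1, 1]
sigma2 = 1.0                                     # assumed error variance

X = design_matrix(design)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)       # sigma^2 (X^T X)^{-1}
print(cov_beta)
```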

Below, a term like \(\min_{design}\) means the minimum over some class of designs, e.g. all designs of sample size \(n\) with design points \(x_1, \cdots, x_n\) in a design space \(\mathcal X\). We may standardise the design space to be, for example, \([-1, 1]^d\), where \(d\) is the number of scalar inputs, but we can have very non-standard design spaces.

For details about optimal design algorithms see AltOptimalDesignAlgorithms.

Choosing the Alternatives

Each of the criteria listed below serves a specific use, and some comments are made after each.

The Nature of the Alternatives

Classical Optimality Criteria

D-optimality

D-optimality is used to choose a design that minimises the generalised variance of the estimators of the model parameters, namely \(\min_{design} \det((X^TX)^{-1}) \Leftrightarrow \max_{design} \det (X^TX)\). This is perhaps the most famous of the optimal design criteria, and has some elegant theory associated with it. Because the determinant is the product of the eigenvalues, the criterion is sensitive to \(X\), or equivalently \(X^TX\), being close to rank deficient; consequently it guards well against this possibility: a small eigenvalue implies that the model is close to being “collinear”, i.e. \(X\) is not of full rank. The criterion has the nice property that a linear reparametrisation (such as shifting and rescaling the parameters) has no effect on which design is optimal.
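
To make this concrete, here is a minimal sketch (illustrative basis and designs, not a toolkit routine) that evaluates the D-criterion \(\det(X^TX)\) for two candidate designs; the clustered design is close to collinear and its determinant collapses accordingly:

```python
# Sketch: comparing det(X^T X) for two candidate 5-point designs under the
# illustrative quadratic basis (larger determinant = better under D-optimality).
import numpy as np

def basis(x):
    return np.array([1.0, x, x**2])

def d_criterion(design):
    X = np.vstack([basis(x) for x in design])
    return np.linalg.det(X.T @ X)

clustered = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])   # points bunched together
spread    = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])   # points pushed to the ends

print(d_criterion(clustered))   # near-collinear columns: tiny determinant
print(d_criterion(spread))      # much larger determinant
```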

G-optimality

This criterion is used to find the design that minimises the maximum, over the design space, of the variance of the prediction of the mean \(\eta(x)\) at a point \(x\). This variance is equal to \(\sigma^2 f(x)^T (X^TX)^{-1} f(x)\) and therefore \(G\)-optimality means: minimise the maximum value of this variance: \(\min_{design} \max_{x \in \mathcal X} f(x)^T(X^TX)^{-1}f(x)\). The General Equivalence Theorem of Kiefer and Wolfowitz says that \(D\)- and \(G\)-optimality are equivalent (under a general definition of a design as a mass function and over the same design space).
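
A brute-force sketch of the G-criterion, evaluating the standardised prediction variance on a grid over an assumed design space \([-1, 1]\) (basis, designs and grid are illustrative):

```python
# Sketch: G-criterion by brute force, taking the maximum of the standardised
# prediction variance f(x)^T (X^T X)^{-1} f(x) over a grid of x values.
import numpy as np

def basis(x):
    return np.array([1.0, x, x**2])

def g_criterion(design, grid):
    X = np.vstack([basis(x) for x in design])
    M_inv = np.linalg.inv(X.T @ X)
    return max(basis(x) @ M_inv @ basis(x) for x in grid)

grid = np.linspace(-1.0, 1.0, 201)
print(g_criterion(np.array([-1.0, -1.0, 0.0, 1.0, 1.0]), grid))
print(g_criterion(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]), grid))
```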

E-optimality

The E-optimality criterion is used when we are interested in estimating any (normalised) linear function of the parameters. It guards against the worst such value: \(\min_{design} \max_j \lambda_j((X^TX)^{-1})\). As such it is even stronger than D-optimality in guarding against collinearity: large eigenvalues of \((X^TX)^{-1}\) occur when \(X\) is close to singularity.

Trace optimality

This is used when the aim of the experiment is to minimise the sum of the variances of the least squares estimators of the parameters: \(\min_{design} {\rm Tr}\, (X^TX)^{-1}\). It is one of the easiest criteria conceptually, but suffers from not being invariant under linear reparametrisations (unlike D-optimality). It is more appropriate when the parameters have clear meanings, e.g. as the effect of a particular factor.

A-optimality

It is used when the aim of the experiment is to estimate more than one special linear function of the parameters, e.g. \(K^T\beta\). The least squares estimate has \({\rm Var}[K^T \hat{\beta}] = \sigma^2 K^T (X^TX)^{-1}K\). Using the trace criterion, this has trace \(\sigma^2 {\rm Tr}\, (X^TX)^{-1}A\) with \(A=KK^T\). Care has to be taken because the criterion is not invariant with respect to reparametrisation, including rescaling, of the parameters, and because \(A\) affects the importance given to different parameters.

c-optimality

This criterion minimises the variance of a particular linear function of the parameters, \(\phi = \sum_j c_j \beta_j = c^T\beta\): \(\min_{design} c^T (X^TX)^{-1}c\). It is a special case of A-optimality and has a nice geometric theory of its own. It was one of the earliest criteria to be studied.
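
The remaining classical criteria above (E-, trace-, A- and c-optimality) are equally direct to evaluate for a given design matrix; the sketch below collects them as functions, with \(K\) and \(c\) as illustrative placeholders chosen by the analyst:

```python
# Sketch: the E-, trace-, A- and c-criteria as functions of the design matrix X
# (all to be minimised over the design). K, A = K K^T and c are illustrative.
import numpy as np

def e_criterion(X):
    """Largest eigenvalue of (X^T X)^{-1}."""
    return np.linalg.eigvalsh(np.linalg.inv(X.T @ X)).max()

def trace_criterion(X):
    """Tr (X^T X)^{-1}: sum of the estimator variances (taking sigma^2 = 1)."""
    return np.trace(np.linalg.inv(X.T @ X))

def a_criterion(X, K):
    """Tr (X^T X)^{-1} A with A = K K^T, for the linear functions K^T beta."""
    A = K @ K.T
    return np.trace(np.linalg.inv(X.T @ X) @ A)

def c_criterion(X, c):
    """Variance (up to sigma^2) of the single linear function c^T beta."""
    return c @ np.linalg.inv(X.T @ X) @ c

X = np.vstack([[1.0, x, x**2] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)])
c = np.array([0.0, 1.0, 0.0])     # interest in the linear coefficient only
K = np.eye(3)[:, :2]              # two of the three parameters
print(e_criterion(X), trace_criterion(X), a_criterion(X, K), c_criterion(X, c))
```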

Bayesian optimal experimental design

There is an important general approach to design in the Bayesian setting. The choice of experimental design is a decision which may be made after the specification of any prior distribution on the parameters in a model but before the observations are made (outputs are measured); in this sense it is a pre-posterior decision. Suppose that \(\beta\) refers to the unknown quantities of interest; these need not only be the model parameters but may include predicted outputs at special points. Let \(\phi(\pi(\beta|Y))\) be a functional on the posterior distribution \(\pi(\beta|Y)\) (after the experiment is conducted) which measures some feature of the posterior, such as its peakedness. In the case that \(\phi = E(L(\hat{\beta}, \beta))\) for some estimator \(\hat{\beta}\) and loss function \(L\), \(\phi\) is the posterior risk of \(\hat{\beta}\).

If \(\phi = \min_{\hat{\beta}} E(L(\hat{\beta}, \beta))\) then \(\hat{\beta}\) is the Bayes estimator with respect to \(L\) and we have the (posterior) Bayes risk. This is ideal from the Bayes standpoint although it may be computationally easier to use a simpler non-Bayes estimator but still compute the Bayes risk.

The full Bayesian optimal design criterion with respect to \(\phi\) is \(\min_{design} E_Y\, \phi(\pi(\beta|Y)),\) where \(Y\) is the output generated by the experiment. Here, of course, the distribution of \(Y\) is affected by the design. In areas such as non-linear regression one may be able to compute a locally optimal design using a classical estimator such as the maximum likelihood estimator. In such a case the \(\phi\) value may depend on the unknown \(\beta\): \(\phi(\beta)\). An approximate Bayesian criterion is then \(\min_{design} E_{\beta}(\phi(\beta)),\) where the expectation is with respect to the prior distribution \(\pi(\beta)\) on \(\beta\). The full Bayes criterion (which is computationally harder) and this approximate Bayes criterion can give similar solutions.

There are Bayesian analogues of all the classical optimal design criteria listed above. The idea is to replace the variance matrix of the least squares estimator of the regression parameter \(\beta\) by the posterior variance matrix \({\rm Var}[\beta |Y ]\). Thus, if we take the standard regression model \(Y = X \beta + \varepsilon\) with \(\varepsilon \sim N(0, \sigma^2 I)\) and prior \(\beta \sim N(\mu, \Gamma)\), then \({\rm Var}[\beta |Y ] = (\sigma^{-2}X^TX + \Gamma^{-1})^{-1}\).
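
A minimal sketch of this posterior variance calculation, with illustrative values for \(\sigma^2\) and \(\Gamma\):

```python
# Sketch: posterior variance of beta for the linear model with noise variance
# sigma^2 and prior Var(beta) = Gamma (all numerical values illustrative).
import numpy as np

X = np.vstack([[1.0, x, x**2] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)])
sigma2 = 0.5
Gamma = np.diag([10.0, 10.0, 10.0])    # fairly vague prior on beta

post_var = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Gamma))
print(post_var)                        # does not depend on the observed Y
```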

Important note: in this simple case the posterior covariance does not depend on the observed data, so the prior expectation \(E_Y\) in the Bayes rule for design is not needed.

Bayesian Information criteria

The information-theoretic approach to experimental design goes some way towards being an all-purpose philosophy. It is easiest to explain by taking \(\phi(\pi)\) in the Bayes formulation above to be Shannon entropy. For a general random variable \(X\) with pdf \(p(x)\) this is \({\rm Ent}(X)= - E_X(\log p(X)) = - \int \log(p(x))\, p(x)\, dx\).

Shannon information is \({\rm Inf}(X) = -{\rm Ent}(X)\).

The information approach is to minimise the expected posterior entropy:

\[\min_{design} E_Y {\rm Ent}(\beta|Y)\]

This is often expressed as: maximise the expected information gain \(E_Y(G)\), where \(G={\rm Inf}(\beta|Y) - {\rm Inf}(\beta)\) and the second term on the right hand side is the prior information. An important result says that \(E_Y(G)\) is always non-negative, although in an actual experiment the realised gain \(G\) may be negative. This result can be generalised to a wider class of information criteria which includes the Tsallis entropy \(E_X \left(\frac{p(X)^{\alpha}-1}{\alpha}\right),\;\;\alpha > -1\).
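
For the linear Gaussian model of the previous subsection the expected information gain is available in closed form, because the entropy of a multivariate normal with variance \(\Sigma\) is \(\frac{1}{2}\log\det(2\pi e\,\Sigma)\). The sketch below (illustrative settings, not a toolkit routine) computes the gain as the difference between prior and posterior entropies:

```python
# Sketch: expected Shannon information gain about beta for the linear Gaussian
# model. It reduces to 0.5 * (log det Gamma - log det Var(beta | Y)), which in
# this simple case does not depend on the observed Y.
import numpy as np

def gaussian_entropy(Sigma):
    k = Sigma.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

X = np.vstack([[1.0, x, x**2] for x in (-1.0, -0.5, 0.0, 0.5, 1.0)])
sigma2 = 0.5
Gamma = np.diag([10.0, 10.0, 10.0])

post_var = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Gamma))
gain = gaussian_entropy(Gamma) - gaussian_entropy(post_var)
print(gain)            # non-negative; larger for more informative designs
```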

Maximum Entropy Sampling (MES)

This is a special way of using the Bayesian information criterion for prediction. We exploit an important formula for Shannon entropy which applies to a pair of random variables \((U,V)\):

\[\mbox{Ent}(U,V) = \mbox{Ent}(U) + \mbox{E}_U [\mbox{Ent}(V|U)].\]

Let \(S\) represent a set of candidate points which covers the whole design space well. This could for example be a large factorial design, or a large space-filling design. Let \(s\) be a possible subset of \(S\), to be used as a design. Then the complement of \(s\), namely \(S \setminus s\), can be thought of as the “unsampled” points.

Now partition \(Y\) as \((Y_s,Y_{S \setminus s})\). For prediction we could consider: \(\mbox{Ent}(Y_s,Y_{S \setminus s}) = \mbox{Ent}(Y_s) + \mbox{E}_{Y_s} [\mbox{Ent}(Y_{S \setminus s}|Y_s)]\). The joint entropy on the left hand side is fixed, that is, it does not depend on the choice of design. Therefore, since the Bayes criterion is to minimise, over the choice of design, the second term on the right, it is optimal to maximise the first term on the right. This is simply the entropy of the sample, and typically requires no conditioning computation. For the simple Bayesian Gaussian case we have

\[\max_{design} |R + X \Gamma X^T|,\]

where \(R\) is the variance matrix of the process part of the model, \(X\) is the design matrix for the regression part of the model and \(\Gamma\) is the prior variance of the regression parameters (as above).

We see that \(Y_{S \setminus s}\) plays the role of the unknown parameter in the general Bayes formulation. This is familiar in the Bayesian context under the heading of predictive distributions. To summarise: select the design to achieve minimum entropy (= maximum information) of the joint predictive distribution for all the unsampled points, by maximising the entropy of the sampled points.
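
The following sketch illustrates maximum entropy sampling by a greedy search over subsets of a candidate set, maximising \(\det(R_s + X_s \Gamma X_s^T)\), the determinant of the prior covariance of the sampled outputs. The squared exponential covariance, its parameters, \(\Gamma\) and the candidate grid are all illustrative assumptions, and a greedy search is only one of several possible optimisation strategies:

```python
# Sketch: greedy maximum entropy sampling. Choose n points from a candidate set
# S so as to (approximately) maximise det(R_s + X_s Gamma X_s^T).
import numpy as np

def sq_exp_cov(points, variance=1.0, length=0.3):
    d = points[:, None] - points[None, :]
    return variance * np.exp(-0.5 * (d / length) ** 2)

def prior_cov(candidates, Gamma):
    X = np.vstack([[1.0, x, x**2] for x in candidates])
    return sq_exp_cov(candidates) + X @ Gamma @ X.T

def greedy_mes(candidates, Gamma, n):
    C = prior_cov(candidates, Gamma)
    chosen = []
    for _ in range(n):
        best, best_val = None, -np.inf
        for j in range(len(candidates)):
            if j in chosen:
                continue
            idx = chosen + [j]
            val = np.linalg.slogdet(C[np.ix_(idx, idx)])[1]
            if val > best_val:
                best, best_val = j, val
        chosen.append(best)
    return candidates[chosen]

candidates = np.linspace(-1.0, 1.0, 41)      # large space-filling candidate set S
Gamma = np.diag([10.0, 10.0, 10.0])
print(greedy_mes(candidates, Gamma, n=6))    # the design s, a subset of S
```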

Integrated Mean Squared Error (IMSE)

This criterion aims at minimising the mean squared error of prediction over the unsampled points. As for MES, above, this is based on the predictive distribution for the unsampled points.

The mean squared prediction error (MSPE) under the standard model, at a single point \(x\), is given by

\[\begin{split}\mbox{MSPE}(\hat{Y}(x))=\sigma^2\left[ 1-(f(x)^T \quad r(x)^T)\left[ \begin{array}{cc} 0 & F^T \\ F & R \end{array}\right]^{-1}\left(\begin{array}{c} f(x)\\ r(x) \end{array}\right)\right]\end{split}\]

Here \(f(x)\) is the vector of regression functions, \(F\) is the matrix with rows \(f(x_i)^T\) at the design points, \(R\) is the correlation matrix of the process at the design points and \(r(x)\) is the vector of correlations between the process at \(x\) and at the design points. Several criteria could be based on this quantity as the point \(x\) ranges over the design space \(\mathcal{X}\).

The integrated mean squared prediction error, which is the predictive error averaged over the design space \(\mathcal{X}\), is

\[J(\mathcal{D})=\int_{\mathcal{X}}\frac{\mbox{MSPE}[\hat{Y}(x)]}{\sigma^2}\omega(x)dx\]

where \(\omega(\cdot)\) is a specified non-negative weight function satisfying \(\int_{\mathcal{X}}\omega(x)\, dx = 1\).

This criterion has been favoured by several authors in computer experiments (Sacks et al, 1989).
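
The sketch below implements the MSPE formula above and approximates \(J(\mathcal{D})\) on a grid with a uniform weight \(\omega\) on an assumed design space \([-1, 1]\); the correlation function, its length scale, the basis and the designs are illustrative choices:

```python
# Sketch: MSPE/sigma^2 at a point via the block-matrix formula, and a grid
# approximation to the IMSE criterion J(D) with a uniform weight on [-1, 1].
import numpy as np

def basis(x):
    return np.array([1.0, x, x**2])

def corr(a, b, length=0.3):
    return np.exp(-0.5 * ((a - b) / length) ** 2)

def mspe_over_sigma2(x, design):
    F = np.vstack([basis(xi) for xi in design])
    R = corr(design[:, None], design[None, :])
    f, r = basis(x), corr(x, design)
    p = F.shape[1]
    block = np.block([[np.zeros((p, p)), F.T], [F, R]])
    v = np.concatenate([f, r])
    return 1.0 - v @ np.linalg.solve(block, v)

def imse(design, grid):
    # uniform weight omega: the integral becomes a simple average over the grid
    return np.mean([mspe_over_sigma2(x, design) for x in grid])

grid = np.linspace(-1.0, 1.0, 101)
print(imse(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]), grid))
print(imse(np.array([-0.2, -0.1, 0.0, 0.1, 0.2]), grid))   # clustered: larger IMSE
```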

Other criteria are (i) minimising over the design the maximum MSPE over the unsampled points, and (ii) minimising the largest eigenvalue of the predictive (posterior) covariance matrix. Note that each of the above criteria is the analogue of a classical criterion in which the parameters are replaced by the values of the process at the unsampled points in the candidate set: thus Maximum Entropy Sampling (in the Gaussian case) is the analogue of D-optimality, IMSE is a type of trace-optimality and (i) and (ii) above are types of G- and E-optimality respectively.

Bayesian sequential optimum design

The Bayesian paradigm is very useful in understanding sequential design.

After we have selected a criterion \(\phi\), see above, the first stage design is chosen to achieve \(\min_{design_1} E_{Y_1} \phi(\pi(\beta|Y_1)),\) where \(Y_1\) is the output generated by the first stage experiment.

The (naive/myopic) one-step-ahead method is to take the prior distribution (process) at stage 2 to be the posterior process having observed \(Y_1\), and to proceed to choose the stage 2 design to achieve \(\min_{design_2} E_{Y_2} \phi(\pi(\beta|Y_1,Y_2))\), and so on through further stages, if necessary.
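
As an illustration of the one-step-ahead idea, the sketch below applies it to the linear Gaussian model with the Bayesian trace criterion \({\rm Tr}\,{\rm Var}[\beta|\mbox{data}]\). In this special case the posterior variance does not depend on the observed outputs, so the expectation over \(Y\) drops out and each stage reduces to a search over a candidate set (all settings are illustrative):

```python
# Sketch: myopic one-step-ahead design for the linear Gaussian model using the
# Bayesian trace criterion. Each stage adds the candidate point that most
# reduces Tr Var(beta | data so far).
import numpy as np

def basis(x):
    return np.array([1.0, x, x**2])

def post_trace(points, sigma2, Gamma):
    X = np.vstack([basis(x) for x in points])
    return np.trace(np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Gamma)))

candidates = np.linspace(-1.0, 1.0, 41)
sigma2, Gamma = 0.5, np.diag([10.0, 10.0, 10.0])

design = []
for stage in range(5):
    scores = [post_trace(design + [x], sigma2, Gamma) for x in candidates]
    design.append(candidates[int(np.argmin(scores))])
print(design)
```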

The full sequential rule, which is very difficult to implement, uses the knowledge that one will use an optimal design at future stages to adjust the “risk” at the first stage. For two stages this would give, working backwards: \(R_2= \min_{design_2} E_{Y_2} \phi(\pi(\beta|Y_1,Y_2)),\) and then at the first stage choose the first design to achieve \(\min_{design_1} E_{Y_1}( R_2)\). The general case is essentially dynamic programming (Bellman rule) and with a finite horizon \(N\) the scheme would look like:

\[\min_{design_1} E_{Y_1} \min_{design_2} E_{Y_2} \ldots \min_{design_N} E_{Y_N} \phi(\pi(\beta|Y_1, \ldots, Y_N))\]

This full version of sequential design is very “expensive” because the internal expectations require much numerical integration.

One can perform global optimisation to obtain an optimal design over a design space \(\mathcal{X}\), but it is convenient to use a large candidate set. As mentioned, this candidate set is typically taken to be a full factorial design or a large space-filling design. In block sequential design one may add a block optimally which may itself be a smaller space-filling design.

It is useful to stress again the notation for the design used in the MES method above. Let \(S\) be the candidate set, \(s\) the design points and \(\bar{s} = S \setminus s\) the unsampled points. Then the optimal design problem is an optimal subset problem: choose \(s \subset S\) in an optimal way. This notation helps to describe the important class of exchange algorithms, in which points are exchanged between \(s\) and \(\bar{s}\).
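
A simple exchange algorithm for the D-criterion might look like the following sketch (illustrative basis, candidate set and random start): swap a design point for a candidate point whenever the swap increases \(\det(X^TX)\), and stop when no improving swap exists.

```python
# Sketch: a basic exchange algorithm. Starting from a random n-point subset s
# of the candidate set S, accept any swap between s and S \ s that increases
# log det(X^T X), until no improving swap remains.
import numpy as np

rng = np.random.default_rng(0)

def basis(x):
    return np.array([1.0, x, x**2])

def log_det(design):
    X = np.vstack([basis(x) for x in design])
    return np.linalg.slogdet(X.T @ X)[1]

def exchange(S, n):
    s = list(rng.choice(len(S), size=n, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(len(S)):
                if j in s:
                    continue
                trial = s.copy()
                trial[i] = j
                if log_det(S[trial]) > log_det(S[s]) + 1e-12:
                    s, improved = trial, True
    return S[s]

S = np.linspace(-1.0, 1.0, 21)
print(exchange(S, n=6))    # points concentrate near -1, 0 and 1, as D-optimality suggests
```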

A one-point-at-a-time myopic sequential design based on the MES principle places each new design point at the candidate point with maximum posterior variance given the design to date. See also some discussion in DiscCoreDesign.
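
A sketch of this rule for an assumed zero-mean Gaussian process with a squared exponential covariance (all settings illustrative): each new point is the candidate with the largest posterior variance given the points chosen so far.

```python
# Sketch: one-point-at-a-time MES for a zero-mean Gaussian process. The next
# design point is the candidate with the largest posterior (predictive) variance.
import numpy as np

def sq_exp(a, b, length=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def posterior_var(candidates, chosen):
    prior = np.ones(len(candidates))                 # k(x, x) = 1
    if not chosen:
        return prior
    xs = np.array(chosen)
    K = sq_exp(xs, xs) + 1e-9 * np.eye(len(xs))      # jitter for stability
    k = sq_exp(candidates, xs)
    return prior - np.einsum('ij,ij->i', k @ np.linalg.inv(K), k)

candidates = np.linspace(-1.0, 1.0, 41)
chosen = []
for _ in range(6):
    v = posterior_var(candidates, chosen)
    chosen.append(candidates[int(np.argmax(v))])
print(chosen)      # roughly space-filling: MES spreads the points out
```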