.. _DiscPostModeDelta:

Discussion: Finding the posterior mode of correlation lengths
=============================================================

Description and Background
--------------------------

This page discusses some details involved in the maximisation of the posterior distribution of the correlation lengths :math:`\delta`. The details discussed are the logarithmic transformation of the correlation lengths and the incorporation of derivatives of the posterior distribution in the optimisation process.

Discussion
----------

Logarithmic transformation
~~~~~~~~~~~~~~~~~~~~~~~~~~

Applying a logarithmic transformation to the correlation lengths can help in finding the posterior mode, because it turns a constrained optimisation problem into an unconstrained one. The transformation that can be applied is

.. math::
   \tau = 2\ln(\delta).

Additionally, it is common to maximise the logarithm of the posterior distribution, rather than the posterior itself. Therefore, the function that has to be maximised is

.. math::
   g(\tau) = \ln(\pi^*_{\delta}(\exp(\tau/2))).

Substituting :math:`\pi^*_{\delta}(\delta)` from the procedure page on building a Gaussian Process emulator for the core problem (:ref:`ProcBuildCoreGP`), we have

.. math::
   \displaystyle g(\tau) \propto \ln(\pi_{\delta}(\exp(\tau/2))) - \frac{n-q}{2}\ln(\hat{\sigma}^2) - 0.5\ln|A| - 0.5\ln|H^{\rm T}A^{-1}H|.

In the absence of any prior information about the correlation lengths, the prior distribution :math:`\pi_{\delta}(\delta)` can be set to a constant, in which case the entire term :math:`\ln(\pi_{\delta}(\exp(\tau/2)))` can be ignored in the optimisation.

We should also note that the Jacobian of the transformation :math:`\tau = 2\ln(\delta)` is not taken into account. This implies that :math:`\exp(g(\tau))` cannot strictly be viewed as a probability density function; on the other hand, it means that the mode of :math:`g(\tau)` is located at exactly the same position as the mode of :math:`\pi^*_{\delta}(\delta)`, for :math:`\delta = \exp(\tau/2)`.

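As an illustration, the following sketch evaluates :math:`-g(\tau)` (up to an additive constant) for a uniform correlation length prior. It assumes the Gaussian form of the correlation function and the expression for :math:`\hat{\sigma}^2` given in :ref:`ProcBuildCoreGP`; the function name and signature are purely illustrative.

.. code-block:: python

   import numpy as np

   def neg_log_posterior(tau, D, f, H):
       """Return -g(tau), up to an additive constant, for a uniform prior on delta.

       tau : (p,)   vector of transformed correlation lengths, tau = 2*ln(delta)
       D   : (n, p) design matrix of input points
       f   : (n,)   vector of simulator outputs f(D)
       H   : (n, q) regression matrix
       """
       n, p = D.shape
       q = H.shape[1]
       delta2 = np.exp(tau)                              # delta_k^2 = exp(tau_k)

       # Correlation matrix A with the Gaussian form used in ProcBuildCoreGP:
       # A_ij = exp{ -sum_k (x_{k,i} - x_{k,j})^2 / delta_k^2 }
       diff2 = (D[:, None, :] - D[None, :, :]) ** 2      # (n, n, p)
       A = np.exp(-np.sum(diff2 / delta2, axis=2))

       Ainv_H = np.linalg.solve(A, H)
       Ainv_f = np.linalg.solve(A, f)
       K = H.T @ Ainv_H                                  # H^T A^{-1} H

       # sigma^2 hat = (n-q-2)^{-1} f^T {A^{-1} - A^{-1} H K^{-1} H^T A^{-1}} f
       beta = np.linalg.solve(K, H.T @ Ainv_f)
       sigma2_hat = (f @ Ainv_f - (H.T @ Ainv_f) @ beta) / (n - q - 2)

       # g(tau) up to a constant (the uniform prior term is dropped)
       _, logdet_A = np.linalg.slogdet(A)
       _, logdet_K = np.linalg.slogdet(K)
       g = -0.5 * (n - q) * np.log(sigma2_hat) - 0.5 * logdet_A - 0.5 * logdet_K
       return -g
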
Optimisation using derivatives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although the function :math:`g(\tau)` can be optimised using a derivative-free algorithm, the existence of closed form expressions for the first two derivatives of :math:`g(\tau)` can help speed up the optimisation. In the following, we give the first two derivatives of :math:`g(\tau)`, assuming a uniform correlation length prior (i.e. :math:`\pi_{\delta}(\delta) \propto const.`), which implies that its first two derivatives are zero. If a different prior is used for :math:`\delta`, its first two derivatives need to be calculated and added to the following expressions.

The first derivative of :math:`g(\tau)` is a :math:`p \times 1` vector, whose k-th element is:

.. math::
   \displaystyle \frac{\partial g(\tau)}{\partial \tau_k} = -\frac{1-n+q}{2} {\rm tr}\left[P\frac{\partial A}{\partial \tau_k}\right] - \frac{n-q}{2}{\rm tr}\left[R\frac{\partial A}{\partial \tau_k}\right]

The second derivative is a :math:`p\times p` matrix, whose (k,l)-th entry is

.. math::
   \displaystyle \frac{\partial^2 g(\tau)}{\partial \tau_l \partial \tau_k} = -\frac{1-n+q}{2}{\rm tr}\left[P\frac{\partial^2 A}{\partial \tau_l \partial \tau_k} - P\frac{\partial A}{\partial \tau_l}P\frac{\partial A}{\partial \tau_k}\right] - \frac{n-q}{2}{\rm tr}\left[R\frac{\partial^2 A}{\partial \tau_l \partial \tau_k} - R\frac{\partial A}{\partial \tau_l}R\frac{\partial A}{\partial \tau_k}\right]

The matrices :math:`P` and :math:`R` are defined as follows

.. math::
   \displaystyle P &\equiv A^{-1} - A^{-1}H(H^{\rm T}A^{-1}H)^{-1}H^{\rm T}A^{-1} \\
   \displaystyle R &\equiv P - Pf(D)(f(D)^{\rm T}Pf(D))^{-1}f(D)^{\rm T}P

The derivatives of the matrix :math:`A` with respect to :math:`\tau` are :math:`n \times n` matrices, whose (i,j)-th element is

.. math::
   \displaystyle \left(\frac{\partial A}{\partial \tau_k}\right)_{i,j} = (A)_{i,j}\frac{(x_{k,i}-x_{k,j})^2}{e^{\tau_k}}

In the above equation, :math:`(\cdot)_{i,j}` denotes the (i,j)-th element of a matrix, and the double subscripted :math:`x_{k,i}` denotes the k-th input of the i-th design point. The second cross derivative of :math:`A`, for :math:`l \neq k`, is

.. math::
   \displaystyle \left(\frac{\partial^2 A}{\partial \tau_l \partial \tau_k}\right)_{i,j} = (A)_{i,j}\frac{(x_{l,i}-x_{l,j})^2}{e^{\tau_l}}\frac{(x_{k,i}-x_{k,j})^2}{e^{\tau_k}}

and finally,

.. math::
   \displaystyle \left(\frac{\partial^2 A}{\partial \tau_k^2}\right)_{i,j} = (A)_{i,j}\left[\left(\frac{(x_{k,i}-x_{k,j})^2}{e^{\tau_k}}\right)^2 - \frac{(x_{k,i}-x_{k,j})^2}{e^{\tau_k}}\right]

If we denote by :math:`\hat{\tau}` the value of :math:`\tau` that maximises :math:`g(\tau)`, the posterior mode of :math:`\pi^*_{\delta}(\delta)` is found simply by the back transformation :math:`\hat{\delta} = \exp(\hat{\tau}/2)`.

Additional Comments
-------------------

The posterior distribution of the correlation lengths can often have multiple modes, especially in emulators with a large number of inputs. It is therefore recommended that the optimisation algorithm be run several times from different starting points, to ensure that the global maximum is found.

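To illustrate how the derivative expressions and the multi-start advice might be combined in practice, the sketch below evaluates the analytic first derivative of :math:`g(\tau)` and feeds it to a gradient-based optimiser (SciPy's BFGS here) started from several random points, before applying the back transformation. It reuses the hypothetical ``neg_log_posterior`` function from the earlier sketch and makes the same assumptions about the correlation function; the function names, the choice of optimiser and the restart scheme are illustrative only.

.. code-block:: python

   import numpy as np
   from scipy.optimize import minimize

   def neg_log_posterior_grad(tau, D, f, H):
       """Gradient of -g(tau) from the closed-form expressions above (uniform prior)."""
       n, p = D.shape
       q = H.shape[1]
       delta2 = np.exp(tau)

       diff2 = (D[:, None, :] - D[None, :, :]) ** 2
       A = np.exp(-np.sum(diff2 / delta2, axis=2))

       Ainv = np.linalg.inv(A)
       Ainv_H = Ainv @ H
       P = Ainv - Ainv_H @ np.linalg.solve(H.T @ Ainv_H, Ainv_H.T)
       Pf = P @ f
       R = P - np.outer(Pf, Pf) / (f @ Pf)

       grad = np.empty(p)
       for k in range(p):
           dA_k = A * diff2[:, :, k] / delta2[k]         # (dA/dtau_k)_{ij}
           grad[k] = (-0.5 * (1 - n + q) * np.trace(P @ dA_k)
                      - 0.5 * (n - q) * np.trace(R @ dA_k))
       return -grad

   def find_posterior_mode(D, f, H, n_starts=10, rng=None):
       """Multi-start maximisation of g(tau); returns delta_hat = exp(tau_hat / 2)."""
       rng = np.random.default_rng(rng)
       p = D.shape[1]
       best = None
       for _ in range(n_starts):
           tau0 = rng.normal(size=p)                     # random start in tau space
           res = minimize(neg_log_posterior, tau0, args=(D, f, H),
                          jac=neg_log_posterior_grad, method="BFGS")
           if best is None or res.fun < best.fun:
               best = res
       return np.exp(best.x / 2.0)                       # back-transform to delta
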