Discussion: The Observation Equation

Description and background

Observations on the real system, measurement errors and their relation to the real system are presented in the variant thread on linking models to reality using model discrepancy (ThreadVariantModelDiscrepancy). Here, we discuss observations on the real system in more detail.

Notation

In accordance with the standard toolkit notation, we denote the model simulator by \(f\) and its inputs by \(x\). We denote by \(z\) an observation or measurement on a real system value \(y\), and denote by \(\epsilon\) the measurement error (ThreadVariantModelDiscrepancy). The notation covers replicate observations, as described in the discussion page on structured forms for the model discrepancy (DiscStructuredMD).

Discussion

Statistical assumptions

While a value for \(z\) is observed (for example, the temperature at a particular time and place on the Earth’s surface), neither the real system value \(y\) (the actual temperature at that time and place) nor the measurement error \(\epsilon\) is observed. We link these random quantities \(z\), \(y\) and \(\epsilon\) using the observation equation

\[z = y + \epsilon\]

as defined in ThreadVariantModelDiscrepancy.

Typical statistical assumptions are that \(y\) and \(\epsilon\) are independent or uncorrelated with \(\textrm{E}[\epsilon]=0\) and \(\textrm{Var}[\epsilon]=\Sigma_\epsilon\).

The variance matrix \(\Sigma_\epsilon\) is often assumed either to be completely specified (possibly from extensive experience of using the measurement process) and the same for each observation, or to have a particular simple parameterised form; for example, \(\sigma_\epsilon^2\Sigma\), where \(\Sigma\) is completely specified and \(\sigma_\epsilon\) is an unknown scalar standard deviation that we can learn about from the field observations \(z\), especially when there are replicate observations; see DiscStructuredMD.
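As a minimal sketch of how replicate observations can inform \(\sigma_\epsilon\), assume for illustration only that \(\Sigma\) is the identity, that the errors are Gaussian, and that each of several system values is observed the same number of times. Under those assumptions the pooled within-group sample variance estimates \(\sigma_\epsilon^2\) without knowledge of \(y\). The function name `pooled_error_variance`, the array sizes and the "true" value of \(\sigma_\epsilon\) are hypothetical and not part of the toolkit.

```python
import numpy as np

def pooled_error_variance(z):
    """Pooled within-group variance of replicate observations.

    z has shape (n_points, n_replicates): row i holds replicate
    observations of the same unknown system value y_i.
    """
    z = np.asarray(z, dtype=float)
    n_points, n_rep = z.shape
    # Subtracting each row mean removes the unknown y_i from that row.
    resid = z - z.mean(axis=1, keepdims=True)
    # Each row contributes n_rep - 1 degrees of freedom.
    return (resid ** 2).sum() / (n_points * (n_rep - 1))

rng = np.random.default_rng(0)
sigma_eps = 0.5                              # assumed value for the toy example
y = rng.normal(10.0, 3.0, size=2000)         # unobserved system values
eps = rng.normal(0.0, sigma_eps, size=(2000, 5))
z = y[:, None] + eps                         # observation equation z = y + eps

sigma2_hat = pooled_error_variance(z)        # estimate of sigma_eps ** 2
```

With a general specified \(\Sigma\) or unequal numbers of replicates the estimator would need to be adapted; DiscStructuredMD gives the notation for the replicate case.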

Simple consequences of the statistical assumptions are \(\textrm{E}[z]=\textrm{E}[y]\) and \(\textrm{Var}[z]=\textrm{Var}[y] + \Sigma_\epsilon\). Thus, \(z\) is unbiased for the expected real system value, but the variance of \(z\) is the measurement error variance inflated by \(\textrm{Var}[y]\), which can be quite large because it involves the model discrepancy variance; see ThreadVariantModelDiscrepancy.
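The two consequences can be checked numerically. The following is a small Monte Carlo sketch, assuming Gaussian distributions for \(y\) and \(\epsilon\) purely for convenience (the results themselves only need \(y\) and \(\epsilon\) uncorrelated and \(\textrm{E}[\epsilon]=0\)). All numerical values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000                                  # number of simulated draws

mean_y = np.array([15.0, 12.0])              # assumed E[y]
var_y = np.array([[4.0, 1.0],
                  [1.0, 3.0]])               # assumed Var[y]
sigma_eps = np.array([[0.5, 0.0],
                      [0.0, 0.2]])           # assumed measurement error variance

# Draw y and eps independently, then apply the observation equation.
y = rng.multivariate_normal(mean_y, var_y, size=n)
eps = rng.multivariate_normal(np.zeros(2), sigma_eps, size=n)
z = y + eps

mean_z_hat = z.mean(axis=0)                  # should be close to E[y]
var_z_hat = np.cov(z, rowvar=False)          # should be close to Var[y] + Sigma_eps
var_z_theory = var_y + sigma_eps
```

Comparing `mean_z_hat` with `mean_y` and `var_z_hat` with `var_z_theory` shows agreement up to Monte Carlo error.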

Observations as functions of system values

Sometimes our observations \(z\) are known functions \(g(y)\) of system values \(y\) plus measurement error. The linear case \(z=Hy + \epsilon\) is straightforward and can be dealt with using our current methodology. Here \(H\) is a matrix that either selects a collection of individual components of \(y\) or forms more general linear combinations, such as averages of certain components of \(y\). In this case, both the expectation and variance of \(z\) are simply expressed. Moreover, \(Hy=Hf(x^+) +Hd\), which we can rewrite as \(y'=f'(x^+) +d'\) by putting \(y'=Hy\), \(f'=Hf\) and \(d'=Hd\), so that \(x^+\) is still the best input and \(f'\) and \(d'\) are still independent (DiscBestInput).
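For the linear case, the statistical assumptions above give \(\textrm{E}[z]=H\textrm{E}[y]\) and \(\textrm{Var}[z]=H\textrm{Var}[y]H^T+\Sigma_\epsilon\). A minimal numerical sketch follows, with a hypothetical \(H\) whose first row selects one component of \(y\) and whose second row averages the other two; all numbers are illustrative.

```python
import numpy as np

mean_y = np.array([10.0, 12.0, 14.0])        # assumed E[y]
var_y = np.diag([4.0, 4.0, 4.0])             # assumed Var[y]

H = np.array([
    [1.0, 0.0, 0.0],                         # select the first component of y
    [0.0, 0.5, 0.5],                         # average the second and third
])
sigma_eps = np.diag([0.25, 0.25])            # assumed measurement error variance

mean_z = H @ mean_y                          # E[z] = H E[y]
var_z = H @ var_y @ H.T + sigma_eps          # Var[z] = H Var[y] H^T + Sigma_eps
```

Here `mean_z` is `[10, 13]` and `var_z` is diagonal with entries `4.25` and `2.25`: averaging two uncorrelated components halves their contribution to the variance before the measurement error is added.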

The case where measurements \(z=g(y)+\epsilon\) are made on a nonlinear function \(g\) can also be reformulated as \(y'=f'(x^+) +d'\) with the usual best input assumptions. Thus, if we put \(y'=g(y)\) and \(f'(x)=g(f(x))\), we may write \(z=y'+\epsilon\) and \(y'=f'(x^{+'})+d'\) with \(d'\) independent of \(f'\) and \(x^{+'}\), and analysis may proceed as before. It should be noted, however, that when \(g\) is nonlinear, it can be shown that it is incoherent to apply the best input approach simultaneously to both \(f\) and \(f'\). In practice, we choose the formulation which best suits our purpose.
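In computational terms the reformulation amounts to wrapping the simulator, so that its output is on the same scale as the observations. The sketch below uses a hypothetical toy simulator `f` and a hypothetical observation function `g`; neither comes from the toolkit.

```python
import numpy as np

def f(x):
    """Hypothetical toy simulator, standing in for an expensive model."""
    return 1.0 + np.sin(x)

def g(y):
    """Hypothetical known nonlinear observation function."""
    return y ** 2

def f_prime(x):
    """Transformed simulator f'(x) = g(f(x))."""
    return g(f(x))

x = np.linspace(0.0, np.pi, 5)               # a few illustrative input values
y_prime_runs = f_prime(x)                    # simulator runs on the scale of z
```

Under this formulation, the best input assumptions and the discrepancy \(d'\) would be specified for \(f'\), on the scale of \(g(y)\), rather than for \(f\).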

Additional comments

Note that we are assuming measurement errors are additive, whereas in some situations they may be multiplicative; that is, \(z=y\epsilon\). In that case we can try to work either directly with the multiplicative relationship or with the additive relationship on the logarithmic scale, \(\log z=\log y+\log \epsilon\); neither is straightforward. The logarithmic form is, however, covered by the discussion above about nonlinear functions, with \(g(y)=\log (y)\).
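A short sketch of the logarithmic relationship, assuming positive system values and, purely for illustration, a lognormal multiplicative error (so that \(\log\epsilon\) is Gaussian with mean zero). The numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.array([2.0, 5.0, 9.0])                # positive system values
log_eps = rng.normal(0.0, 0.1, size=3)       # assumed: log-error is N(0, 0.1^2)
eps = np.exp(log_eps)                        # multiplicative error, always > 0

z = y * eps                                  # multiplicative observation equation
lhs = np.log(z)
rhs = np.log(y) + np.log(eps)                # additive form on the log scale
```

One reason care is needed: a zero mean for \(\log\epsilon\) does not give \(\textrm{E}[\epsilon]=1\). For the lognormal error assumed here, \(\textrm{E}[\epsilon]=\exp(0.1^2/2)\), so \(z\) is not exactly unbiased for \(y\) on the original scale.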

Sometimes we have replicate system observations: see DiscStructuredMD for a detailed account, including notation to accommodate generalised indexing such as space-time location of observations, which are regarded as control inputs to the simulator \(f\) in addition to inputs \(x\).