This topic is about model misspecification, with unknown covariates and a Gaussian likelihood.
We load the libraries, including INLA (see Installation and general troubleshooting), and set the random seed.
library(INLA)
library(fields)
library(ggplot2)
set.seed(201803)
The data are generated by \[y_i \sim \mathcal N(\eta_i, \sigma_\epsilon^2),\] where \(\sigma_\epsilon\) is the noise standard deviation, and the linear predictor is \[\eta_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + v_1 x_{3,i}.\] We want to infer the three \(\beta\)s from the data, and we know the two covariates \(x_1\) and \(x_2\). However, we do not know the covariate \(x_3\), so we cannot infer \(v_1\)!
Does this happen in practice? (An unmeasured covariate?) Yes, in nearly all applications!
In most cases, people ignore the unknown covariate(s) and fit a model to the data they have. From a statistical point of view, this treats the term \(v_1 x_3\) as noise and infers the rest of the model. Next, we explore what happens when you do this.
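Concretely, treating \(v_1 x_3\) as noise means fitting \[y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \epsilon_i^*, \qquad \epsilon_i^* = v_1 x_{3,i} + \epsilon_i.\] When \(x_3\) is independent of the other covariates and of the noise, the misspecified model simply sees an inflated noise variance, \[\operatorname{Var}(\epsilon^*) = v_1^2 \operatorname{Var}(x_3) + \sigma_\epsilon^2.\] We verify this numerically below.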
Set the parameters and covariates. We assume the covariates are scaled, so we create covariates with a mean of 0 and a standard deviation near 1.
N = 500
x1 = runif(N, min=-2, max=2)
x2 = runif(N, min=-2, max=2)
x3 = runif(N, min=-2, max=2)
beta0 = 3
beta1 = 0.7
beta2 = -0.3
# - coefficient for the unmeasured covariate x3
v1 = 0.3
# - standard deviation of the Gaussian noise
sig.eps = 0.4
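The covariates are indeed roughly on unit scale: the standard deviation of a Uniform(\(-2, 2\)) variable is \(4/\sqrt{12} \approx 1.15\). A quick sanity check (a small addition, using only base R):
# - empirical sds of the simulated covariates; each should be near 1.15
round(c(sd(x1), sd(x2), sd(x3)), 2)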
Simulate the truth.
eta = beta0 + beta1*x1 + beta2*x2 + v1*x3
y = eta + sig.eps*rnorm(N)
Create the dataframe representing the data collected in the field. Note that we do not include \(x_3\), as it was not measured (it was too expensive, or it was never considered).
df = data.frame(y = y, x1=x1, x2=x2)
formula = y ~ x1 + x2
res1 = inla(formula, family = "gaussian", data = df,
            control.predictor = list(compute=TRUE))
res1$summary.fixed
##              mean    sd 0.025quant 0.5quant 0.975quant  mode     kld
## (Intercept)  3.02 0.024       2.97     3.02       3.06  3.02 2.5e-11
## x1           0.70 0.022       0.65     0.70       0.74  0.70 2.7e-11
## x2          -0.31 0.021      -0.35    -0.31      -0.27 -0.31 2.7e-11
We see that the model recovers the true parameters! We do not expect to recover the true noise level, though, as the term \(v_1 x_3\) has been added to it.
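As a quick cross-check (a small addition, assuming only base R): this is a plain Gaussian linear model, so with INLA's default vague priors on the fixed effects, ordinary least squares should give essentially the same point estimates.
# - OLS point estimates should be close to INLA's posterior means
summary(lm(y ~ x1 + x2, data = df))$coefficients
Now, compare the model's residual noise to the true noise.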
(sd.model.noise = sd(y-res1$summary.linear.predictor$mean))
## [1] 0.53
(sd.true.noise = sd(y-eta))
## [1] 0.42
# - approximately equal to sig.eps
(sd.v1x3 = sd(v1*x3))
## [1] 0.34
(sqrt(sd.true.noise^2 + sd.v1x3^2))
## [1] 0.54
As expected, the first two numbers are not the same, but the first and the last are similar: the model's noise standard deviation is approximately \(\sqrt{\sigma_\epsilon^2 + \operatorname{Var}(v_1 x_3)}\), as derived above.
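We can also read off the noise level the model itself estimated, from the posterior of the Gaussian observation precision (a rough plug-in estimate; we transform the posterior mean of the precision to a standard deviation):
# - posterior mean precision of the Gaussian observations, as a standard deviation
1/sqrt(res1$summary.hyperpar["Precision for the Gaussian observations", "mean"])
# - should be close to sd.model.noise, around 0.53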
What if we had been able to measure \(x_3\)? For comparison, we fit the model with all three covariates (note that we must add \(x_3\) to the data):
df2 = data.frame(y = y, x1=x1, x2=x2, x3=x3)
formula2 = y ~ x1 + x2 + x3
res2 = inla(formula2, family = "gaussian", data = df2,
            control.predictor = list(compute=TRUE))
res2$summary.fixed
##              mean    sd 0.025quant 0.5quant 0.975quant  mode     kld
## (Intercept)  3.01 0.019       2.98     3.01       3.05  3.01 2.4e-11
## x1           0.71 0.017       0.67     0.71       0.74  0.71 2.7e-11
## x2          -0.30 0.017      -0.33    -0.30      -0.27 -0.30 2.7e-11
## x3           0.28 0.017       0.25     0.28       0.32  0.28 2.7e-11
With this model, the estimates are closer to the truth, and the posterior uncertainties are smaller than with model 1, as the side-by-side comparison below shows.
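To make the comparison concrete, we can put the posterior standard deviations of the shared fixed effects side by side (a small convenience snippet):
# - posterior sds for (Intercept), x1 and x2 in the two models
cbind(model1 = res1$summary.fixed[1:3, "sd"],
      model2 = res2$summary.fixed[1:3, "sd"])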
The example in this topic hints at the generally applicable approach of ignoring unknown covariates and fitting a model to the data you have.
In general, whether the estimates are unbiased, and how large their variance is, depends on the dependence between the known and unknown covariates and on the size of the unknown covariates' effects. In this example the estimates remain unbiased because \(x_3\) was simulated independently of \(x_1\) and \(x_2\); a correlated unmeasured covariate would act as a confounder and bias the estimates, as the sketch below illustrates.
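To illustrate (a hypothetical variation, not part of the original example), we can make the unmeasured covariate correlated with \(x_1\) and refit the misspecified model:
# - an unmeasured covariate correlated with x1 (classical omitted-variable bias)
x3c = 0.8*x1 + 0.6*runif(N, min=-2, max=2)
yc = beta0 + beta1*x1 + beta2*x2 + v1*x3c + sig.eps*rnorm(N)
dfc = data.frame(y = yc, x1=x1, x2=x2)
resc = inla(y ~ x1 + x2, family = "gaussian", data = dfc)
resc$summary.fixed
# - the x1 estimate should now land near beta1 + 0.8*v1 = 0.94, not 0.7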
[Please tip me about a few good citations that I can add here.]