This topic is about Poisson overdispersion and how it relates to model misspecification, i.e. unknown covariates.
We load the libraries, including INLA (see Installation and general troubleshooting), and set the random seed. Note that the results change when you change the random seed.
library(INLA)
library(fields)
library(ggplot2)
set.seed(201803)
The data is generated by \[y_i \sim \text{Pois}(\lambda_i = \exp(\eta_i)),\] with the linear predictor \[\eta_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + v_1 x_{3,i}.\] We want to infer the three \(\beta\)s from the data, and we know the two covariates \(x_1\) and \(x_2\). However, we do not know the covariate \(x_3\), and so we cannot infer \(v_1\)!
Does this happen in practice? (An unmeasured covariate?) Yes, in nearly all applications!
In most cases, people ignore the unknown covariate(s) and fit a model to the data they have. From a statistical point of view, this treats the \(v_1 x_3\) term as noise and infers the rest of the model. Next, we explore what happens when you do this.
The main difference between this topic and BTopic124 is the use of a Poisson likelihood instead of a Gaussian likelihood. We will refer to BTopic124 as the “Gaussian case”.
Set the parameters and covariates. Since we assume the covariates are scaled, we create covariates with a mean of 0 and a standard deviation near 1. Feel free to download the R code and change these parameter values; note that you can get wildly different results with different values.
N = 200
x1 = runif(N, min=-2, max=2)
x2 = runif(N, min=-2, max=2)
x3 = runif(N, min=-2, max=2)  # the covariate we will pretend was never measured
beta0 = 3
beta1 = 0.7
beta2 = -0.3
v1 = 0.3  # effect of the unmeasured covariate x3
Simulate the truth.
eta = beta0 + beta1*x1 + beta2*x2 + v1*x3
y = rpois(n=N, lambda = exp(eta))
Create the data frame representing the data collected in the field. Note that we do not include \(x_3\), as it was not measured (perhaps it was too expensive, or it was never considered).
df = data.frame(y = y, x1=x1, x2=x2)
We fit Model 1, which uses only the measured covariates \(x_1\) and \(x_2\).
formula1 = y ~ x1 + x2
res1 = inla(formula1, family = "poisson", data = df,
            control.predictor = list(compute=T))
res1$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 2.99 0.018 2.95 2.99 3.02 2.99 0
## x1 0.74 0.014 0.71 0.74 0.77 0.74 0
## x2 -0.28 0.012 -0.31 -0.28 -0.26 -0.28 0
We fit Model 2, which adds an overdispersion term: an iid Gaussian random effect with one value per observation, indexed by id.iid.
formula2 = y ~ x1 + x2 + f(id.iid, model="iid")
res2 = inla(formula2, family = "poisson",
            data = data.frame(df, id.iid = 1:nrow(df)),
            control.predictor = list(compute=T))
res2$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 2.95 0.032 2.88 2.95 3.00 2.95 3.2e-09
## x1 0.71 0.029 0.66 0.71 0.77 0.71 5.6e-10
## x2 -0.29 0.026 -0.34 -0.29 -0.24 -0.29 1.0e-09
For comparison, we also fit the model that includes the unmeasured covariate \(x_3\), which we could not do in practice. We will not comment on this fit, but it is included in case you want to see it.
formula3 = y ~ x1 + x2 + x3
res3 = inla(formula3, family = "poisson",
            data = data.frame(df, x3 = x3),
            control.predictor = list(compute=T))
res3$summary.fixed
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 2.99 0.018 2.95 2.99 3.02 2.99 0
## x1 0.71 0.014 0.68 0.71 0.74 0.71 0
## x2 -0.29 0.012 -0.32 -0.29 -0.27 -0.29 0
## x3 0.30 0.011 0.28 0.30 0.32 0.30 0
Both models seem to produce reasonable point estimates. This is because the models are simple and the dataset is reasonably large. Many different likelihoods would give reasonable point estimates; what separates different likelihoods is mainly the uncertainty estimation. So the question is: are the intervals reasonable?
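To make this concrete, here is a minimal sketch that lines up the true values with the posterior means and standard deviations from the two fits. The point estimates are close, while Model 2 reports clearly larger standard deviations.
## side-by-side comparison of the two fits (sketch)
data.frame(true      = c(beta0, beta1, beta2),
           mean.mod1 = res1$summary.fixed$mean,
           sd.mod1   = res1$summary.fixed$sd,
           mean.mod2 = res2$summary.fixed$mean,
           sd.mod2   = res2$summary.fixed$sd,
           row.names = rownames(res1$summary.fixed))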
We want to look at the coverage of 80% credible intervals. We choose 80% because a rough estimate of the coverage probability needs fewer simulations than for, e.g., 95% intervals.
one.sim.coverage = function(x) {
  ## simulate a new dataset from the true model
  df$y = rpois(n=N, lambda = exp(eta))
  ## inference for Model 1 and Model 2
  res1 = inla(formula1, family = "poisson", data = df,
              quantiles = c(0.1, 0.9),
              num.threads = 1)
  res2 = inla(formula2, family = "poisson",
              data = data.frame(df, id.iid = 1:nrow(df)),
              quantiles = c(0.1, 0.9),
              num.threads = 1)
  ## does the 80% credible interval cover the true value? (TRUE/FALSE)
  a1 = (beta1 > res1$summary.fixed$`0.1quant`[2] & beta1 < res1$summary.fixed$`0.9quant`[2])
  a1.2 = (beta2 > res1$summary.fixed$`0.1quant`[3] & beta2 < res1$summary.fixed$`0.9quant`[3])
  a2 = (beta1 > res2$summary.fixed$`0.1quant`[2] & beta1 < res2$summary.fixed$`0.9quant`[2])
  a2.2 = (beta2 > res2$summary.fixed$`0.1quant`[3] & beta2 < res2$summary.fixed$`0.9quant`[3])
  return(c(a1, a1.2, a2, a2.2))
}
Let us now run the simulation study. This is somewhat time-consuming (around 5 minutes).
n.sim = 100
if (require(parallel) && (.Platform$OS.type != "windows")) {
  # - run in parallel if you have the parallel package installed
  # - mclapply does not work on windows!
  res.sim.raw = mclapply(as.list(1:n.sim), mc.cores=3, FUN = one.sim.coverage)
} else {
  res.sim.raw = lapply(as.list(1:n.sim), FUN = one.sim.coverage)
}
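If you are on Windows, a similar speed-up is possible with a PSOCK cluster instead of mclapply. This is a minimal sketch, assuming 3 cores and that you export all the objects the function needs; adjust it to your setup.
## Windows-friendly alternative: a PSOCK cluster (minimal sketch)
library(parallel)
cl = makeCluster(3)
clusterEvalQ(cl, library(INLA))
clusterExport(cl, c("df", "eta", "N", "beta1", "beta2", "formula1", "formula2"))
res.sim.raw = parLapply(cl, as.list(1:n.sim), one.sim.coverage)
stopCluster(cl)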
We look at the resulting coverage for the two models.
## combine results into a 4 x n.sim matrix
## rows: Model 1 beta1, Model 1 beta2, Model 2 beta1, Model 2 beta2
res.sim = as.matrix(data.frame(res.sim.raw))
colnames(res.sim) = NULL
## Model 1: beta1
table(res.sim[1, ])
##
## FALSE TRUE
## 81 19
## Model 1: beta2
table(res.sim[2, ])
##
## FALSE TRUE
## 31 69
## Model 2: beta1
table(res.sim[3, ])
##
## FALSE TRUE
## 9 91
## Model 2: beta2
table(res.sim[4, ])
##
## FALSE TRUE
## 12 88
We see that Model 1 has very bad coverage. This is due to poor estimates of the uncertainty, and hence poorly calibrated credible intervals. The coverage of Model 2 is much better, because the \(v_1 x_3\) term is absorbed by the overdispersion (the iid effect). Curiously enough, with our initial random seed, the coverage of Model 2 is a little too good.
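To quantify how far these counts are from the nominal 80%, one can compute the empirical coverage of each interval together with a rough binomial confidence interval; a minimal sketch using base R:
## empirical coverage per row (Model 1 beta1/beta2, Model 2 beta1/beta2)
rowMeans(res.sim)
## rough 95% confidence intervals for the four coverage probabilities
t(sapply(1:4, function(i) binom.test(sum(res.sim[i, ]), n.sim)$conf.int))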
When it comes to fitting Poisson count data, I always include the iid effect, as in Model 2. This is to account for any unmeasured covariates, in the same way that the Gaussian likelihood automatically accounts for them. I like to have this “overdispersion effect” as a Gaussian on the linear predictor scale, because that is where/how I would add other linear or non-linear effects (e.g. \(v_1 x_3\)) if I knew about them. The way I test for overdispersion is to check whether this iid effect is small in the posterior.
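In INLA, one way to do this check is sketched below: look at the posterior of the precision of the iid effect and transform its marginal to the standard deviation scale, which is easier to judge. The hyperparameter name follows INLA's naming for the f(id.iid, ...) term; verify it with names(res2$marginals.hyperpar) in your own run.
## posterior of the iid effect size, on the standard deviation scale (sketch)
res2$summary.hyperpar
marg.prec = res2$marginals.hyperpar$`Precision for id.iid`
marg.sd = inla.tmarginal(function(x) 1/sqrt(x), marg.prec)
inla.zmarginal(marg.sd)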
This topic was created, essentially, so that I can show it to collaborators, since adding this iid effect to the linear predictor is not standard practice. (More commonly, researchers test for overdispersion and switch to a Negative Binomial likelihood if they find it.)