1 About

This topic is about Poisson overdispersion and how it relates to model misspecification, i.e. unknown covariates.

1.1 Initialisation

We load libraries, including INLA (see Installation and general troubleshooting), and set the random seed. Note that the results change when you change the random seed.

library(INLA)
library(fields)
library(ggplot2)
set.seed(201803)

2 The true model

The data is generated by \[y_i \sim \text{Pois}(\lambda_i = \exp(\eta_i)),\] where the linear predictor is \[\eta_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + v_1 x_{3,i}.\] We want to infer the three \(\beta\)s from the data, and we know (have measured) the two covariates \(x_1\) and \(x_2\). However, we do not know the covariate \(x_3\), and so we cannot infer \(v_1\)!

2.1 Why we consider this problem

Does this happen in practice? (An unmeasured covariate?) Yes, in nearly all applications!

What people do in most cases is ignore the unknown covariate(s) and fit a model to the data they have. From a statistical point of view, this treats the \(v_1 x_3\) term as noise while inferring the rest of the model. Next, we explore what happens when you do this.

2.2 Topic similar to BTopic124

The main difference between the two topics is the use of a Poisson likelihood instead of a Gaussian likelihood. We will refer to BTopic124 as the “Gaussian case”.

3 Simulating a dataset

Set the parameters and covariates. Since covariates are usually scaled, we create covariates with a mean of 0 and a standard deviation near 1. Feel free to download the R code and change these parameter values; note that you can get wildly different results when you use different values.

## number of observations
N = 200
## covariates drawn from Uniform(-2, 2): mean 0, sd near 1
x1 = runif(N, min=-2, max=2)
x2 = runif(N, min=-2, max=2)
x3 = runif(N, min=-2, max=2)
## true parameter values
beta0 = 3
beta1 = 0.7
beta2 = -0.3
v1 = 0.3
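
As a quick check on the claim that these covariates have a standard deviation near 1 (this snippet is an addition to the original code): a Uniform(\(-2, 2\)) variable has standard deviation \((b-a)/\sqrt{12} = 4/\sqrt{12} \approx 1.15\).

## theoretical sd of a Uniform(-2, 2) variable (added check)
4 / sqrt(12)
## empirical sd of one simulated covariate
sd(x1)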

Simulate the truth.

eta = beta0 + beta1*x1 + beta2*x2 + v1*x3
y = rpois(n=N, lambda = exp(eta))
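
If you want a quick look at the simulated counts before fitting anything, a simple summary and histogram will do (this check is an addition, not part of the original topic).

summary(y)
hist(y, breaks = 30, main = "Simulated counts")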

Create the data frame representing the data collected in the field. Note that we do not include \(x_3\), as it is not measured (it was too expensive or it was never considered).

df = data.frame(y = y, x1=x1, x2=x2)

4 Inference

4.1 Model 1: Poisson

formula1 = y ~ x1 + x2
res1 = inla(formula1, family = "poisson", data = df,
            control.predictor = list(compute=T))
res1$summary.fixed
##              mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## (Intercept)  2.99 0.018       2.95     2.99      3.02  2.99   0
## x1           0.74 0.014       0.71     0.74      0.77  0.74   0
## x2          -0.28 0.012      -0.31    -0.28     -0.26 -0.28   0

4.2 Model 2: Overdispersed Poisson

Here we add an iid Gaussian effect on the linear predictor, one independent value per observation; this overdispersion term is what turns Model 1 into an overdispersed Poisson model.

formula2 = y ~ x1 + x2 + f(id.iid, model="iid")
res2 = inla(formula2, family = "poisson", 
            data = data.frame(df, id.iid = 1:nrow(df)),
            control.predictor = list(compute=T))
res2$summary.fixed
##              mean    sd 0.025quant 0.5quant 0.975quant  mode     kld
## (Intercept)  2.95 0.032       2.88     2.95      3.00  2.95 3.2e-09
## x1           0.71 0.029       0.66     0.71      0.77  0.71 5.6e-10
## x2          -0.29 0.026      -0.34    -0.29     -0.24 -0.29 1.0e-09
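
To see how much overdispersion has been estimated, we can inspect the hyperparameter of the iid effect, which is the precision of the Gaussian effect (this inspection is an addition to the original code).

res2$summary.hyperpar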

4.3 Model 3: The true model

We will not comment on this, but I include it in case you want to see it.

formula3 = y ~ x1 + x2 + x3
res3 = inla(formula3, family = "poisson", 
            data = data.frame(df, x3 = x3),
            control.predictor = list(compute=T))
res3$summary.fixed
##              mean    sd 0.025quant 0.5quant 0.975quant  mode kld
## (Intercept)  2.99 0.018       2.95     2.99      3.02  2.99   0
## x1           0.71 0.014       0.68     0.71      0.74  0.71   0
## x2          -0.29 0.012      -0.32    -0.29     -0.27 -0.29   0
## x3           0.30 0.011       0.28     0.30      0.32  0.30   0

5 Model comparison

Both models seem to produce reasonable point estimates. This is because the models are simple and the dataset is fairly large. Many different likelihoods would give reasonable point estimates; what mainly separates different likelihoods is the uncertainty estimation. So the question is: are the intervals reasonable?
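
One way to see this difference directly is to compare the posterior standard deviations of the fixed effects under the two models (a small addition to the original code); in the summaries above, those of Model 2 are roughly twice as large.

## compare the posterior sd of the fixed effects under the two models
data.frame(effect = rownames(res1$summary.fixed),
           model1 = res1$summary.fixed$sd,
           model2 = res2$summary.fixed$sd)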

5.1 Simulation study

We want to look at the coverage of 80% credible intervals. We choose 80% because a rough estimate of the coverage probability needs fewer simulations than for, e.g., 95% intervals.
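
To see why, consider the Monte Carlo error of the estimated miss rate \(q\) over \(n\) simulations, which has standard error \(\sqrt{q(1-q)/n}\); this is a much larger fraction of \(q = 0.05\) (95% intervals) than of \(q = 0.2\) (80% intervals). This back-of-the-envelope computation is an addition to the original topic.

## relative Monte Carlo error of the estimated miss rate with 100 simulations
q = c(0.20, 0.05)  # miss rates of 80% and 95% intervals
sqrt(q * (1 - q) / 100) / q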

one.sim.coverage = function(x) {
  ## x is just the simulation index; simulate a new response from the truth
  df$y = rpois(n=N, lambda = exp(eta))
  
  ## inference
  res1 = inla(formula1, family = "poisson", data = df,
              quantiles = c(0.1, 0.9),
              num.threads = 1)
  res2 = inla(formula2, family = "poisson", 
              data = data.frame(df, id.iid = 1:nrow(df)),
              quantiles = c(0.1, 0.9), 
              num.threads = 1)
  
  ## coverage T/F of 80% interval
  a1 = (beta1 > res1$summary.fixed$`0.1quant`[2] & beta1 < res1$summary.fixed$`0.9quant`[2])
  a1.2 = (beta2 > res1$summary.fixed$`0.1quant`[3] & beta2 < res1$summary.fixed$`0.9quant`[3])
  a2 = (beta1 > res2$summary.fixed$`0.1quant`[2] & beta1 < res2$summary.fixed$`0.9quant`[2])
  a2.2 = (beta2 > res2$summary.fixed$`0.1quant`[3] & beta2 < res2$summary.fixed$`0.9quant`[3])
  return(c(a1, a1.2, a2, a2.2))
}

Let us now run this simulation study. This is somewhat time-consuming (around 5 minutes).

n.sim = 100
if (require(parallel) && (.Platform$OS.type != "windows")) {
  # - run in parallel if you have this package installed
  # - does not work on windows!
  res.sim.raw = mclapply(as.list(1:n.sim), mc.cores=3, FUN = one.sim.coverage)
} else {
  res.sim.raw = lapply(as.list(1:n.sim), FUN = one.sim.coverage)
}

5.2 Results

We look at the resulting coverage for the two models.

res.sim = as.matrix(data.frame(res.sim.raw))
colnames(res.sim) = NULL
## Model 1: beta1
table(res.sim[1, ])
## 
## FALSE  TRUE 
##    77    23
## Model 1: beta2
table(res.sim[2, ])
## 
## FALSE  TRUE 
##    32    68
## Model 2: beta1
table(res.sim[3, ])
## 
## FALSE  TRUE 
##     8    92
## Model 2: beta2
table(res.sim[4, ])
## 
## FALSE  TRUE 
##     9    91
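
The four coverage proportions can also be read off in one line (an added convenience, not in the original code).

## coverage proportions: Model 1 beta1, Model 1 beta2, Model 2 beta1, Model 2 beta2
rowMeans(res.sim)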

We see that Model 1 has very bad coverage. This is due to bad estimates of the uncertainty, and hence bad credible intervals. The coverage of Model 2 is much better, because the \(v_1 x_3\) term is absorbed by the overdispersion (the iid effect). Curiously enough, with our initial random seed, the coverage of Model 2 is a little too high (above the nominal 80%).

6 Discussion

When it comes to fitting Poisson count data, I always include the iid effect, as in Model 2. This covers for any unmeasured covariates, in the same way that the Gaussian likelihood does automatically. I like to have this “overdispersion effect” as a Gaussian on the linear predictor scale, because that is where/how I would add other linear or non-linear effects (e.g. \(v_1 x_3\)) if I knew about them. The way I test for overdispersion is to check whether this iid effect is small in the posterior.
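
As a sketch of that check, using Model 2 from above (this snippet is an addition; inla.tmarginal and inla.zmarginal are standard INLA utilities), we can transform the posterior of the iid effect's precision to the standard deviation scale.

## posterior of the iid effect's size on the sd scale (added code)
m.prec = res2$marginals.hyperpar[[1]]  # marginal of the precision
m.sd = inla.tmarginal(function(tau) 1/sqrt(tau), m.prec)
inla.zmarginal(m.sd)  # a small sd indicates little overdispersion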

This topic was created, essentially, so that I can show it to collaborators, since adding this iid effect in the likelihood/linear predictor is not standard practice. (More commonly, researchers test for overdispersion and switch to a Negative Binomial likelihood if they find it.)