1 About

This topic shows how I do a first pass at modelling data with many covariates and/or factors. We use strong shrinkage penalties, also known as ridge regression. For factors we use iid random effects, which handle factor levels with few observations and factors with many different levels.

We simulate synthetic data instead of using real data.

1.1 Pre-selection

We can do a pre-selection first, especially if there are many (say \(>100\)) covariates. There are many different approaches to pre-selection. I will not show any of these methods in this topic, but I mention them here.

  • Univariate: we plot and fit every covariate against the response, one at a time, and we have some kind of criteria for what to include or exclude.
  • Correlation: Remove some variables that have a correlation near 1 with other variables (a minimal sketch follows this list)
  • Random forest: Fit a boosting/RF model and pick the most important ones
  • Variance: Select variables with a large variance (only reasonable in some applications)
  • Other approaches
  • Combination: Use a combination of these
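
For the correlation approach, a minimal sketch in base R (the helper drop.correlated and the cutoff value are mine, for illustration only, not from any package):

## Sketch: drop one variable from each pair with |correlation| > cutoff
## (drop.correlated is an illustrative helper, not a package function)
drop.correlated = function(X, cutoff = 0.99) {
  cc = abs(cor(X))
  keep = rep(TRUE, ncol(X))
  for (j in 2:ncol(X)) {
    ## Compare only against columns we have decided to keep so far
    if (any(cc[j, 1:(j-1)][keep[1:(j-1)]] > cutoff)) keep[j] = FALSE
  }
  X[, keep, drop = FALSE]
}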

In the pre-selection, we have to be careful that we do not delete

  • Covariates that have a significant non-linear effect, but no significant linear one
  • Covariates that say nothing by themselves, but are important together with other covariates

1.2 Covariate transformations

Another point I mention for completeness is that covariates should often be transformed, especially when using linear models (as in this topic). It is often good to try several transformations. The main transformations I use are:

  • Logarithmic: \(\log(x)\), \(\log(x+1)\), \(\log(x-\min(x)+1)\)
  • Exponents: \(x^2\), \(x^3\)
  • Empirical quantile transform: Results in “equidistant behaviour”. Example code below.
## Empirical quantile transform
x = c(0:10, 20:21, 100:105, 1000)
qtx = ecdf(x)
## The transformed variable:
qtx(x)
##  [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65
## [14] 0.70 0.75 0.80 0.85 0.90 0.95 1.00
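
Since ecdf returns a function, the same transform can be applied to new values (say, a test set) without refitting:

## The stored transform applied to new data
qtx(c(5, 50, 500))
## [1] 0.30 0.65 0.95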

1.3 Packages

library(INLA)
inla.setOption("num.threads", 1)
set.seed(2021)

2 Simulation

2.1 Motivation

We have many covariates and many observations; many of the covariates have no effect, or a negligible effect, and many of them are correlated. This is meant to give a good example of a typical, difficult real-data case.

2.2 Input parameters

You can change these to try different things.

## Number of observations
n.obs = 5E3
## The Gaussian noise
sig.epsilon = 1.00
## Covariates
n.cov.important = 7
n.cov.noeffect = 20
## Fraction of new variable that is taken from some previous variable
## 0 gives uncorrelated, infinity gives perfectly correlated
corr.fraction = 0.7
## How many factors
## Please do not change this number (that requires further code changes)
n.factors = 6
## How many factor levels, at most
n.fact.levels = 7
## Typical effect size of important factor level
effect.size.fact = 2.10

2.3 Simulating continuous covariates

We simulate covariates from a normal distribution. In real data, the distribution is often not so nice at all! But, in the worst case, we can do an empirical quantile transform and get well-behaved marginals (a sketch follows the next code block).

## Matrix of covariates 
## Build this column by column
X = cbind(rnorm(n.obs))
## Which previous variable is each new covariate correlated with?
corr.with = rep(NA, n.cov.important + n.cov.noeffect)
for (i in 2:(n.cov.important+n.cov.noeffect)) {
  ## Choose a previous covariate to be correlated with
  corr.with[i] = sample(1:(i-1), 1)
  ## Create a new covariate
  new = (corr.fraction*X[, corr.with[i]] + rnorm(n.obs))/corr.fraction
  ## Add a column to the matrix of covariates
  X = cbind(X, new)
}
colnames(X) = paste0("x", 1:ncol(X))
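
As mentioned, if the marginals were ugly, we could quantile-transform every column; a minimal sketch (X.qt is an illustrative name, and we do not use it further):

## Sketch: empirical quantile transform of each column of X
## (not needed here, since the simulated covariates are Gaussian)
X.qt = apply(X, 2, function(v) ecdf(v)(v))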

A brief look at the variables and correlations:

summary(X[, 1:5])
##        x1             x2             x3             x4      
##  Min.   :-3.9   Min.   :-6.4   Min.   :-5.7   Min.   :-7.7  
##  1st Qu.:-0.7   1st Qu.:-1.2   1st Qu.:-1.1   1st Qu.:-1.5  
##  Median : 0.0   Median : 0.0   Median : 0.0   Median : 0.0  
##  Mean   : 0.0   Mean   : 0.0   Mean   : 0.0   Mean   : 0.0  
##  3rd Qu.: 0.7   3rd Qu.: 1.2   3rd Qu.: 1.2   3rd Qu.: 1.5  
##  Max.   : 3.7   Max.   : 6.5   Max.   : 6.5   Max.   : 8.1  
##        x5      
##  Min.   :-7.9  
##  1st Qu.:-1.2  
##  Median : 0.0  
##  Mean   : 0.0  
##  3rd Qu.: 1.2  
##  Max.   : 6.1
cor(X[, 1:5])
##      x1   x2   x3   x4   x5
## x1 1.00 0.58 0.59 0.45 0.58
## x2 0.58 1.00 0.33 0.78 0.35
## x3 0.59 0.33 1.00 0.25 0.34
## x4 0.45 0.78 0.25 1.00 0.27
## x5 0.58 0.35 0.34 0.27 1.00
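
The correlations of about 0.58 between x1 and the variables built from it match the construction: for a unit-variance parent variable, the implied correlation is \(c/\sqrt{c^2+1}\) with \(c\) = corr.fraction, about 0.57 here:

## Implied correlation with a unit-variance parent variable
corr.fraction / sqrt(corr.fraction^2 + 1)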

2.4 Simulating factors

Simulating factors is slightly harder, since each has several factor levels, and each factor level can have a different effect (on the linear predictor / response).

## Simulate a factor with n.fact.levels factor levels
new = sample(1:n.fact.levels, size=n.obs, replace = T)
## The factor (as strings)
new = paste0("F", 1, "level", new)
## Create the first column of the matrix of factors
FF = cbind(new)
## The effects of the factors (the first factor has no effect)
FFeff = cbind(rep(0, n.obs))
## Each factor has an effect level and an effect size
## Effect level gives which of the factor levels has a nonzero effect
effect.level = rep(NA, n.factors)
## Effect size gives how large the effect of that level is
effect.size = rep(NA, n.factors)
for (i in 2:(n.factors)) {
  ## Choose how many factor levels for this factor
  tmp.levels = sample(2:n.fact.levels, 1)
  ## Create a new factor
  new = sample(1:tmp.levels, size=n.obs, replace = T)
  ## Create its effect
  effect.level[i] = sample(1:tmp.levels, 1)
  effect.size[i] = rnorm(1)*effect.size.fact
  new.effect = (new==effect.level[i])*effect.size[i]
  ## Create the column for the new factor and add it to the matrix
  new = paste0("F", i, "level", new)
  FF = cbind(FF, new)
  ## Add the new effect column to the matrix
  FFeff = cbind(FFeff, new.effect)
}
colnames(FF) = paste0("fa", 1:ncol(FF))
colnames(FFeff) = paste0("fa", 1:ncol(FF), "eff")

Check that the factors look ok:

FF[1:8, 2:6]
##      fa2        fa3        fa4        fa5        fa6       
## [1,] "F2level2" "F3level7" "F4level4" "F5level4" "F6level4"
## [2,] "F2level1" "F3level4" "F4level4" "F5level6" "F6level4"
## [3,] "F2level2" "F3level2" "F4level5" "F5level7" "F6level2"
## [4,] "F2level2" "F3level5" "F4level5" "F5level7" "F6level3"
## [5,] "F2level2" "F3level5" "F4level2" "F5level1" "F6level2"
## [6,] "F2level2" "F3level3" "F4level5" "F5level3" "F6level2"
## [7,] "F2level1" "F3level2" "F4level5" "F5level1" "F6level4"
## [8,] "F2level1" "F3level2" "F4level1" "F5level2" "F6level2"
FFeff[1:8, 2:6]
##      fa2eff fa3eff fa4eff fa5eff fa6eff
## [1,]    2.3    0.0  -0.25    0.0   0.00
## [2,]    0.0    3.1  -0.25    0.0   0.00
## [3,]    2.3    0.0   0.00    0.0   0.00
## [4,]    2.3    0.0   0.00    0.0  -0.46
## [5,]    2.3    0.0   0.00    0.0   0.00
## [6,]    2.3    0.0   0.00    1.5   0.00
## [7,]    0.0    0.0   0.00    0.0   0.00
## [8,]    0.0    0.0   0.00    0.0   0.00
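
As an extra sanity check, we can cross-tabulate a factor against its effect column; exactly one level should carry a nonzero effect (shown here for fa2, output omitted):

## Cross-tabulate factor levels and effect values for fa2
table(FF[, 2], round(FFeff[, 2], 2))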

2.5 Simulating response

First we simulate the beta coefficients:

## 2 alternatives: no effect at all, or negligible effect
betas = c(abs(rnorm(n.cov.important))+0.5, rnorm(n.cov.noeffect)*0.000) * 0.1
# betas = c(abs(rnorm(n.cov.important))+0.5, rnorm(n.cov.noeffect)*0.001) * 0.1
gaussian.error = rnorm(n.obs, 0, sd=sig.epsilon)
## Compute the response from the simulated values
eta1 = X %*% betas
eta2 = rowSums(FFeff)
y = eta1 + eta2 + gaussian.error

Compare the variances of the different components. If the error variance is high compared to the eta variances, we have little information, and vice versa.

sd(eta1)
## [1] 1.4
sd(eta2)
## [1] 1.7
sd(gaussian.error)
## [1] 0.98
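
One crude summary of this is the share of the response variance coming from the signal, assuming roughly independent components; with the values above it is around 0.8:

## Approximate share of response variance explained by the signal
(sd(eta1)^2 + sd(eta2)^2) / (sd(eta1)^2 + sd(eta2)^2 + sd(gaussian.error)^2)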

3 Fit INLA models with continuous covariates

3.1 Linear regression

The following formula and inla call make it very easy to program models with many covariates. More commonly, people write all the covariate names into the formula, but that is inconvenient and error-prone.

form1 = y ~ 1 + X
fit1 = inla(form1, family="gaussian", data = list(y=y, X=X))
## Look at the first 8 fixed effects
fit1$summary.fixed[1:8, 1:2]
##              mean    sd
## (Intercept) 1.635 0.028
## x1          0.132 0.051
## x2          0.095 0.052
## x3          0.084 0.034
## x4          0.115 0.028
## x5          0.275 0.034
## x6          0.261 0.027
## x7          0.064 0.028
## How far is the estimate from the truth?
betas - fit1$summary.fixed$mean[-1]
##  [1]  0.00066 -0.03476 -0.02260  0.04595 -0.02337 -0.01539  0.00934
##  [8] -0.00746 -0.05116 -0.02132  0.01839  0.02465 -0.03092  0.01414
## [15]  0.00974  0.01200  0.00262  0.01497 -0.00138  0.01461 -0.01416
## [22]  0.03253 -0.00291 -0.00549  0.01519 -0.01298  0.01940
## Total error (root sum of squares)
sqrt(sum((betas - fit1$summary.fixed$mean[-1])^2))
## [1] 0.11
## Is the estimate within two std errors?
table((abs(betas - fit1$summary.fixed$mean[-1]))<fit1$summary.fixed$sd[-1]*2)
## 
## TRUE 
##   27
## Quantiles of the model sd
quantile(fit1$summary.fixed$sd[-1])
##    0%   25%   50%   75%  100% 
## 0.019 0.020 0.027 0.028 0.052

3.2 Vague ridge regression

We fit the same model as before, but we put a vague \(\mathcal N(0, 1)\) prior on all the betas.
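
In control.fixed, the prior is specified through its precision (inverse variance), so a \(\mathcal N(0, \text{sd}^2)\) prior corresponds to prec \(= 1/\text{sd}^2\); prec.from.sd below is just an illustrative helper:

## Precision corresponding to a given prior standard deviation
prec.from.sd = function(sd) 1/sd^2
prec.from.sd(c(1, 0.1))
## [1]   1 100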

fit2 = inla(form1, family="gaussian", data = list(y=y, X=X),
            control.fixed = list(prec=1))
## Total error (root sum of squares)
sqrt(sum((betas - fit2$summary.fixed$mean[-1])^2))
## [1] 0.11
## Is the estimate within two std errors?
table((abs(betas - fit2$summary.fixed$mean[-1]))<fit2$summary.fixed$sd[-1]*2)
## 
## TRUE 
##   27
## Quantiles of the model sd
quantile(fit2$summary.fixed$sd[-1])
##    0%   25%   50%   75%  100% 
## 0.019 0.020 0.027 0.028 0.051

3.3 Strict ridge regression

We fit the same model as before, but we put a strong \(\mathcal N(0, 0.1^2)\) prior on all the betas.

fit3 = inla(form1, family="gaussian", data = list(y=y, X=X),
            control.fixed = list(prec=100))
## Total error (root sum of squares)
sqrt(sum((betas - fit3$summary.fixed$mean[-1])^2))
## [1] 0.1
## Is the estimate within two std errors?
table((abs(betas - fit3$summary.fixed$mean[-1]))<fit3$summary.fixed$sd[-1]*2)
## 
## TRUE 
##   27
## Quantiles of the model sd
quantile(fit3$summary.fixed$sd[-1])
##    0%   25%   50%   75%  100% 
## 0.019 0.019 0.026 0.026 0.045

4 Fit INLA models with factor variables

We now proceed to using the factors instead of the covariates. We will add back the covariates at the end.

4.1 No hyperparameters

## Fixing the hyperparameter in the iid effect is the same as putting
## control.fixed = list(prec="what you fix it to") on a fixed effect
## theta1 is the log precision
hyper.fix = list(theta1 = list(initial=log(1), fixed=T))
## Dynamic hyperparameter: a PC prior where param=c(0.1, 0.5) means
## P(sd > 0.1) = 0.5, i.e. the prior median of the sd is 0.1
hyper.iid = list(theta1 = list(prior="pc.prec", param=c(0.1, 0.5)))
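
If you want to sanity-check this prior, INLA ships samplers for the PC precision prior; assuming inla.pc.rprec with arguments (n, u, alpha) is available, the implied standard deviations should have median close to 0.1:

## Sample precisions from the PC prior and check the implied sd median
tau = inla.pc.rprec(10000, u = 0.1, alpha = 0.5)
median(1/sqrt(tau))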

For these formulas, we access the factors by name, so they need to be present in the data object.

data1 = as.list(as.data.frame(FF))
data1$y = drop(y)
str(data1)
## List of 7
##  $ fa1: chr [1:5000] "F1level1" "F1level1" "F1level4" "F1level1" ...
##  $ fa2: chr [1:5000] "F2level2" "F2level1" "F2level2" "F2level2" ...
##  $ fa3: chr [1:5000] "F3level7" "F3level4" "F3level2" "F3level5" ...
##  $ fa4: chr [1:5000] "F4level4" "F4level4" "F4level5" "F4level5" ...
##  $ fa5: chr [1:5000] "F5level4" "F5level6" "F5level7" "F5level7" ...
##  $ fa6: chr [1:5000] "F6level4" "F6level4" "F6level2" "F6level3" ...
##  $ y  : num [1:5000] 2.614 1.819 1.887 2.611 0.632 ...

For the factors, we write out all the names explicitly in the code. If you change the number of factors under “Input parameters”, you also have to change the following formulas (or build them programmatically, as sketched after the model fit below).

form2 = y ~ 1 + f(fa1, model="iid", hyper=hyper.fix) + f(fa2, model="iid", hyper=hyper.fix) + f(fa3, model="iid", hyper=hyper.fix) + f(fa4, model="iid", hyper=hyper.fix) + f(fa5, model="iid", hyper=hyper.fix) + f(fa6, model="iid", hyper=hyper.fix)
fit4 = inla(form2, family="gaussian", data = data1)
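
A programmatic alternative is to build the formula from the factor names, so it adapts when n.factors changes; a sketch (form2.auto and fit4.auto are illustrative names):

## Build the same formula programmatically from the factor names
fa.terms = paste0("f(", colnames(FF), ", model='iid', hyper=hyper.fix)")
form2.auto = as.formula(paste("y ~ 1 +", paste(fa.terms, collapse=" + ")))
fit4.auto = inla(form2.auto, family="gaussian", data = data1)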

We look at the different factor levels. Note that the iid effects have a mean-zero prior, so the estimated effect is split between the levels; compare the contrast between levels with the true effect (for fa2 below, roughly \(1.2 - (-1.2) = 2.4\), close to the truth).

fit4$summary.random$fa1[, 1:3]
##         ID   mean   sd
## 1 F1level1 -0.112 0.38
## 2 F1level2  0.042 0.38
## 3 F1level3  0.035 0.38
## 4 F1level4 -0.046 0.38
## 5 F1level5  0.065 0.38
## 6 F1level6 -0.052 0.38
## 7 F1level7  0.070 0.38
print(paste("Effect on level", effect.level[2], "is", effect.size[2], 
            "compared to baseline 0"))
## [1] "Effect on level 2 is 2.32970852456796 compared to baseline 0"
fit4$summary.random$fa2[, 1:3]
##         ID mean   sd
## 1 F2level1 -1.2 0.71
## 2 F2level2  1.2 0.71

4.2 Allow hyperparameters

This will run slower than the previous model, since it optimises over the hyperparameters. Inference with a factor that does nothing (fa1) often gives some numerical problems, so any warnings here are expected and the results should still be ok. For real data, you might remove factors that have no effect, like fa1 here, and then you should no longer get warnings.

form3 = y ~ 1 + f(fa1, model="iid", hyper=hyper.iid) + f(fa2, model="iid", hyper=hyper.iid) + f(fa3, model="iid", hyper=hyper.iid) + f(fa4, model="iid", hyper=hyper.iid) + f(fa5, model="iid", hyper=hyper.iid) + f(fa6, model="iid", hyper=hyper.iid)
fit5 = inla(form3, family="gaussian", data = data1,
            control.inla = list(int.strategy="eb"))
## We see that the precision for fa1 is very big, so there is little effect of fa1
fit5$summary.hyperpar[, 1:2]
##                                           mean      sd
## Precision for the Gaussian observations   0.34 7.3e-03
## Precision for fa1                       249.45 2.9e+02
## Precision for fa2                         4.07 7.3e+00
## Precision for fa3                         0.90 3.7e-01
## Precision for fa4                       443.59 1.2e+03
## Precision for fa5                         4.32 2.0e+00
## Precision for fa6                        25.10 2.0e+01
## The following uncertainties are much better than in model 4
fit5$summary.random$fa1[, 1:3]
##         ID   mean    sd
## 1 F1level1 -0.071 0.058
## 2 F1level2  0.026 0.058
## 3 F1level3  0.023 0.057
## 4 F1level4 -0.030 0.058
## 5 F1level5  0.041 0.057
## 6 F1level6 -0.034 0.057
## 7 F1level7  0.045 0.058
## The following uncertainties are better than in model 4
fit5$summary.random$fa2[, 1:3]
##         ID mean   sd
## 1 F2level1 -1.2 0.67
## 2 F2level2  1.2 0.67

5 Fit INLA with everything

We fit all the covariates and factors at the same time, with the dynamic priors for the factors. The comments about numerical problems and warnings from the last section apply here too.

form4 = y ~ 1 + X + f(fa1, model="iid", hyper=hyper.iid) + f(fa2, model="iid", hyper=hyper.iid) + f(fa3, model="iid", hyper=hyper.iid) + f(fa4, model="iid", hyper=hyper.iid) + f(fa5, model="iid", hyper=hyper.iid) + f(fa6, model="iid", hyper=hyper.iid)
fit6 = inla(form4, family="gaussian", data = c(data1, list(X=X)),
            control.inla = list(int.strategy="eb"),
            control.fixed = list(prec=100))

Continuous covariates:

## Total error (root sum of squares)
sqrt(sum((betas - fit6$summary.fixed$mean[-1])^2))
## [1] 0.046
## Is the estimate within two std errors?
table((abs(betas - fit6$summary.fixed$mean[-1]))<fit6$summary.fixed$sd[-1]*2)
## 
## FALSE  TRUE 
##     1    26
## Quantiles of the model sd
quantile(fit6$summary.fixed$sd[-1])
##     0%    25%    50%    75%   100% 
## 0.0096 0.0097 0.0134 0.0137 0.0250

Factors:

print(paste("Effect on level", effect.level[2], "is", effect.size[2], 
            "compared to baseline 0"))
## [1] "Effect on level 2 is 2.32970852456796 compared to baseline 0"
fit6$summary.random$fa2[, 1:3]
##         ID mean  sd
## 1 F2level1 -1.2 1.1
## 2 F2level2  1.2 1.1

5.1 Next steps

This was the first round of modelling. We can look at the results and find ways to proceed.

  • Can any of the factors be removed, or any of the levels merged?
  • Can any of the covariates be removed or transformed?
  • Do we need nonlinear effects for some of the covariates? (A sketch follows this list.)
  • Do we need any interaction effects?
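
For the nonlinear question, a minimal sketch using a second-order random walk on a binned covariate (data2, form5, fit7 are illustrative names; inla.group bins the covariate values):

## Sketch: smooth nonlinear effect of x1 via an rw2 on binned values
data2 = c(data1, list(x1.group = inla.group(X[, 1], n = 25)))
form5 = y ~ 1 + f(x1.group, model = "rw2", hyper = hyper.iid)
fit7 = inla(form5, family = "gaussian", data = data2)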