This topic shows how I do a first pass at modelling data with many covariates and/or factors. We use strong shrinkage priors on the coefficients, which correspond to ridge regression penalties. For factors we use iid effects, which cope both with factor levels that have few observations and with factors that have many different levels.
We simulate synthetic data instead of using real data.
We can do a pre-selection first, especially if there are many (say \(>100\)) covariates. There are many different approaches to pre-selection. I will not show any of these methods in this topic, but I mention them here.
In the pre-selection, we have to be careful that we do not delete covariates that are actually important.
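To make the idea concrete, here is a minimal sketch of one simple screening heuristic, ranking covariates by absolute marginal correlation with the response. This is only an illustration, not a recommendation of any particular method; the names keep.k and X.screened are just placeholders.

## Keep the keep.k covariates with the largest |cor(x, y)|
keep.k = 50
marg.cor = abs(cor(X, y))
keep = order(marg.cor, decreasing = TRUE)[1:min(keep.k, ncol(X))]
X.screened = X[, keep, drop = FALSE]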
Another point I mention for completeness is that covariates should often be transformed, especially when using linear models (as in this topic). It is often good to try several transformations. Among the transformations I use, the one I show here is the empirical quantile transform.
## Empirical quantile transform
x = c(0:10, 20:21, 100:105, 1000)
qtx = ecdf(x)
## The transformed variable:
qtx(x)
## [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65
## [14] 0.70 0.75 0.80 0.85 0.90 0.95 1.00
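Note that ecdf() returns a function, so the same transform can be applied to new values (e.g. held-out data); values below the observed minimum map to 0 and values above the maximum map to 1:

## Apply the same transform to new values
qtx(c(-5, 15, 500, 2000))
## [1] 0.00 0.55 0.95 1.00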
library(INLA)
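## Use a single thread so results are reproducible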
inla.setOption("num.threads", 1)
set.seed(2021)
We have many covariates and many observations; many of the covariates have no effect, or a negligible effect, and many of them are correlated. This is meant to be a good example of a typical, difficult real-data case.
You can change these to try different things.
## Number of observations
n.obs = 5E3
## The Gaussian noise
sig.epsilon = 1.00
## Covariates
n.cov.important = 7
n.cov.noeffect = 20
## Fraction of new variable that is taken from some previous variable
## 0 gives uncorrelated, infinity gives perfectly correlated
corr.fraction = 0.7
## How many factors
## Please do not change this number (that requires further code changes)
n.factors = 6
## How many factor levels, at most
n.fact.levels = 7
## Typical effect size of important factor level
effect.size.fact = 2.10
We simulate covariates from a normal distribution. Often the real distribution is not so nice at all! But, worst case, we can apply an empirical quantile transform and get nice marginal behaviour.
## Matrix of covariates
## Build this column by column
X = cbind(rnorm(n.obs))
## What previous variable are you correlated with?
corr.with = rep(NA, n.cov.important + n.cov.noeffect)
for (i in 2:(n.cov.important+n.cov.noeffect)) {
## Choose a previous covariate to be correlated with
corr.with[i] = sample(1:(i-1), 1)
## Create a new covariate
new = (corr.fraction*X[, corr.with[i]] + rnorm(n.obs))/corr.fraction
## Add a column to the matrix of covariates
X = cbind(X, new)
}
colnames(X) = paste0("x", 1:ncol(X))
A brief look at the variables and correlations:
summary(X[, 1:5])
## x1 x2 x3 x4
## Min. :-3.9 Min. :-6.4 Min. :-5.7 Min. :-7.7
## 1st Qu.:-0.7 1st Qu.:-1.2 1st Qu.:-1.1 1st Qu.:-1.5
## Median : 0.0 Median : 0.0 Median : 0.0 Median : 0.0
## Mean : 0.0 Mean : 0.0 Mean : 0.0 Mean : 0.0
## 3rd Qu.: 0.7 3rd Qu.: 1.2 3rd Qu.: 1.2 3rd Qu.: 1.5
## Max. : 3.7 Max. : 6.5 Max. : 6.5 Max. : 8.1
## x5
## Min. :-7.9
## 1st Qu.:-1.2
## Median : 0.0
## Mean : 0.0
## 3rd Qu.: 1.2
## Max. : 6.1
cor(X[, 1:5])
## x1 x2 x3 x4 x5
## x1 1.00 0.58 0.59 0.45 0.58
## x2 0.58 1.00 0.33 0.78 0.35
## x3 0.59 0.33 1.00 0.25 0.34
## x4 0.45 0.78 0.25 1.00 0.27
## x5 0.58 0.35 0.34 0.27 1.00
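As a small sanity check: x2 is built from x1, which is standard normal, and the construction gives new = parent + noise/corr.fraction, so the implied correlation with the parent is \(1/\sqrt{1 + 1/\text{corr.fraction}^2}\). (Later covariates have parents with larger variance, so their correlations differ.)

## Implied correlation between x1 and x2 from the construction
1/sqrt(1 + 1/corr.fraction^2)
## about 0.57, close to the observed 0.58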
Simulating factors is slightly harder, since each has several factor levels, and each factor level can have a different effect (on the linear predictor / response).
## Simulate a factor with n.fact.levels factor levels
new = sample(1:n.fact.levels, size=n.obs, replace = T)
## The factor (as strings)
new = paste0("F", 1, "level", new)
## Create the first column of the matrix of factors
FF = cbind(new)
## The effect of the factor
FFeff = cbind(rep(0, n.obs))
## Each factor has an effect level and an effect size
## Effect level gives which of the factor levels has a nonzero effect
effect.level = rep(NA, n.factors)
## Effect size gives how large the effect for the nonzero factor is
## The first factor has an effect of 0
effect.size = rep(NA, n.factors)
for (i in 2:(n.factors)) {
## Choose how many factor levels for this factor
tmp.levels = sample(2:n.fact.levels, 1)
## Create a new factor
new = sample(1:tmp.levels, size=n.obs, replace = T)
## Create its effect
effect.level[i] = sample(1:tmp.levels, 1)
effect.size[i] = rnorm(1)*effect.size.fact
new.effect = (new==effect.level[i])*effect.size[i]
## Create the column for the new factor and add it to the matrix
new = paste0("F", i, "level", new)
FF = cbind(FF, new)
## Add the new effect column to the matrix
FFeff = cbind(FFeff, new.effect)
}
colnames(FF) = paste0("fa", 1:ncol(FF))
colnames(FFeff) = paste0("fa", 1:ncol(FF), "eff")
Check that the factors look ok:
FF[1:8, 2:6]
## fa2 fa3 fa4 fa5 fa6
## [1,] "F2level2" "F3level7" "F4level4" "F5level4" "F6level4"
## [2,] "F2level1" "F3level4" "F4level4" "F5level6" "F6level4"
## [3,] "F2level2" "F3level2" "F4level5" "F5level7" "F6level2"
## [4,] "F2level2" "F3level5" "F4level5" "F5level7" "F6level3"
## [5,] "F2level2" "F3level5" "F4level2" "F5level1" "F6level2"
## [6,] "F2level2" "F3level3" "F4level5" "F5level3" "F6level2"
## [7,] "F2level1" "F3level2" "F4level5" "F5level1" "F6level4"
## [8,] "F2level1" "F3level2" "F4level1" "F5level2" "F6level2"
FFeff[1:8, 2:6]
## fa2eff fa3eff fa4eff fa5eff fa6eff
## [1,] 2.3 0.0 -0.25 0.0 0.00
## [2,] 0.0 3.1 -0.25 0.0 0.00
## [3,] 2.3 0.0 0.00 0.0 0.00
## [4,] 2.3 0.0 0.00 0.0 -0.46
## [5,] 2.3 0.0 0.00 0.0 0.00
## [6,] 2.3 0.0 0.00 1.5 0.00
## [7,] 0.0 0.0 0.00 0.0 0.00
## [8,] 0.0 0.0 0.00 0.0 0.00
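We can also check that each factor's nonzero effect is attached to a single level; for example, for fa2:

## Cross-tabulate the levels of fa2 against its effect values
table(FF[, "fa2"], round(FFeff[, "fa2eff"], 2))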
First we simulate the beta coefficients.
## 2 alternatives: no effect at all, or negligible effect
betas = c(abs(rnorm(n.cov.important))+0.5, rnorm(n.cov.noeffect)*0.000) * 0.1
# betas = c(abs(rnorm(n.cov.important))+0.5, rnorm(n.cov.noeffect)*0.001) * 0.1
gaussian.error = rnorm(n.obs, 0, sd=sig.epsilon)
## Compute the response from the simulated values
eta1 = X %*% betas
eta2 = rowSums(FFeff)
y = eta1 + eta2 + gaussian.error
Compare the standard deviations of the different components. If the error standard deviation is high compared to those of eta1 and eta2, we have little information, and vice versa.
sd(eta1)
## [1] 1.4
sd(eta2)
## [1] 1.7
sd(gaussian.error)
## [1] 0.98
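Since eta1 and eta2 are simulated independently, a rough signal-to-noise ratio follows from the standard deviations above; a quick check:

## Rough signal-to-noise ratio of the simulated data
sd(eta1 + eta2)/sd(gaussian.error)
## around 2.2, given the sds printed above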
The following formula and inla call make it very easy to specify models with many covariates. More commonly, people write all the covariate names into the formula, but that is inconvenient and error-prone.
form1 = y ~ 1 + X
fit1 = inla(form1, family="gaussian", data = list(y=y, X=X))
## Look at the intercept and the first 7 fixed effects
fit1$summary.fixed[1:8, 1:2]
## mean sd
## (Intercept) 1.635 0.028
## x1 0.132 0.051
## x2 0.095 0.052
## x3 0.084 0.034
## x4 0.115 0.028
## x5 0.275 0.034
## x6 0.261 0.027
## x7 0.064 0.028
## How far is the estimate from the truth?
betas - fit1$summary.fixed$mean[-1]
## [1] 0.00066 -0.03476 -0.02260 0.04595 -0.02337 -0.01539 0.00934
## [8] -0.00746 -0.05116 -0.02132 0.01839 0.02465 -0.03092 0.01414
## [15] 0.00974 0.01200 0.00262 0.01497 -0.00138 0.01461 -0.01416
## [22] 0.03253 -0.00291 -0.00549 0.01519 -0.01298 0.01940
## Total RMSE
sqrt(sum((betas - fit1$summary.fixed$mean[-1])^2))
## [1] 0.11
## Is the estimate within two std errors?
table((abs(betas - fit1$summary.fixed$mean[-1]))<fit1$summary.fixed$sd[-1]*2)
##
## TRUE
## 27
## Quantiles of the model sd
quantile(fit1$summary.fixed$sd[-1])
## 0% 25% 50% 75% 100%
## 0.019 0.020 0.027 0.028 0.052
We fit the same model as before, but we put a vague \(\mathcal N(0, 1)\) prior on all the betas.
fit2 = inla(form1, family="gaussian", data = list(y=y, X=X),
control.fixed = list(prec=1))
## Total RMSE
sqrt(sum((betas - fit2$summary.fixed$mean[-1])^2))
## [1] 0.11
## Is the estimate within two std errors?
table((abs(betas - fit2$summary.fixed$mean[-1]))<fit2$summary.fixed$sd[-1]*2)
##
## TRUE
## 27
## Quantiles of the model sd
quantile(fit2$summary.fixed$sd[-1])
## 0% 25% 50% 75% 100%
## 0.019 0.020 0.027 0.028 0.051
We fit the same model as before, but we put a strong \(\mathcal N(0, 0.1^2)\) prior on all the betas.
fit3 = inla(form1, family="gaussian", data = list(y=y, X=X),
control.fixed = list(prec=100))
## Total RMSE
sqrt(sum((betas - fit3$summary.fixed$mean[-1])^2))
## [1] 0.1
## Is the estimate within two std errors?
table((abs(betas - fit3$summary.fixed$mean[-1]))<fit3$summary.fixed$sd[-1]*2)
##
## TRUE
## 27
## Quantiles of the model sd
quantile(fit3$summary.fixed$sd[-1])
## 0% 25% 50% 75% 100%
## 0.019 0.019 0.026 0.026 0.045
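To see the effect of the increasingly strong priors side by side, we can collect the three RMSE values (this just repeats the computations above):

## RMSE of the estimated betas under the three priors
c(fit1 = sqrt(sum((betas - fit1$summary.fixed$mean[-1])^2)),
  fit2 = sqrt(sum((betas - fit2$summary.fixed$mean[-1])^2)),
  fit3 = sqrt(sum((betas - fit3$summary.fixed$mean[-1])^2)))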
We now proceed to using the factors instead of the covariates. We will add back the covariates at the end.
## Fixing the hyperparameter of an iid effect corresponds to
## control.fixed = list(prec="what you fix it to") for a fixed effect
## theta1 is the log precision
hyper.fix = list(theta1 = list(initial=log(1), fixed=T))
## Dynamic hyperparameter: a PC prior on the precision such that
## the prior median of the standard deviation is 0.1, i.e. P(sd > 0.1) = 0.5
hyper.iid = list(theta1 = list(prior="pc.prec", param=c(0.1, 0.5)))
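To see what this prior implies, we can sample from it; assuming your INLA version provides the PC precision prior helpers (inla.pc.rprec and friends), a quick check:

## Sample precisions from the PC prior and look at the implied sd:
## the prior is set so that P(sd > 0.1) = 0.5
tau = inla.pc.rprec(n = 10000, u = 0.1, alpha = 0.5)
median(1/sqrt(tau))
## close to 0.1 by construction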
In these formulas we refer to the factors by name, so they need to be available directly in the data.
data1 = as.list(as.data.frame(FF))
data1$y = drop(y)
str(data1)
## List of 7
## $ fa1: chr [1:5000] "F1level1" "F1level1" "F1level4" "F1level1" ...
## $ fa2: chr [1:5000] "F2level2" "F2level1" "F2level2" "F2level2" ...
## $ fa3: chr [1:5000] "F3level7" "F3level4" "F3level2" "F3level5" ...
## $ fa4: chr [1:5000] "F4level4" "F4level4" "F4level5" "F4level5" ...
## $ fa5: chr [1:5000] "F5level4" "F5level6" "F5level7" "F5level7" ...
## $ fa6: chr [1:5000] "F6level4" "F6level4" "F6level2" "F6level3" ...
## $ y : num [1:5000] 2.614 1.819 1.887 2.611 0.632 ...
For the factors, we write out all the names explicitly in the code. If you change the number of factors under “Input”, you also have to change the following formulas (a programmatic alternative is sketched after the fit below).
form2 = y ~ 1 +
    f(fa1, model="iid", hyper=hyper.fix) +
    f(fa2, model="iid", hyper=hyper.fix) +
    f(fa3, model="iid", hyper=hyper.fix) +
    f(fa4, model="iid", hyper=hyper.fix) +
    f(fa5, model="iid", hyper=hyper.fix) +
    f(fa6, model="iid", hyper=hyper.fix)
fit4 = inla(form2, family="gaussian", data = data1)
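If you change n.factors often, the formula can also be built programmatically; a minimal sketch using base R (form2b is just an illustrative name):

## Build the same formula as form2 from n.factors, so it adapts
## automatically when the number of factors changes
fact.terms = paste0('f(fa', 1:n.factors, ', model="iid", hyper=hyper.fix)')
form2b = as.formula(paste("y ~ 1 +", paste(fact.terms, collapse = " + ")))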
We look at the different factor levels.
fit4$summary.random$fa1[, 1:3]
## ID mean sd
## 1 F1level1 -0.112 0.38
## 2 F1level2 0.042 0.38
## 3 F1level3 0.035 0.38
## 4 F1level4 -0.046 0.38
## 5 F1level5 0.065 0.38
## 6 F1level6 -0.052 0.38
## 7 F1level7 0.070 0.38
print(paste("Effect on level", effect.level[2], "is", effect.size[2],
"compared to baseline 0"))
## [1] "Effect on level 2 is 2.32970852456796 compared to baseline 0"
fit4$summary.random$fa2[, 1:3]
## ID mean sd
## 1 F2level1 -1.2 0.71
## 2 F2level2 1.2 0.71
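The iid effects are centred around zero rather than pinned to a baseline level, so the right comparison is the contrast between the two levels:

## Estimated contrast between F2level2 and F2level1
diff(fit4$summary.random$fa2$mean)
## about 2.4, close to the true effect of 2.33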
This will run slower than the previous model, since it optimises the hyperparameters (int.strategy="eb" is empirical Bayes: the hyperparameters are fixed at their posterior mode instead of being integrated over). Inference with a factor that does nothing (fa1) often gives some numerical problems, so if we get some warnings, that is expected and should still be ok. For real data, you might remove factors that have no effect, like fa1 here, and then you should no longer get warnings.
form3 = y ~ 1 +
    f(fa1, model="iid", hyper=hyper.iid) +
    f(fa2, model="iid", hyper=hyper.iid) +
    f(fa3, model="iid", hyper=hyper.iid) +
    f(fa4, model="iid", hyper=hyper.iid) +
    f(fa5, model="iid", hyper=hyper.iid) +
    f(fa6, model="iid", hyper=hyper.iid)
fit5 = inla(form3, family="gaussian", data = data1,
control.inla = list(int.strategy="eb"))
## We see that the precision for fa1 is very big, so there is little effect of fa1
fit5$summary.hyperpar[, 1:2]
## mean sd
## Precision for the Gaussian observations 0.34 6.9e-03
## Precision for fa1 267.37 3.2e+02
## Precision for fa2 1.98 1.4e+00
## Precision for fa3 1.33 6.8e-01
## Precision for fa4 351.40 7.4e+02
## Precision for fa5 4.28 2.0e+00
## Precision for fa6 26.27 2.0e+01
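Precisions are hard to read directly. A rough translation to standard deviations makes the comparison easier (approximate, since transforming the posterior mean is not the same as the posterior mean of the transformation):

## Rough implied sd for each iid factor effect
1/sqrt(fit5$summary.hyperpar$mean[-1])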
## The following uncertainties are much better than in model 4
fit5$summary.random$fa1[, 1:3]
## ID mean sd
## 1 F1level1 -0.070 0.057
## 2 F1level2 0.025 0.057
## 3 F1level3 0.023 0.056
## 4 F1level4 -0.029 0.057
## 5 F1level5 0.040 0.056
## 6 F1level6 -0.033 0.057
## 7 F1level7 0.044 0.057
## The following uncertainties are better than in model 4
fit5$summary.random$fa2[, 1:3]
## ID mean sd
## 1 F2level1 -1.2 0.61
## 2 F2level2 1.2 0.61
We fit all the covariates and factors at the same time, with dynamic priors for the factors. The comments about numerical problems and warnings from the last section apply here too.
form4 = y ~ 1 + X +
    f(fa1, model="iid", hyper=hyper.iid) +
    f(fa2, model="iid", hyper=hyper.iid) +
    f(fa3, model="iid", hyper=hyper.iid) +
    f(fa4, model="iid", hyper=hyper.iid) +
    f(fa5, model="iid", hyper=hyper.iid) +
    f(fa6, model="iid", hyper=hyper.iid)
fit6 = inla(form4, family="gaussian", data = c(data1, list(X=X)),
control.inla = list(int.strategy="eb"),
control.fixed = list(prec=100))
Continuous covariates:
## Total RMSE
sqrt(sum((betas - fit6$summary.fixed$mean[-1])^2))
## [1] 0.045
## Is the estimate within two std errors?
table((abs(betas - fit6$summary.fixed$mean[-1]))<fit6$summary.fixed$sd[-1]*2)
##
## FALSE TRUE
## 1 26
## Quantiles of the model sd
quantile(fit6$summary.fixed$sd[-1])
## 0% 25% 50% 75% 100%
## 0.0096 0.0097 0.0133 0.0136 0.0248
Factors:
print(paste("Effect on level", effect.level[2], "is", effect.size[2],
"compared to baseline 0"))
## [1] "Effect on level 2 is 2.32970852456796 compared to baseline 0"
fit6$summary.random$fa2[, 1:3]
## ID mean sd
## 1 F2level1 -1.2 0.83
## 2 F2level2 1.2 0.83
This was the first round of modelling. We can look at the results and find ways to proceed.