
Variable importance in random forests
20160222
SourceConsider we run a random forest on \(n\) independent realizations of a random vector \((X_1,X_2,X_3,Y)\), assuming \(Y\) is a numerical response variable.
The theoretical classifier is the function \(f\) such that \[ E[Y\mid X_1, X_2, X_3]=f(X_1, X_2, X_3). \]
A random forest also returns a socalled variable importance \(\hat I_j\) for each predicting variable \(X_j\). I am going to explain what is the variable importance.
I take \(j=1\) not to be invaded by the notations. In the paper “Correlation and variable importance in random forests” (Gregorutti & al), it is shown that the variable importance \(\hat I_1\) of \(X_1\) goes to the population variable importance \[ I_1 = E\left[{\bigl(Yf(X'_1,X_2,X_3)\bigr)}^2\right]  E\left[{\bigl(Yf(X_1,X_2,X_3)\bigr)}^2\right] \] where \(X'_1\) is a random variable having the same distribution as \(X_1\) but is independent of all other random variables \(X_2,X_3,Y\). It is always a nonnegative number. Roughly speaking, \(X_1\) has a high importance \(I_1\) if the prediction error has a high increase when one breaks the link between \(X_1\) and \(Y\).
The convergence \(\hat I_j \to I_j\) shown by Gregorutti & al was an expected result. The variable importance \(\hat I_j\) of the \(j\)th predictor \(X_j\) is defined as follows. For each tree \(t\) of the random forest, there’s a classifier \(\hat f_t\). The mean squared error \(MSE_t\) in tree \(t\) is the mean squared prediction error in the outofbag (OOB) sample of tree \(t\). The \(j\)th perturbed mean squared error \(MSE_t^{(j)}\) is defined similarly after randomly permuting the values of \(j\)th variable in the OOB sample. Finally \(\hat I_j\) is defined as the average difference \(\overline{MSE}^{(j)}  \overline{MSE}\) over all trees.
Let us check this convergence with the
randomForest
package. I will take \[ f(X_1, X_2, X_3) = X_1 + X_2X_3. \] This function \(f\) is of the form \[ f(X_1, X_2, X_3) = f_1(X_1) + f_{23}(X_2X_3). \] It is shown in Gregorutti & al’s paper that \(I_1= 2Var(f_1(X_1))\) in such a case.Before running the random forest, we will check this result and evaluate \(I_2\) and \(I_3\) with the help of simulations. The distribution of my random vector \((X_1,X_2,X_3,Y)\) can be seen on these simulations:
set.seed(666) N < 25000 x1 < rnorm(N) x2 < x1 + rnorm(N) x3 < x1 + x2 + rnorm(N) f < function(x1, x2, x3) x1 + x2*x3 sigma < 1 y < f(x1, x2, x3) + rnorm(N, sd=sigma)
Thus \(E\left[{\bigl(Yf(X_1,X_2,X_3)\bigr)}^2\right] = \sigma^2\) with \(\sigma=1\).
The evaluation of \(I_1\) with the help of simulations indeed returns a result close to \(2Var(X_1)=2\):
xx1 < rnorm(N) ( I1 < mean((yf(xx1,x2,x3))^2)  sigma^2 ) ## [1] 1.989307
One finds \(I_2 \approx I_3 \approx 40\):
xx2 < xx1 + rnorm(N) ( I2 < mean((yf(x1,xx2,x3))^2)  sigma^2 ) ## [1] 40.33887 xx3 < xx1 + xx2 + rnorm(N) ( I3 < mean((yf(x1,x2,xx3))^2)  sigma^2 ) ## [1] 41.33211
Now let us try the random forest on the first \(n\) simulations with \(n=300\).
XY < data.frame(x1, x2, x3, y)[1:300, ]
I firstly tune the
mtry
parameter with thetrain
function of thecaret
package. Recall thatmtry
is the number of variables selected at random for the splitting in each tree. Here it can be \(2\) or \(3\) since there are only \(3\) predictors.library(caret) ( training < train(x=XY[,1:3], y=XY$y, tuneLength=2) ) ## Random Forest ## ## 300 samples ## 3 predictors ## ## No preprocessing ## Resampling: Bootstrapped (25 reps) ## Summary of sample sizes: 300, 300, 300, 300, 300, 300, ... ## Resampling results across tuning parameters: ## ## mtry RMSE Rsquared RMSE SD Rsquared SD ## 2 1.898910 0.8896097 0.5516993 0.04221964 ## 3 1.933819 0.8817003 0.5185210 0.04466596 ## ## RMSE was used to select the optimal model using the smallest value. ## The final value used for the model was mtry = 2.
The selected value of
mtry
was \(2\). Now I run the random forest, requiring the importances with the optionimportance=TRUE
:library(randomForest) rf < randomForest(y ~ ., data=XY, importance=TRUE, mtry=training$finalModel$tuneValue$mtry)
The variable importances \(\hat I_j\) are provided by the
importance
function as follows:importance(rf, type=1, scale=FALSE) ## %IncMSE ## x1 1.241496 ## x2 22.753704 ## x3 20.228597
We could conclude they are not very close to their theoretical values. But the ratios \(\hat I_j / \hat I_{j'}\) are rather good estimates of their theoretical values.