The Forest Black Box
Random forests are often considered black boxes, but recently I was wondering: what knowledge can be extracted from a random forest? The most obvious thing is the importance of the variables; in the simplest variant, this can be done just by counting the number of occurrences of each variable.
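A minimal sketch of this counting idea, assuming the randomForest R package (the iris data and all settings below are purely illustrative): varUsed reports how often each predictor is chosen as a split variable across the forest, which gives a crude occurrence-based importance ranking.

```r
library(randomForest)
data(iris)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# varUsed counts how many times each predictor is used for splitting
counts <- varUsed(rf, by.tree = FALSE, count = TRUE)
names(counts) <- names(iris)[1:4]
sort(counts, decreasing = TRUE)  # crude occurrence-based importance ranking
```

Note this is not the same as the permutation or Gini importance that randomForest also provides; it is only the raw occurrence count described above.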
The second thing I was thinking about is interactions. I think that if the number of trees is sufficiently large, then the number of occurrences of pairs of variables can be tested (something like a chi-square test of independence). The third thing is nonlinearities of variables. My first idea was just to look at a chart of variable vs. score, but I'm not sure yet whether it makes any sense. Added motivation: I want to use this knowledge to improve a logit model. I think (or at least I hope) that it is possible to find interactions and nonlinearities that were overlooked.
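The chi-square idea can be sketched in base R. The per-tree occurrence indicators below are simulated for illustration (a real analysis would record, for each tree, whether each variable appears in it): build the 2x2 table of joint appearances for a pair of variables and test independence.

```r
# Sketch of the pair-occurrence test (simulated indicators, not a real forest).
set.seed(1)
n_trees <- 500

# hypothetical indicators: does variable A / variable B appear in tree t?
a <- rbinom(n_trees, 1, 0.6)
b <- ifelse(a == 1, rbinom(n_trees, 1, 0.8), rbinom(n_trees, 1, 0.3))

tab  <- table(A = a, B = b)   # 2x2 co-occurrence table over trees
test <- chisq.test(tab)
test$p.value                  # a small p-value suggests the two variables
                              # tend to appear together (or apart), hinting
                              # at a possible interaction
```

One caveat: variables that are individually important will co-occur often simply because both appear in most trees, so the test flags dependence of appearances, not necessarily a true interaction effect.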
@TomekTarczynski that's an interesting problem and similar to one I'm dealing with right now. I assume by 'logit model' you mean logistic regression or something similar? I'm using lasso logistic regression (from the glmnet R package) to select predictors from a model with interactions between all pairs of variables. I haven't added any nonlinear terms yet, but in principle that should be possible too. The only issue, I guess, is deciding what nonlinear terms to try (polynomial terms, exponential transforms, etc.). Also, I'm not picking up any higher-order interactions, but those would be easy to add too. – Jan 25 '12 at 13:23
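A minimal sketch of the glmnet approach this comment describes, on simulated data (the data-generating step and all settings are illustrative, not from the original post): expand all pairwise interactions with model.matrix and let the lasso select terms.

```r
library(glmnet)

# simulated binary outcome with a true X1:X2 interaction (illustration only)
set.seed(2)
n <- 500
X <- data.frame(replicate(4, rnorm(n)))
eta <- X$X1 + X$X2 + 2 * X$X1 * X$X2
y <- rbinom(n, 1, plogis(eta))

# ~ .^2 expands all main effects plus all pairwise interactions
mm <- model.matrix(~ .^2, data = X)[, -1]  # drop the intercept column

# lasso logistic regression; cv.glmnet picks lambda by cross-validation
fit <- cv.glmnet(mm, y, family = "binomial", alpha = 1)
coef(fit, s = "lambda.min")  # nonzero rows are the selected terms
```

Nonlinear terms could be added the same way, e.g. by appending poly(X1, 2) columns to the design matrix before fitting.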
@Tomek, what are you not getting from this answer? If you are using the randomForest package in R, then the plots Zach describes should be very useful. Specifically, you could use varImpPlot for feature selection in your logit model and partialPlot to estimate the type of transformation to try on continuous predictors in the logit model. I would suggest using the latter plot to determine where nonlinear relationships between predictor and response exist, which then allows you to make that transformation explicitly or to use a spline on that variable. – Jan 25 '12 at 14:14
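A short sketch of that workflow with the randomForest package (the airquality data and settings here are illustrative choices, not from the comment):

```r
library(randomForest)
data(airquality)
aq <- na.omit(airquality)

set.seed(3)
rf <- randomForest(Ozone ~ ., data = aq, ntree = 300, importance = TRUE)

# rank predictors worth carrying into the logit/regression model
varImpPlot(rf)

# marginal effect of one predictor; a clearly curved partial dependence
# suggests trying a transformation or a spline for that term
partialPlot(rf, pred.data = aq, x.var = "Temp")
```

The partialPlot curve is what you would eyeball to decide between, say, a log transform, a quadratic term, or a spline basis for that predictor.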
To supplement these fine responses, I would mention the use of gradient boosted trees (e.g., the gbm package in R). I prefer this to random forests because missing values are allowed, whereas randomForest requires imputation. Variable importance and partial plots are available (as in randomForest) to aid in feature selection and in exploring nonlinear transformations for your logit model. Further, variable interaction is addressed with Friedman's H-statistic (interact.gbm), with reference given as J.H. Friedman and B.E. Popescu (2005), "Predictive Learning via Rule Ensembles," Section 8.1. A commercial version called TreeNet is available from Salford Systems, and this video presentation speaks to their take on variable interaction estimation.
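A hedged sketch of the H-statistic computation with gbm (the simulated data and tuning values are illustrative only): interact.gbm returns a value near 0 when a pair of variables shows no interaction and larger values for stronger interactions.

```r
library(gbm)

# simulated data with a genuine x1:x2 interaction (illustration only)
set.seed(4)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- d$x1 * d$x2 + rnorm(n, sd = 0.1)

fit <- gbm(y ~ x1 + x2 + x3, data = d, distribution = "gaussian",
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
           bag.fraction = 0.8, verbose = FALSE)

# Friedman's H-statistic for two candidate pairs; the pair with the
# true interaction should score higher
h12 <- interact.gbm(fit, data = d, i.var = c("x1", "x2"), n.trees = 500)
h13 <- interact.gbm(fit, data = d, i.var = c("x1", "x3"), n.trees = 500)
c(h12 = h12, h13 = h13)
```

Pairs flagged this way are natural candidates for explicit interaction terms in the logit model.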
Late answer, but I came across a recent R package, forestFloor (2015), that helps you do this 'unblackboxing' task in an automated fashion. It looks very promising!
    library(forestFloor)
    library(randomForest)

    # simulate data
    obs  = 1000
    vars = 18
    X = data.frame(replicate(vars, rnorm(obs)))
    Y = with(X, X1^2 + sin(X2*pi) + 2 * X3 * X4 + 1 * rnorm(obs))

    # grow a forest; remember to include the inbag matrix
    rfo = randomForest(X, Y, keep.inbag = TRUE, sampsize = 250, ntree = 50)

    # compute the forest-floor topology
    ff = forestFloor(rfo, X)

    # ggPlotForestFloor(ff, 1:9)
    plot(ff, 1:9, col = fcol(ff))

This produces the following plots. The package also provides three-dimensional visualizations if you are looking for interactions.

As mentioned by Zach, one way of understanding a model is to plot the response as the predictors vary. You can do this easily for 'any' model with the R package.
For example library(randomForest) data.

I'm very interested in these types of questions myself. I do think there is a lot of information we can get out of a random forest. Regarding interactions, it seems that attempts have already been made to look at this, especially for classification RFs. To my knowledge, this has not been implemented in the randomForest R package.