cv.nfeaturesLDA()

This function illustrates how k-fold cross-validation can be used to find the optimum number of variables in a linear discriminant analysis (LDA).

For a classification problem, we usually wish to use as few variables as possible because of the difficulties that high dimensionality brings.

The selection procedure is as follows:

1. Split the whole data randomly into \(k\) folds.
2. For each number of features \(g = 1, 2, \cdots, g_{max}\), choose the \(g\) features that have the largest discriminatory power (measured by the F-statistic in ANOVA):
   a. For each fold \(i\) ( \(i = 1, 2, \cdots, k\) ), train an LDA model without the \(i\)-th fold data, and predict on the \(i\)-th fold to obtain the proportion of correct predictions \(p_{gi}\);
   b. Average the \(k\) proportions to get the correct rate \(p_g\).
3. Determine the optimum number of features as the \(g\) with the largest \(p_g\).
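
The procedure above can be sketched in a few lines of plain R, without the animation. This is only a minimal illustration, not the code used inside cv.nfeaturesLDA(): the helper name cv_nfeatures_sketch() and its arguments are hypothetical, the iris data is used purely as an example, and ranking the features on the training folds only is an assumption, since the description does not say which data the F-statistics are computed on.

```r
library(MASS)  # provides lda()

# Hypothetical helper, not part of the animation package
cv_nfeatures_sketch <- function(x, y, k = 5, gmax = ncol(x)) {
  x <- as.matrix(x)
  y <- as.factor(y)
  # Randomly assign each observation to one of k folds
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))
  # For each fold, rank the features by their one-way ANOVA F-statistic,
  # computed on the training data only (an assumption of this sketch)
  ranked <- lapply(seq_len(k), function(i) {
    train <- fold != i
    ytr <- y[train]
    fstat <- apply(x[train, , drop = FALSE], 2, function(v) {
      anova(lm(v ~ ytr))[1, "F value"]
    })
    order(fstat, decreasing = TRUE)
  })
  acc <- matrix(NA_real_, nrow = k, ncol = gmax)  # acc[i, g] holds p_gi
  for (g in seq_len(gmax)) {
    for (i in seq_len(k)) {
      train <- fold != i
      vars <- ranked[[i]][seq_len(g)]
      # Train without the i-th fold, then predict on the i-th fold
      fit <- lda(x[train, vars, drop = FALSE], grouping = y[train])
      pred <- predict(fit, x[!train, vars, drop = FALSE])$class
      acc[i, g] <- mean(pred == y[!train])
    }
  }
  p <- colMeans(acc)  # correct rate p_g for each g
  list(accuracy = acc, rate = p, optimum = which.max(p))
}

set.seed(1)
res <- cv_nfeatures_sketch(iris[, 1:4], iris$Species, k = 5, gmax = 4)
res$rate     # average proportion of correct predictions for g = 1, ..., 4
res$optimum  # the g with the largest average rate
```

Because the folds are assigned at random, the selected number of features can change from run to run unless the seed is fixed.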

Note that \(g_{max}\) is set by ani.options('nmax') (i.e. the maximum number of features we want to choose).

library(animation)
ani.options(nmax = 10)
par(mar = c(3, 3, 0.2, 0.7), mgp = c(1.5, 0.5, 0))
cv.nfeaturesLDA(pch = 19)

## Loading required namespace: MASS


This animation illustrates how k-fold cross-validation can be used to find the optimum number of variables in a linear discriminant analysis (LDA).

Cross-validation to find the optimum number of features (variables) in LDA

Yihui Xie & Lijia Yu 2017-04-04