Improved Classification with SuperLearner
Imagine a jar of marbles, and each person is asked to guess how many marbles it contains. Some individuals will guess over and some will guess under. What happens if you average the guesses? As technology has advanced, so have statistical models. There are now many choices of model for classification, including logistic regression, support vector machines, discriminant analysis, and classification trees. The difficulty lies in deciding which model to use and which tuning parameters to select for it. One solution that may improve classification rates is to combine all of the predictions into a single, improved prediction (Steinki and Mohammad 2015). This is the basic idea behind ensemble modeling.

In class, we discussed a variety of ensemble methods, including model averaging, bagging, random forests, boosting, bumping, and stacking. The focus of this report is stacking. I will introduce an R package, SuperLearner, fit a few base models using classification methods we learned in class, and evaluate the area under the curve (AUC) of the receiver operating characteristic (ROC) for each. An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across the possible cutpoints of a diagnostic test, showing the tradeoff between sensitivity and specificity. To compare models, we calculate the AUC; a higher AUC indicates a better fit. After fitting the base models, I will combine the base learners into a super learner and compare the final model to the base learners.
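The workflow described above can be sketched with a few lines of R. This is a minimal illustration, not the analysis from this report: the data here are simulated placeholders, and the learner library is an assumed example set of wrappers shipped with the SuperLearner package.

```r
# Minimal sketch of a stacked (super learner) fit, assuming a binary
# outcome Y and a data frame of predictors X. The data below are
# simulated for illustration only.
library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(X$x1 - X$x2))

# Example library of base learners: logistic regression, random
# forest, and a simple mean benchmark.
sl.lib <- c("SL.glm", "SL.randomForest", "SL.mean")

# Fit the super learner; method = "method.AUC" (which requires the
# cvAUC package) weights the base learners to maximize
# cross-validated AUC.
fit <- SuperLearner(Y = Y, X = X, family = binomial(),
                    SL.library = sl.lib, method = "method.AUC")

fit$coef          # weight assigned to each base learner
head(fit$SL.predict)  # ensemble predicted probabilities
```

The fitted object also reports each base learner's own cross-validated risk, which is what makes the comparison between the individual learners and the combined super learner straightforward.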