Recently Published
Support Vector Machines
Support Vector Machines provided a robust method for classifying purchase behavior. While the linear kernel was simple and interpretable, the radial kernel captured more complex relationships in the data. Cross-validation helped prevent overfitting and select optimal model parameters.
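The post itself does not include code, but the contrast between the two kernels can be sketched in a few lines. This is a minimal pure-Python illustration (function names are mine, not from the original analysis): the linear kernel is just an inner product, while the radial (RBF) kernel turns squared distance into a similarity that decays smoothly, which is what lets the SVM fit non-linear boundaries.

```python
import math

def linear_kernel(x, z):
    # Inner product: similarity grows with alignment of the two vectors.
    return sum(xi * zi for xi, zi in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): similarity decays with distance,
    # letting the SVM capture more complex, non-linear relationships.
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

a, b = [1.0, 0.0], [0.0, 1.0]
print(linear_kernel(a, b))               # orthogonal points: 0.0
print(rbf_kernel(a, a))                  # identical points: 1.0
print(round(rbf_kernel(a, b), 4))        # exp(-2) ~ 0.1353
```

In practice `gamma` is exactly the kind of parameter the post tunes by cross-validation: larger values make the kernel more local and the fit more flexible.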
Tree-Based Methods
Tree-Based Methods in R
HW 5
Linear model selection involves identifying a subset of predictors that best explains the response variable, balancing model complexity and predictive accuracy. Methods like best subset selection, forward selection, and backward elimination help choose the most relevant variables. Regularization techniques such as ridge regression and the lasso improve model performance by introducing a penalty on the size of coefficients to reduce overfitting. While ridge shrinks all coefficients toward zero, the lasso can force some to be exactly zero, thus performing variable selection as well.
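The ridge-versus-lasso contrast in the last sentence has a clean closed form in the special case of orthonormal predictors, which makes it easy to demonstrate. A small sketch under that assumption (not code from the original post): ridge rescales each least-squares coefficient toward zero, while the lasso soft-thresholds it and can set it exactly to zero.

```python
def ridge_shrink(beta, lam):
    # Orthonormal-design ridge: scales toward zero, never exactly zero.
    return beta / (1.0 + lam)

def lasso_shrink(beta, lam):
    # Orthonormal-design lasso: soft-thresholding, can hit exactly zero,
    # which is why the lasso performs variable selection.
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

betas = [3.0, 0.4, -2.0]
print([round(ridge_shrink(b, 1.0), 2) for b in betas])  # [1.5, 0.2, -1.0]
print([lasso_shrink(b, 1.0) for b in betas])            # [2.0, 0.0, -1.0]
```

Note how the small coefficient 0.4 survives ridge (shrunk to 0.2) but is zeroed out by the lasso: that is the variable-selection behavior described above.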
Sampling Techniques
Sampling techniques in machine learning help evaluate model performance by partitioning data in different ways. **Leave-One-Out Cross-Validation (LOOCV)** trains on all but one observation and repeats this for each observation, while **k-Fold Cross-Validation** divides the data into k subsets, training on k-1 folds and testing on the remaining fold. The **Validation Approach (Train-Test Split)** randomly splits the data into training and testing sets, offering simplicity at the cost of higher variance. **Bootstrap Resampling** draws repeated samples with replacement to estimate model uncertainty; it is useful for small datasets, though its error estimates can be optimistically biased because bootstrap samples overlap heavily with the training data.
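Two of these resampling schemes can be sketched with nothing but the standard library. This is an illustrative pure-Python version (not code from the original post): `k_fold_indices` deals shuffled indices into k folds, and `bootstrap_sample` draws indices with replacement, so some observations repeat and others are left out.

```python
import random

def k_fold_indices(n, k, seed=0):
    # Shuffle the indices, then deal them into k roughly equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def bootstrap_sample(n, seed=0):
    # Draw n indices with replacement; duplicates are expected.
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

folds = k_fold_indices(10, 5)
print([len(f) for f in folds])   # [2, 2, 2, 2, 2]
boot = bootstrap_sample(10)
print(len(boot))                 # 10
```

LOOCV is just the k = n special case of `k_fold_indices`, and the validation approach is the k = 2 case where only one fold is ever used for testing.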
Linear Regression and Logistic Regression Models
We began by cleaning the data: handling missing values, removing outliers, and formatting categorical variables. Through exploratory data analysis, we used visualizations and summary statistics to understand variable distributions and relationships. We fit a linear regression model to predict a continuous outcome and evaluated it using metrics such as R-squared. For the logistic regression model, we predicted a binary outcome and interpreted the fitted coefficients as odds ratios.
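The linear-regression step above can be illustrated end to end for a single predictor, where least squares has a closed form. This is a pure-Python sketch with made-up data, not the project's actual model or dataset:

```python
def simple_linear_regression(xs, ys):
    # Least-squares fit of y = b0 + b1*x for one predictor.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

def r_squared(xs, ys, b0, b1):
    # Proportion of variance in y explained by the fitted line.
    my = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
b0, b1 = simple_linear_regression(xs, ys)
print(round(b1, 2), round(r_squared(xs, ys, b0, b1), 3))  # 1.94 0.996
```

For the logistic model, the analogous interpretation step is exponentiating a fitted coefficient (`math.exp(beta)`) to read it as an odds ratio.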
Classification
In this analysis, multiple classification models, including K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Generalized Linear Models (GLM), Quadratic Discriminant Analysis (QDA), and Naive Bayes, are applied to various datasets. The goal is to explore and compare the performance of these models in predicting categorical outcomes. The analysis involves data preprocessing, model training, and evaluation of prediction accuracy using metrics like confusion matrices. Each model's strengths and limitations are discussed in the context of the datasets used.
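KNN is the simplest of these classifiers to write from scratch, and a confusion matrix is just a table of actual-versus-predicted counts. A minimal pure-Python sketch of both (toy data and function names are mine, not from the analysis):

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Majority vote among the k nearest training points (squared
    # Euclidean distance; no need to take the square root for ranking).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def confusion_matrix(actual, predicted, labels):
    # Rows index the actual class, columns the predicted class.
    m = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

train_X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_y = ["A", "A", "B", "B"]
test_X = [[0.05, 0.1], [1.05, 0.9]]
preds = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
print(preds)  # ['A', 'B']
print(confusion_matrix(["A", "B"], preds, ["A", "B"]))
```

The off-diagonal entries of the confusion matrix are the misclassifications; comparing those counts across KNN, LDA, QDA, GLM, and Naive Bayes is how the accuracy comparison above is made.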