Recently Published
Data Smart, Chapter 3, Tweet Classification
Executive Summary
In chapter 3 of John Foreman's book Data Smart, the author develops a Naive Bayes classifier in Excel to determine whether tweets containing the word 'mandrill' relate to MailChimp's Mandrill transactional email app.
Whereas the author used Excel, we choose to use R's text mining package, tm, in order to take advantage of its automated text processing tools.
The book Machine Learning for Hackers [https://github.com/johnmyleswhite/ML_for_Hackers], by Drew Conway and John Myles White, is also a useful resource. Here we borrow elements of that book's approach (chapter 3) to email spam classification with the tm package.
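The Naive Bayes idea behind this write-up can be sketched in a few lines. The example below is plain Python rather than R's tm package. The tweets are made up, and it assumes equal class priors, add-one smoothing, and that tokens never seen in training are ignored. None of these choices are taken from the write-up or the book.

```python
import math
import string
from collections import Counter

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def train(docs_by_class):
    """Return per-class token counts and the shared vocabulary."""
    counts = {
        label: Counter(tok for doc in docs for tok in tokenize(doc))
        for label, docs in docs_by_class.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def log_score(tokens, counter, vocab):
    """Sum of log P(token | class) with add-one (Laplace) smoothing."""
    total = sum(counter.values()) + len(vocab)
    return sum(math.log((counter[tok] + 1) / total) for tok in tokens)

def classify(text, counts, vocab):
    """Pick the class with the highest log score (equal priors assumed)."""
    tokens = [tok for tok in tokenize(text) if tok in vocab]
    return max(counts, key=lambda label: log_score(tokens, counts[label], vocab))

# Hypothetical training tweets, for illustration only.
training = {
    "app": [
        "mandrill api email delivery is fast",
        "sending transactional email with mandrill api",
    ],
    "other": [
        "saw a mandrill at the zoo today",
        "the mandrill monkey has a colorful face",
    ],
}
counts, vocab = train(training)
print(classify("Mandrill email API outage", counts, vocab))
print(classify("mandrill monkey at the zoo", counts, vocab))
```

Working in log space avoids the numerical underflow that comes from multiplying many small probabilities.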
Data_Smart_Ch04_Optimization
In chapter 4 of John Foreman's book Data Smart, the author uses Excel's linear programming tool, Solver, to solve an optimization problem: minimizing the raw materials cost of a commercial orange juice blend. Consistent with this series, we solve the same problem in R using the lpSolve package.
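To make the shape of such a cost-minimization problem concrete, here is a toy two-variable sketch in plain Python. It is not what Solver or lpSolve do internally: it simply intersects every pair of constraint boundaries and keeps the cheapest feasible corner, which only works for tiny problems that are feasible and have a bounded minimum. The costs and constraints are invented for illustration.

```python
from itertools import combinations

def solve_2d_lp(cost, constraints, tol=1e-9):
    """Minimise cost[0]*x + cost[1]*y subject to a*x + b*y >= c for each (a, b, c).

    Brute-force vertex enumeration: returns (value, x, y) for the cheapest
    feasible corner, or None if no corner is feasible.
    """
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the intersection of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y >= c - tol for a, b, c in constraints):
            value = cost[0] * x + cost[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best

# Hypothetical blend: x and y are quantities of two juices.
cost = (3, 2)              # cost per unit of x and of y
constraints = [
    (1, 1, 10),            # x + y >= 10  (total demand)
    (1, 0, 4),             # x >= 4       (minimum share of juice x)
    (0, -1, -8),           # y <= 8       (limited supply of juice y)
    (0, 1, 0),             # y >= 0
]
value, x, y = solve_2d_lp(cost, constraints)
print(value, x, y)
```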
Data_Smart_Ch09_Local_Outliers
In chapter 9, the author explains and develops the Local Outlier Factor approach to identifying outliers in multi-dimensional data. He does this using Excel, as usual. The problem the author posed was to identify outliers among a group of 400 call center employees given their job performance data.
In chapter 10, he presents the R-code solution for this problem. It requires just five lines of code using the lofactor() function from the DMwR package. The solution shown is the same as the author's except for name changes.
I am posting the solution here because the Local Outlier Factor approach is such an interesting one and because it is effective and easy to comprehend. Enjoy!
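For readers who want to see what a Local Outlier Factor score is built from, the following is a minimal plain-Python sketch rather than the lofactor() call used in the write-up. It assumes Euclidean distance, distinct points, and exactly k neighbours per point (ties are broken by input order). The six points are invented.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of each point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point (excluding itself) and its k-distance.
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical data: a tight group of five points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

A point gets a high score when its neighbours sit in a much denser region than it does, which is why the approach copes with clusters of differing density.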
Data_Smart_Ch_08_Forecasting
Executive Summary
In chapter 8 of his book Data Smart, John Foreman turns to forecasting demand for a fictional replica sword manufacturing business. The author focuses on an exponential smoothing method that takes trend and seasonality into account (ETS), known as the Holt-Winters method.
The R code to generate the forecast is very concise, and the author provides it in chapter 10. We try to add value here by providing code for useful supporting and validation information, such as the autocorrelation function (ACF).
A fantastic online resource for time-series forecasting is the e-text "Forecasting: Principles and Practice" [https://www.otexts.org/book/fpp] by Prof. Rob Hyndman, the author of the R forecast package, and George Athanasopoulos.
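The update equations behind Holt-Winters smoothing are short enough to sketch directly. The plain-Python version below implements the additive variant with a deliberately naive initialisation (level set to the mean of the first season, trend set to zero). The variant, the initialisation, the smoothing parameters, and the demand figures are all assumptions made for illustration and may differ from what the book or the R forecast package use.

```python
def holt_winters_additive(series, period, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast for `horizon` steps past the end of `series`."""
    # Naive initialisation from the first full season.
    level = sum(series[:period]) / period
    trend = 0.0
    seasonal = [y - level for y in series[:period]]

    for t in range(period, len(series)):
        y = series[t]
        s = seasonal[t % period]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + (h + 1) * trend + seasonal[(n + h) % period]
        for h in range(horizon)
    ]

# Hypothetical quarterly demand with a repeating seasonal pattern.
demand = [10, 20, 30, 20] * 3
forecast = holt_winters_additive(demand, period=4, alpha=0.3, beta=0.1,
                                 gamma=0.2, horizon=4)
print([round(f, 2) for f in forecast])
```

In practice the three smoothing parameters are chosen by minimising forecast error on the history rather than set by hand.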
Data_Smart_Ch07_Ensemble_Modeling
Executive Summary
In chapter 7 of John Foreman's book, Data Smart, he again predicts the pregnancy status of Retail Mart's customers based on their shopping habits. This time he uses ensemble techniques, specifically bagging and boosting, to build his predictive models.
Since the author also provides the R code for a logistic regression and random forest solution in a later chapter of his book (chapter 10), we take a slightly different approach here. We use the well-known and powerful R caret predictive modeling framework to implement the bagging, boosting and random forest models.
The boosting model achieved a marginally better AUC score. More importantly, we see dramatic differences in computation time: the boosting model was more than 6 times faster than the random forest model and more than 13 times faster than the bagging model.
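The AUC score used to compare the models has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal plain-Python sketch of that definition follows. The labels and scores are invented, and the pairwise loop is only practical for small samples.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = pregnant) and model scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))
```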
Data Smart, Chapter 6, Logistic Regression
In chapter 6 of John Foreman's book Data Smart, the synthesized challenge is to predict which of a retailer's customers are pregnant based on a dataset of their shopping records.
A logistic regression model is used. The model is trained on the shopping records of 500 pregnant customers and 500 non-pregnant customers. The model is then tested on a dataset of 1000 different customers, each of whose pregnancy status is known.
Data Smart develops a logistic regression solution using Excel. We are free to use R to solve the problem, and we do so using the code here. The dataset is available at the book's website.
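As a rough illustration of what fitting a logistic regression involves, here is a plain-Python sketch that uses batch gradient descent on the log loss. The two binary features and the four training rows are hypothetical, and the learning rate and number of epochs are arbitrary choices. A real analysis would use R's glm() or a similar fitted routine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Predicted probability of the positive class for one feature row."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical features: [bought_prenatal_vitamins, bought_wine]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]   # 1 = pregnant, 0 = not pregnant
w, b = fit_logistic(X, y)
print(round(predict_proba([1, 0], w, b), 3), round(predict_proba([0, 1], w, b), 3))
```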
Data Smart, Chapter 5, Network Graphs and Community Detection
Chapter 5 of John Foreman's book Data Smart looks at data which can be arranged as a network graph of related data points. It uses a cluster analysis technique called Modularity Maximization to optimize cluster assignments for the graph data.
We can implement the same process succinctly in R, making use of functions in the R igraph and lsa packages.
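The quantity being maximised, modularity, compares the share of edges that fall inside communities with the share expected if edges were placed at random with the same node degrees. The plain-Python sketch below only scores a given assignment and does not search for the best one. It assumes an undirected, unweighted graph without self-loops, and the small graph is invented.

```python
from collections import Counter

def modularity(edges, community):
    """Modularity Q of a community assignment on an undirected, unweighted graph."""
    m = len(edges)
    internal = Counter()   # edges with both ends in the same community
    degree = Counter()     # total degree of the nodes in each community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the graph into its two triangles scores higher than lumping every node together, which is the behaviour a modularity-maximising routine exploits.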
Data Smart, Chapter 2, K-medians clustering
This is a walk-through of a customer segmentation process using R's 'skmeans' package to perform k-medians clustering. The dataset examined is that used in chapter 2 of John Foreman's book, Data Smart [http://www.wiley.com/WileyCDA/WileyTitle/productCd-111866146X.html].
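Segmentation of this kind typically compares customers by the cosine distance between their purchase vectors, so that customers who took up a similar mix of offers end up together regardless of how many offers they took. The plain-Python sketch below shows only the assignment step for fixed cluster centres. It is not the skmeans algorithm itself, and the purchase vectors and centres are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def assign_clusters(rows, centres):
    """Index of the nearest centre (by cosine distance) for each row."""
    return [
        min(range(len(centres)), key=lambda c: cosine_distance(row, centres[c]))
        for row in rows
    ]

# Hypothetical data: one row per customer, one column per offer (1 = took it).
customers = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
centres = [[1, 1, 0, 0], [0, 0, 1, 1]]
print(assign_clusters(customers, centres))
```

A full clustering run would alternate this assignment step with recomputing the centres until the assignments stop changing.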