Recently Published
STM1001 Lecture 6 (Data Science Stream)
This is the lecture for the Functions topic in the Data Science stream
Predicting Concrete Strength: A Multivariate and Logistic Regression Approach to Classifying Compressive Strength Outcomes
We wanted to know whether, given the ingredients of a concrete mix, we could accurately predict whether its resulting compressive strength would meet the industry standard of 4000 PSI.
First, we performed EDA to refine a multiple linear regression model. Then, we used that model together with an engineered binary term (above or below 4000 PSI) to train on a subset of our data and assessed the accuracy against a testing subset.
Our final model achieved 87% accuracy. Below, you can find our interpretation of this value.
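As a rough illustration of the engineered above/below term and the train/test split described above, here is a minimal Python sketch. It is not the code used in this analysis: the helper names (`meets_standard`, `train_test_split`), the choice of `>=` at the threshold, the 20% test fraction, and the strength values are all assumptions made for the example.

```python
import random

THRESHOLD_PSI = 4000  # threshold named in the text

def meets_standard(strength_psi, threshold=THRESHOLD_PSI):
    """Return 1 if the strength reaches the threshold, else 0.

    Treating exactly 4000 PSI as a pass (>=) is an assumption.
    """
    return 1 if strength_psi >= threshold else 0

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical strengths in PSI, for illustration only (not the study's data)
strengths = [2500, 3100, 3900, 4000, 4200, 4800, 5200, 6100, 3500, 4500]

# Pair each strength with its engineered binary label
labeled = [(s, meets_standard(s)) for s in strengths]

# Hold out 20% of the rows for testing (the fraction is an assumption)
train, test = train_test_split(labeled, test_fraction=0.2, seed=0)
```

A classifier would then be fitted on `train` only and scored on `test`, so the reported accuracy reflects rows the model has not seen.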
Interesting insights and limitations:
Some concrete ingredients are not strictly ‘necessary’ but are additives that strengthen concrete by enhancing the effects of primary ingredients such as cement. These additives (such as Superplasticizer, Slag, and Fly Ash) interact strongly with other concrete ingredients to improve overall strength. Once we added interaction terms between these ancillary ingredients and the primary ingredients to our multiple linear regression, the model's R² value improved by 15%.
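To make the idea of an interaction term concrete, the sketch below fits an ordinary least-squares model with and without a product column and compares R². Everything in it is an assumption for illustration: the data are synthetic (generated with a built-in cement × additive effect), the helper names are hypothetical, and the small normal-equations solver stands in for whatever regression routine was actually used.

```python
import random

def add_interaction(rows, i, j):
    """Append the product of columns i and j to every feature row."""
    return [row + [row[i] * row[j]] for row in rows]

def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (intercept first)."""
    A = [[1.0] + list(r) for r in X]
    p = len(A[0])
    # Augmented system [A^T A | A^T y]
    M = [[sum(a[r] * a[c] for a in A) for c in range(p)]
         + [sum(a[r] * t for a, t in zip(A, y))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[r][p] / M[r][r] for r in range(p)]

def r_squared(X, y, beta):
    """Share of the variance in y explained by the fitted model."""
    pred = [beta[0] + sum(b * x for b, x in zip(beta[1:], row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

# Synthetic data (NOT the study's data): columns are [cement, additive]
rng = random.Random(0)
X_main = [[rng.uniform(1, 5), rng.uniform(0, 3)] for _ in range(200)]
y = [10 + 5 * c + 1 * s + 3 * c * s + rng.gauss(0, 1) for c, s in X_main]

# Main effects only versus main effects plus the cement x additive product
X_inter = add_interaction(X_main, 0, 1)
r2_main = r_squared(X_main, y, fit_ols(X_main, y))
r2_inter = r_squared(X_inter, y, fit_ols(X_inter, y))
```

Because the synthetic strength depends on the product of the two columns, `r2_inter` comes out higher than `r2_main`; on real data the size of the gain depends on how strong the interaction actually is.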
Some variables have a direct effect on the strength of concrete without any interaction term (or added ingredients). For example, cement content correlates closely with concrete compressive strength. This is evident both in the low p-value for cement in the model summary and in a simple scatter plot of the two variables.
Our logistic regression model had an accuracy of 87%. In other words, if someone plugs the ingredient quantities for a concrete mix into our model, it will predict whether the resulting compressive strength is above or below 4000 PSI, and on our testing subset that prediction was correct 87% of the time. Logistic regression models of this kind are likely widely used in the real world, where the consequences of concrete failing unexpectedly can be severe.
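A single accuracy figure combines two kinds of mistakes: predicting "above 4000 PSI" for concrete that falls short, and predicting "below" for concrete that passes. The sketch below shows how accuracy is computed from the four outcome counts. The function names and the label vectors are hypothetical and are not taken from this analysis.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels (1 = above threshold, 0 = below)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == 1:
            counts["tp" if t == 1 else "fp"] += 1
        else:
            counts["tn" if t == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Fraction of predictions that match the true label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Hypothetical test-set labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
acc = accuracy(counts)  # (3 + 3) / 8 = 0.75
```

In this toy example the `fp` count is the case where weak concrete is predicted to pass, which is the error most closely tied to unexpected failure.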
While decent, our model could likely be improved. With more time, we could determine the interaction terms between variables more precisely. In this model, we captured only a few obvious ones suggested by our diagnostic plots and EDA. Given how much those interaction terms improved our model, more refined ones may improve it further.