Darren

Recently Published

Functional Random Forest
Random Forest is a machine learning model built from an ensemble of decision trees. A decision tree starts with a binary question (e.g., “Is the patient’s Age >= 45?”) and creates two branches, one for yes and one for no. This branching continues until the tree reaches a terminal node, which assigns a prediction, either categorical or continuous. A random forest contains a set number of decision trees; for our model, 500. Each tree is grown on a bootstrap sample of the training data, and at each split the model randomly selects mtry candidate variables without replacement (we typically set mtry = 5) and searches for the best splitting threshold, i.e., the number after >=. For RFs with continuous outputs, like Systolic Blood Pressure, the outputs of all 500 trees are averaged together for each subject. Since we represented the functional data with B-splines, each of the 20 basis coefficients counted as a variable in this model.
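As a rough illustration, a forest like this could be fit with the randomForest package in R; the data frame and column names (train, test, sbp) are placeholders, not the actual NHANES objects:

```r
library(randomForest)

# Each row is one subject; the predictors are the 20 B-spline basis
# coefficients (plus any scalar covariates), and sbp is the outcome.
set.seed(1)
rf <- randomForest(
  sbp ~ .,         # predict Systolic Blood Pressure from all other columns
  data  = train,   # a bootstrap sample is drawn internally for each tree
  ntree = 500,     # number of trees in the forest
  mtry  = 5        # candidate variables sampled at each split
)

# For a continuous outcome, predictions average all 500 trees per subject
preds <- predict(rf, newdata = test)
```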
RNN with GRU
One common issue with Recurrent Neural Networks is exploding or vanishing gradients, where the gradients flowing back through the previous state (h(t-1)) and its weight grow extremely large or, more commonly, shrink to practically zero during backpropagation. Since our sequences are long (1440 time points), this needs to be addressed. One way to do so is by implementing a Gated Recurrent Unit, which changes the hidden state equation to include an update gate and a reset gate. The update gate balances how much of the new candidate hidden state to incorporate relative to the previous hidden state, and the reset gate controls how much past information to forget when computing that candidate state. The rest of this model ran similarly to our Vanilla RNN, with 10 epochs, measuring MSE and MAE.
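A minimal sketch of such a model in R with keras, assuming the functional series are stored as an array of shape (subjects, 1440, 1); the unit count and object names (x_train, y_train) are illustrative, not the exact architecture from the post:

```r
library(keras)

# GRU-based recurrent model: layer_gru implements the update and reset
# gates internally, replacing the plain tanh recurrence.
model <- keras_model_sequential() %>%
  layer_gru(units = 32, input_shape = c(1440, 1)) %>%
  layer_dense(units = 1)   # single continuous output (e.g., SBP)

model %>% compile(
  optimizer = "adam",
  loss      = "mse",
  metrics   = "mae"
)

history <- model %>% fit(
  x_train, y_train,        # placeholder arrays
  epochs = 10,
  validation_split = 0.2
)
```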
NHANES Vanilla RNN
A Recurrent Neural Network (RNN) is a model that takes in functional data one data point at a time (t), computing a hidden state (h) from the current input (x) and the previous hidden state, each multiplied by a learned weight matrix (W), shifted by a learned bias term (b), and passed through the tanh activation:

h(t) = tanh(W(h) * h(t-1) + W(x) * x(t) + b)

Scalar, non-functional data is input as a flat vector and passed through a ReLU activation, since its relationship to the outcome is simpler. We trained the model using a validation split of 0.2, meaning that the RNN cycles through 80% of the training data, updating the model’s weights as it goes, and then uses the remaining 20% of the training data to measure the performance of the model via mean squared error and mean absolute error. One full pass through the training data, with the weights adjusted by backpropagation, is called an epoch. In each epoch, the model uses the “adam” (Adaptive Moment Estimation) optimizer, which does particularly well with noisy or high-dimensional data. Our model typically stopped improving after about 8 epochs, so we set the RNN to run 10 epochs, with the idea that this would play a role similar to 10-fold cross-validation. Variable Importance is found through a permutation-based loop. For each variable, its values were randomly shuffled across all subjects, breaking any true association with the outcome. The modified data was passed through the trained model, and the increase in RMSE was recorded. A larger increase indicates greater importance of that variable.
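As a hedged sketch, the vanilla RNN and the permutation-importance loop might look as follows in R with keras; sizes and object names (x_train, y_train, X) are placeholders:

```r
library(keras)

# Vanilla RNN: layer_simple_rnn applies
# h(t) = tanh(W(h) * h(t-1) + W(x) * x(t) + b) at each of the 1440 steps.
model <- keras_model_sequential() %>%
  layer_simple_rnn(units = 32, activation = "tanh",
                   input_shape = c(1440, 1)) %>%
  layer_dense(units = 1)

model %>% compile(optimizer = "adam", loss = "mse", metrics = "mae")
model %>% fit(x_train, y_train, epochs = 10, validation_split = 0.2)

# Permutation importance, shown for a flat predictor matrix X; the same
# shuffle-across-subjects idea applies to the RNN's 3-D functional input.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

perm_importance <- function(predict_fn, X, y) {
  base <- rmse(y, predict_fn(X))
  sapply(seq_len(ncol(X)), function(j) {
    Xp <- X
    Xp[, j] <- sample(Xp[, j])      # break the true association
    rmse(y, predict_fn(Xp)) - base  # larger increase = more important
  })
}
```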
NHANES fSIR
Functional Sliced Inverse Regression is a dimension reduction technique that separates the training data into H slices, grouping subjects by their outcome, in this case Systolic Blood Pressure. The inverse regression of the functional data given Systolic Blood Pressure is then estimated to train the model. The model produces several sufficient predictors that summarize the functional data, each capturing a decreasing share of the association with the outcome. We found that the model performed best using two of these sufficient predictors and H = 5 slices. As with all of our models, we used 10-fold cross-validation, where the data is split into ten folds, the model trains on nine of them, and tests on the remaining fold; this process is repeated so that each fold serves as the test set once. We also smoothed the functional data using 20 basis functions, consistent with our other models. To measure performance, we trained a linear regression model using the sufficient predictors and the scalar variables.
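To make the slicing step concrete, here is a minimal re-implementation of scalar SIR in R, assuming the smoothed curves have already been reduced to a matrix X of 20 basis coefficients per subject; this is an illustrative sketch, not the exact fSIR code we used:

```r
# Sliced inverse regression on basis coefficients (illustrative sketch).
sir <- function(X, y, H = 5, d = 2) {
  n  <- nrow(X)
  mu <- colMeans(X)
  Xc <- scale(X, center = mu, scale = FALSE)
  R_inv <- solve(chol(cov(X)))     # whiten the predictors so cov(Z) = I
  Z  <- Xc %*% R_inv

  # Group subjects into H slices by their outcome (e.g., SBP)
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)

  # Weighted covariance of the slice means of the whitened predictors
  M <- matrix(0, ncol(X), ncol(X))
  for (h in seq_len(H)) {
    idx <- slice == h
    mh  <- colMeans(Z[idx, , drop = FALSE])
    M   <- M + (sum(idx) / n) * tcrossprod(mh)
  }

  # Leading eigenvectors give the sufficient directions (back-transformed)
  B <- R_inv %*% eigen(M, symmetric = TRUE)$vectors[, 1:d]
  list(directions = B, predictors = Xc %*% B)
}

# The d = 2 columns of fit$predictors would then enter a linear regression
# alongside the scalar covariates to measure performance.
fit <- sir(X, sbp, H = 5, d = 2)
```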