In my previous blog post, Random Forest: An intuitive and an analytical introduction (part 1: Decision trees), I discussed the theory behind decision trees and was deliberately technical, because I believe that understanding the methods used is key to making good predictions. Theoretical knowledge of a technique is not enough, however: you also need to get your hands dirty and apply it. This is why I decided to write a practical example before digging deeper into the math behind random forest models in the second part of Random Forest: An intuitive and an analytical introduction.
An important issue in health-care is predicting future outcomes, events or risks for individuals experiencing specific symptoms. Identifying individuals in the early stages of an illness enables health-care professionals to rapidly take measures that will hopefully avert potentially fatal outcomes and/or lead to a successful recovery for patients experiencing discomfort. I should immediately point out that I am not a medical expert and that what is described below is simply a fictional example of how a random forest model could be applied by health-care providers to guide their patients or junior staff.
The internet is loaded with sites describing symptoms of heart disease (or any other disease for that matter) that individuals should be aware of, and it is not uncommon for people experiencing some discomfort to seek out that kind of information. In many cases they find valuable information that helps them make a rational decision: stop worrying or call 911. In some cases their worry just increases, and they end up in the ER thinking they have the early signs of a heart attack, only to be told that maybe they shouldn’t have eaten that extra burrito with jalapenos and drunk those four fantastic margaritas. The assessment in the ER was rather easy, since the 25-year-old, well-built and athletic man described his meal and was asked a series of questions that immediately excluded any risk of heart failure.
Now, what if there existed an online self-assessment tool that immediately informed citizens of their risk of having the first signs of heart failure? Can it be built, and can it be considered reliable? Assume that we have managed to gather sufficiently many of the symptoms common to heart failure and that we have, by exploring large amounts of data, ranked them by their importance (i.e., we have determined their weights). Suppose then that we have come up with the following 15 questions to ask patients:
- Do you experience chest discomfort?
- Do you experience nausea, a feeling of indigestion, heartburn, or stomach pain?
- Do you feel pain that spreads to the arm?
- Do you feel dizzy or lightheaded?
- Do you experience throat or jaw pain?
- Do you easily get exhausted?
- Do you snore?
- Do you experience sudden sweating?
- Do you have a cough that won’t quit?
- Are your legs, feet, and ankles swollen?
- Do you experience an irregular heartbeat?
- Do you have anxiety?
- Do you have a low-grade fever?
- Do you experience heart palpitations (a sudden pounding, fluttering, or racing feeling in the heart)?
- Have you had a heart attack in the past?
To be able to grade the severity of the individual’s experience of the symptoms, I have assumed that the patient's response is given on a scale from 1 to 6, where 1 is a light symptom and 6 is a severe one. In practice, the answers may be of different types, but they can be translated to a 6-level scale when preparing the data for modelling.
Suppose that we have also determined the weight of every symptom, as given in the following table:
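As a minimal sketch of that translation step, the R code below recodes two hypothetical answer formats to the 1-6 scale. Neither format is specified in this post: the six verbal labels, the 0-10 rating and its cut points are all assumptions made purely for illustration.

```r
# Hypothetical verbal answers, ordered from lightest to most severe.
# These six labels are an assumption made for illustration only.
verbal_levels <- c("never", "rarely", "sometimes", "often", "very often", "always")

to_scale_verbal <- function(answer) {
  # match() returns the position of each answer in verbal_levels, i.e. 1..6
  match(tolower(answer), verbal_levels)
}

to_scale_numeric <- function(rating) {
  # Bin a hypothetical 0-10 rating into six ordered groups (arbitrary cut points)
  cut(rating, breaks = c(-Inf, 1, 3, 5, 7, 9, Inf), labels = FALSE)
}

to_scale_verbal(c("Rarely", "always"))
to_scale_numeric(c(0, 5, 10))
```

Whatever the original format, every symptom then ends up as an integer between 1 and 6, which is what the rest of this example assumes.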
It could very well be that some combinations of these symptoms indicate other heart diseases but not heart failure. In this example we will assume that we are in the presence of four different diagnoses given by subsets of , one of which is heart attack. Now, symptoms per se are good predictors of an event, but we know that other factors need to be taken into consideration. The age and gender of the individual are also indicators, just as life habits and their consequences are. In the dataset that I have generated I have also taken into account the following factors:
- BMI (Body Mass Index)
- Family history of heart disease (if one or both parents have had heart problems)
- Comorbidity (the presence of other diseases)
There are obviously many other questions that could be asked, but I chose to restrict my presentation to the most obvious parameters. Smoking has been shown to decrease the oxygen supply to the heart, increase blood pressure and heart rate, increase blood clotting, and damage the cells that line coronary arteries and other blood vessels. A higher BMI also increases the risk of heart disease (Body mass index, waist circumference, and risk of coronary heart disease: a prospective study among men and women), and genetic markers transmitted by one or both parents may increase the risk of coronary diseases.
Clinical studies have been conducted to assess the risks of different heart conditions in individuals, but I have unfortunately not had access to their data. By investigating this further, and through tedious processes, I would probably be able to obtain the information, but since the point of this blog is to describe how random forest models can be constructed, I have not engaged in this time-consuming process. Instead, I chose to generate a dataset using different rules involving the individuals’ characteristics, such as gender, life habits, BMI and family history. I have thereby assumed that some of my symptoms are strongly correlated with these facts about patients, while others aren’t. Now, I am aware that this method of generating data is naive and that it often generates sets of highly correlated variables, but as the corrgram below indicates, I managed to avoid this, at least to some extent.
To simplify things I assumed that we are only in the presence of four distinct diagnoses:
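To make the idea of rule-based data generation concrete, here is a minimal sketch in R. It is not the code used to build the dataset in this post: the column names, the distributions and the single rule linking a symptom score `S1` to smoking and BMI are all hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible
n <- 1500     # same number of patients as in the dataset used in this post

# Patient characteristics (all distributions are illustrative assumptions)
patients <- data.frame(
  Gender        = sample(c("F", "M"), n, replace = TRUE),
  Smoker        = sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)),
  BMI           = round(rnorm(n, mean = 26, sd = 4), 1),
  FamilyHistory = sample(c(0, 1), n, replace = TRUE, prob = c(0.7, 0.3))
)

# Hypothetical rule: a symptom score that rises with smoking and a high BMI,
# plus some random noise, clamped to the 1-6 scale
raw <- 1 + 2 * patients$Smoker + (patients$BMI > 30) + sample(0:2, n, replace = TRUE)
patients$S1 <- pmin(6, pmax(1, raw))

# Pairwise correlations between the numeric columns (what a corrgram visualises)
round(cor(patients[, c("Smoker", "BMI", "FamilyHistory", "S1")]), 2)
```

Inspecting the correlation matrix in this way is a quick check that the rules have not made the predictors near-duplicates of one another.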
- Diagnosis 1, given by
- Diagnosis 2, given by
- Diagnosis 3, given by
- Heart Failure, given by all symptoms
Using all the rules, weights and characteristics, I assigned Low, Middle and High risks for each given diagnosis and ended up with the following dataset, in which I have hidden most variables:
Notice that all rows containing zero for the variables are given as heart failure individuals. These are the only rows for which none of the variables takes the value 0. Also note that, for instance, row 2 (individual 2) has . This means that, given the values entered, the individual could be suffering from any of diagnoses 1 to 3. Given the scale 1-6, I set up a rule assigning a probability to each diagnosis. Hence, individual 2 is most probably suffering from diagnosis 3 rather than 1 or 2.
Given the dataset that we have, we are now ready to construct our random forest model. I chose to use a quite common R package, namely randomForest. One of the first steps is to choose which variables to include in the model as predictors and as targets. The target here is quite obviously the risk, RISK_QUANT in the above table, i.e., the individual's risk for a given diagnosis. We do not want the individual's identity to be a predictor, but for all other variables the choice strongly depends on the situation. I simply chose to include all variables as predictors except those that I created as categorical variables.
# Drop the patient identifier and the derived diagnosis code
Randomized_patients = Randomized_patients[, !(names(Randomized_patients) %in% c("IND", "MostProbableDiagCode"))]
The chosen model needs to be trained on a subset of the dataset and validated on the remaining rows. We therefore need to determine a training set and a validation set in the following way:
# Use 90% of the rows for training and the remaining 10% for validation
sample_size = floor(0.90 * nrow(Randomized_patients))
train_ind = sample(seq_len(nrow(Randomized_patients)), size = sample_size)
Randomized_patients_train = Randomized_patients[train_ind, ]
Randomized_patients_test = Randomized_patients[-train_ind, ]
We chose to use 90% of our original dataset Randomized_patients as a training set simply because of the relatively small size of the dataset (just 1500 patients). In many cases, 60-70% is enough. An important step is to check that the distribution of the target variable in the original dataset is preserved in the training set.
As we can see, the proportions of every risk level (in RISK_QUANT) are fairly well preserved, and we may therefore proceed with our model. The R package randomForest is quite comprehensive, and the construction of a model requires a single line of code:
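One way to carry out that check, sketched here on a toy stand-in for the real data (the class probabilities below are assumptions, and only the target column is simulated), is to compare the class proportions before and after the split:

```r
set.seed(1)
# Toy stand-in for Randomized_patients: only the target column is simulated here
toy <- data.frame(
  RISK_QUANT = factor(sample(c("Low", "Middle", "High"), 1500, replace = TRUE,
                             prob = c(0.5, 0.3, 0.2)))
)

train_ind <- sample(seq_len(nrow(toy)), size = floor(0.90 * nrow(toy)))
toy_train <- toy[train_ind, , drop = FALSE]

# Share of each risk level in the full data and in the training subset
prop_all   <- prop.table(table(toy$RISK_QUANT))
prop_train <- prop.table(table(toy_train$RISK_QUANT))

comparison <- round(rbind(original = c(prop_all), training = c(prop_train)), 3)
comparison
```

If the two rows of `comparison` differ noticeably, a stratified split (sampling within each risk level separately) would be a reasonable alternative to the plain random sample.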
fit = randomForest(RISK_QUANT ~ ., data = Randomized_patients_train, ntree = 200, importance = TRUE)
in which we indicate that the data to be fitted is the training dataset and that the target variable RISK_QUANT is to be modelled using all the other variables we chose for the model. The parameter ntree is simply the number of trees to be created.
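For readers who want to try the formula interface without the patient data, here is a small self-contained sketch. It assumes the randomForest package is installed; the toy columns `S1` and `S2` and the rule that derives the risk level from `S1` are hypothetical.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(123)
# Toy data with hypothetical symptom columns S1 and S2 on the 1-6 scale
toy <- data.frame(S1 = sample(1:6, 300, replace = TRUE),
                  S2 = sample(1:6, 300, replace = TRUE))
# Illustrative rule: the risk level is driven by S1 only
toy$RISK_QUANT <- factor(ifelse(toy$S1 >= 5, "High",
                         ifelse(toy$S1 >= 3, "Middle", "Low")))

# "RISK_QUANT ~ ." reads: model RISK_QUANT using every other column of `data`
toy_fit <- randomForest(RISK_QUANT ~ ., data = toy, ntree = 200, importance = TRUE)

toy_fit$ntree  # number of trees grown
toy_fit$type   # "classification", because the response is a factor
```

Writing the response as a bare column name on the left-hand side keeps it out of the predictors that the dot expands to.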
Two things can be noted:
- The model determines by itself whether it is a regression problem or a classification problem. This can be seen by printing the fitted model in R (typing fit at the prompt), which reports the type of random forest.
- I chose to set the parameter importance to TRUE in the model. The reason is that I wish to determine which variables contribute to the assignment of a risk level. I already know this, since I created the dataset from a number of predetermined weights and a set of rules. In reality this is not known, and it can be very informative to have access to this information. It is easily obtained with the following lines of code:
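# Gini-based importance (type = 2), printed from most to least important
var.imp = data.frame(importance(fit, type = 2))
var.imp$Variables = row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini, decreasing = TRUE), ]
# Widen the outer margins so that long variable names fit, then plot
par(oma = c(5, 7, 1, 1))
varImpPlot(fit, sort = TRUE, main = "Variable Importance", n.var = 20)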
giving the following plot
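The importance measure requested with type = 2 is the mean decrease in Gini impurity. As a rough illustration of what a single "decrease" means, the sketch below computes it by hand for one hypothetical split of four patients; the package accumulates such decreases over all splits on a variable and averages them over the trees, and its exact scaling may differ from this toy calculation.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Hypothetical node with four patients, split on some symptom into two children
parent <- c("High", "High", "Low", "Low")
left   <- c("High", "High")
right  <- c("Low", "Low")

# Decrease in impurity = parent impurity - size-weighted child impurities
decrease <- gini(parent) -
  (length(left) / length(parent)) * gini(left) -
  (length(right) / length(parent)) * gini(right)
decrease  # 0.5: the split separates the two classes perfectly
```

A variable that often produces clean splits like this one ends up near the top of the importance plot.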
The last two steps are to use the model to predict the response variable on both the training and test sets and to check the accuracy of the model using confusion matrices.
library(e1071)
library(caret)
# Predict the response variable on the training set
Randomized_patients_train$predicted.response <- predict(fit, Randomized_patients_train)
# Confusion matrix for the training set
confusionMatrix(data = Randomized_patients_train$predicted.response, reference = Randomized_patients_train$RISK_QUANT)
# Predict the response variable on the test set
Randomized_patients_test$predicted.response <- predict(fit, Randomized_patients_test)
# Confusion matrix for the test set
confusionMatrix(data = Randomized_patients_test$predicted.response, reference = Randomized_patients_test$RISK_QUANT)
The accuracy of the model on the training set is 1, that is, there are no misclassifications of the predicted risks. This is not very surprising, since the model is built on a fictional dataset that was not really random (given its rules and characteristics). For the predictions made on the test set I reached an accuracy of 96%, which means that there were some misclassifications but that I can still be quite confident about the performance of my model.
As heart failure and other heart-related diseases are involved in this model, I would not implement it and risk misclassifications, but I am quite confident that other areas of medicine (among many other fields that need to make predictions) could benefit from such a model, given the right inputs and sufficient research on the subject. One area that I have already described in a previous blog is the prediction of airborne particle levels.
This blog also serves as an introduction to the second part of my blog series Random Forest: An intuitive and an analytical introduction, in which I will discuss the mathematics of random forest models. As you might have noticed, the single line creating the above model gives no information about the inner workings of the model.
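To show how an accuracy figure is read off a confusion matrix, here is a small sketch in base R. The ten true and predicted risk levels are made up for illustration and are not taken from the model above.

```r
# Hypothetical true and predicted risk levels for ten patients
lv <- c("Low", "Middle", "High")
actual    <- factor(c("Low", "Low", "Low", "Middle", "Middle",
                      "Middle", "High", "High", "High", "High"), levels = lv)
predicted <- factor(c("Low", "Low", "Middle", "Middle", "Middle",
                      "Middle", "High", "High", "High", "Middle"), levels = lv)

# Rows are predictions, columns are true classes; the diagonal holds the hits
conf_mat <- table(Predicted = predicted, Actual = actual)
conf_mat

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.8: eight of the ten predictions are correct
```

The off-diagonal cells are worth reading as carefully as the accuracy itself: in a medical setting, a High risk predicted as Middle is a more serious error than the reverse.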