Many companies today understand the importance of keeping their customers as satisfied as possible. With a large number of actors on the market, retaining customers is of paramount importance. It is well known that acquiring a new customer is far costlier than the effort needed to keep an existing one from leaving or, as it is called, churning. Thus, whenever a customer churns, a company loses not only its investment in recruiting that customer but also the future revenues associated with that customer. The term churn refers to the movement of individuals out of a group over a given period. Another term used in the literature is attrition.
But how do you determine whether a customer is about to churn? Are there different ways to do so depending on the type of business? Some businesses provide a service that can be sought from time to time, while others demand a long-term commitment from customers in the form of a contract of fixed duration. Online shopping services are of the first type and cellphone service providers are of the second type. Naturally, the two types need to be handled in separate ways because the data available about the customers differ.
In this blog post, we intend to describe some of the methods used to determine whether a customer is about to churn. This knowledge then enables a business to design strategies that might prevent the losses associated with groups of customers choosing the services of competitors. Of course, the accuracy of such a model depends on the amount of data available to the analyst, but today this is a lesser concern since most businesses gather vast amounts of data about their subscribers and about most of the transactions associated with them. So the real problem is mining the data in the right way, building a good model and predicting reliable churn rates.
In this exercise, we will use a fictitious but realistic telecommunications dataset (available online at http://www.dataminingconsultant.com/data/churn.txt). Several important aspects need to be mentioned before going any further:
- Modelling churn (as with many predictive models) cannot be done without a sufficient amount of historical churn data. What we seek is what customers who churned had in common, or what differentiated them from those who didn’t, and this task cannot be achieved without information about which customers have or haven’t churned.
- People may have widely different reasons to end their contracts with service providers. As a model’s aim is to identify patterns, it is likely to misinterpret some customers’ behavior as possible attrition. The model results are therefore not entirely accurate.
- Once a model has been built and tuned, it needs to be updated (retrained) regularly. Indeed, customer behavior may change over time, for instance due to the service provider’s changes in pricing, market diversity or technical advances that make the product less attractive.
We shall also present two approaches, Random Forest and Survival analysis, to analyzing and predicting churn. The two methods can be used together, partly to predict churn, partly to predict time to churn.
A random decision forest is an ensemble learning method for classification, regression and a range of other tasks. It operates by constructing a multitude of decision trees at training time and outputting either the class that is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). We have already described Random forest in a previous blog, https://kentoranalytics.com/blog/2017/3/21/partikelkollense, and invite readers to read it.
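The aggregation step described above can be illustrated without fitting any trees. The sketch below assumes a hypothetical matrix of class votes (one row per customer, one column per tree) and a hypothetical matrix of numeric tree outputs. All values are made up, and this is only the mode/mean rule, not the implementation of any particular random forest package.

```r
# Hypothetical votes from 5 trees for 3 customers (values are made up).
votes <- matrix(
  c("True.",  "False.", "True.",  "True.",  "False.",
    "False.", "False.", "False.", "True.",  "False.",
    "True.",  "True.",  "False.", "True.",  "True."),
  nrow = 3, byrow = TRUE
)

# Classification: each customer gets the most frequent class among the trees.
majority_vote <- function(v) names(which.max(table(v)))
predicted_class <- apply(votes, 1, majority_vote)

# Hypothetical numeric outputs of the same 5 trees (values are made up).
tree_outputs <- matrix(
  c(0.8, 0.4, 0.7, 0.9, 0.2,
    0.1, 0.3, 0.2, 0.6, 0.1,
    0.9, 0.7, 0.4, 0.8, 0.6),
  nrow = 3, byrow = TRUE
)

# Regression: each customer gets the mean of the trees' numeric predictions.
predicted_mean <- rowMeans(tree_outputs)

print(predicted_class)
print(predicted_mean)
```

With an odd number of trees and two classes, the majority vote can never tie, which is one practical reason to prefer an odd ensemble size in a sketch like this.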
As we saw in the previous section, we have the ability to predict customer attrition. Although the type of model described there has reasonable accuracy, it lacks one important feature, namely a prediction of when churn will occur. It is of course impossible to determine the precise time of such an event, but it is possible to determine the probability of an event within a given time interval. To do so, we use the Kaplan-Meier estimator together with Greenwood’s formula for its variance. KentorAnalytics does not usually intend to educate its readers in statistics or mathematics. However, to convey the essence of the Kaplan-Meier estimator, we chose to give a light version of how one arrives at Greenwood’s formula.
Let \(X_1, X_2, \dots, X_n\) be the times of either (i) an observed churn or (ii) the last time a customer was observed as such. Let \(\delta_i = 1\) if \(X_i\) is an observed attrition (churn) and \(\delta_i = 0\) if the \(i\)-th individual was last seen as a customer (that is, churn is false) at time \(X_i\), but has not been observed as such since then. The concept of censored and uncensored customers is essential here because, at the time we make a prediction, individuals have been members of the studied sample for different lengths of time and it may be difficult to compare them. If \(\delta_i = 0\), the TRUE churn time for the \(i\)-th individual is \(T_i > X_i\), but the customer dropped out of the study at time \(X_i\). This means that we perform our analysis not knowing anything about that particular individual’s intentions (stay or go). We say that \(T_i\) was censored at time \(X_i\). If \(\delta_i = 1\), \(X_i = T_i\) is the observed time of attrition.
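In R, a pair of observed time and churn indicator maps onto a Surv object from the survival package. The sketch below uses four made-up customers. The variable names `x` and `delta` are chosen here only to mirror the notation and are not part of the dataset.

```r
library(survival)

# Made-up example: observed times in days and indicators
# (1 = churn observed, 0 = censored, i.e. still a customer when last seen).
x     <- c(12, 30, 30, 45)
delta <- c(1, 0, 1, 0)

s <- Surv(time = x, event = delta)
print(s)  # censored observations are printed with a trailing "+"

# For a censored customer we only know that the true churn time exceeds x.
censored <- delta == 0
known <- ifelse(censored, paste("T >", x), paste("T =", x))
print(data.frame(x = x, delta = delta, known = known))
```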
Now, the natural question that poses itself is: Can we say ANYTHING about the future attrition of censored and un-censored individuals? That is, can we compute estimates of when individuals will churn and possibly/alternatively give probabilities for customer attrition within a given time frame?
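Assume \(t_1 < t_2 < \dots < t_m\) are distinct times arranged in increasing order. The times \(t_j\) are the observed churn times in the sample data, that is, the values \(t_j = X_i\) for some \(i\) with \(\delta_i = 1\). Let \(r_j\) be the size of the risk set at time \(t_j\). What this means is that \(r_j\) is the number of individuals in the sample that had neither churned nor been censored before time \(t_j\). For \(j \ge 2\), \(r_j = r_{j-1} - d_{j-1} - c_{j-1}\), where \(d_{j-1}\) is the number of individuals that churned at time \(t_{j-1}\) and \(c_{j-1}\) the number who are censored at times \(t\) with \(t_{j-1} \le t < t_j\). The Kaplan-Meier estimator of the survival function \(S(t) = \Pr(T > t)\) is then given by
\[ \hat S(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{r_j} \right). \]
Since \(\hat S(t)\) is an estimate, we should give this point estimate together with a confidence interval. If the required confidence is \(100(1-\alpha)\%\), then Greenwood’s formula gives
\[ \widehat{\mathrm{Var}}\big(\hat S(t)\big) = \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{r_j (r_j - d_j)}, \qquad \hat S(t) \pm z_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}\big(\hat S(t)\big)}, \]
and \(z_{\alpha/2}\) is the \((1-\alpha/2)\)-th quantile of the standard normal distribution, so that if a 95% confidence interval is required, then \(\alpha = 0.05\) and \(z_{\alpha/2} = 1.96\).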
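To make the estimator and its confidence interval concrete, here is a minimal sketch that computes the Kaplan-Meier product and the Greenwood standard error by hand on a made-up sample of eight customers. The function name `km_greenwood` and the data are illustrative assumptions. The last lines cross-check the point estimates against `survfit` from the survival package.

```r
# Made-up sample: observed times and churn indicators (1 = churn, 0 = censored).
x     <- c(2, 3, 3, 5, 7, 8, 8, 10)
delta <- c(1, 1, 0, 1, 0, 1, 1, 0)

km_greenwood <- function(x, delta, alpha = 0.05) {
  t_j <- sort(unique(x[delta == 1]))                        # distinct churn times
  r_j <- sapply(t_j, function(t) sum(x >= t))               # risk set size at t_j
  d_j <- sapply(t_j, function(t) sum(x == t & delta == 1))  # churns at t_j
  surv <- cumprod(1 - d_j / r_j)                            # Kaplan-Meier estimate
  se   <- surv * sqrt(cumsum(d_j / (r_j * (r_j - d_j))))    # Greenwood standard error
  z    <- qnorm(1 - alpha / 2)                              # 1.96 for alpha = 0.05
  data.frame(time = t_j, at_risk = r_j, churned = d_j, surv = surv, se = se,
             lower = pmax(surv - z * se, 0), upper = pmin(surv + z * se, 1))
}

km <- km_greenwood(x, delta)
print(km)

# Cross-check of the point estimates against the survival package.
library(survival)
fit_check <- survfit(Surv(x, delta) ~ 1, conf.type = "plain")
print(all.equal(summary(fit_check)$surv, km$surv))
```

Note that the Greenwood term is undefined when everyone still at risk churns at the same time (\(r_j = d_j\)); the sketch does not handle that case and would return NaN for the standard error there.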
With this in mind, we finally get to work on our data. One question that immediately arises is whether one really needs to include all variables in the survival model. This dataset contains very little data compared to what a business might have at hand, and in those cases it may be a smart and cheap move to first investigate whether a variable matters for churn or not. Dropping irrelevant variables shortens computation time and lowers its cost. If nothing else, it might help you understand the data and inspire relevant statistical investigations.
We can graphically investigate the relation between churning and different variables.
    names(TeleComData)
    "State", "Account.Length", "Area.Code", "Phone", "Int.l.Plan", "VMail.Plan",
    "VMail.Message", "Day.Mins", "Day.Calls", "Day.Charge", "Eve.Mins", "Eve.Calls",
    "Eve.Charge", "Night.Mins", "Night.Calls", "Night.Charge", "Intl.Mins", "Intl.Calls",
    "Intl.Charge", "CustServ.Calls", "Churn."
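A graphical check of this kind can be sketched as follows. The data frame `toy` is a made-up stand-in that borrows three column names from the listing above; its values are illustrative and do not come from the telecom dataset.

```r
# Made-up rows using column names from the listing (values are illustrative only).
toy <- data.frame(
  Day.Charge     = c(45.1, 27.5, 41.4, 50.9, 28.3, 37.9, 55.5, 26.7),
  CustServ.Calls = c(1, 1, 0, 4, 3, 0, 5, 2),
  Churn.         = factor(c("False.", "False.", "False.", "True.",
                            "False.", "False.", "True.", "False."))
)

# Distribution of a numeric variable split by churn status.
boxplot(Day.Charge ~ Churn., data = toy,
        xlab = "Churn", ylab = "Day charge", main = "Day charge by churn status")

# Numeric counterpart of the plot: mean day charge per churn group.
group_means <- tapply(toy$Day.Charge, toy$Churn., mean)
print(group_means)
```

On the real data the same two calls would be applied to `TeleComData`, one variable at a time.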
What would be your reasons to change service provider? Could it be that customers get curious about other service providers over time? This is apparently not the case, as the figure below shows.
One of the drivers in almost everything is, of course, money; a second is the reason for which one chooses one service over another. In the case of telecommunication, we know that charges vary over the course of the day. Daytime hours are often costlier since communication is a must during business hours and users are less sensitive to price.
These graphs can be compared to the corresponding graphs of usage in minutes of the services.
There are probably several other leads that could be investigated in order to determine all the reasons for which an individual chooses to leave a service provider in the hope of getting a better deal elsewhere, but the aim of this article is not an extensive study of one particular case but rather an introduction to the techniques usually used. We shall therefore restrict ourselves to the variables above. To check all the variables in a dataset, given the right format, one might even want to use linear-regression-style techniques to observe relationships between churn and these variables. A note of caution, though: one might want to perform multivariate regressions as well as univariate ones, since in some cases there might exist multiple reasons for churning.
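As a rough sketch of such a regression check, the example below fits univariate and multivariate models with base R's `glm`. Logistic regression is used because churn is a binary outcome; that choice is an assumption on our part, as is the simulated data frame `sim`, which only borrows two column names from the dataset and is generated from an arbitrary relationship.

```r
set.seed(1)
n <- 200

# Simulated customers (not the telecom dataset); column names mirror the dataset.
sim <- data.frame(
  CustServ.Calls = rpois(n, lambda = 1.5),
  Day.Charge     = rnorm(n, mean = 30, sd = 9)
)

# Assumed relationship, used only to generate example outcomes.
p <- plogis(-3 + 0.6 * sim$CustServ.Calls + 0.03 * sim$Day.Charge)
sim$churned <- rbinom(n, size = 1, prob = p)

# Univariate fits: one candidate variable at a time.
uni_calls  <- glm(churned ~ CustServ.Calls, data = sim, family = binomial)
uni_charge <- glm(churned ~ Day.Charge,     data = sim, family = binomial)

# Multivariate fit: both variables together.
multi <- glm(churned ~ CustServ.Calls + Day.Charge, data = sim, family = binomial)

print(summary(multi)$coefficients)
```

Comparing a variable's coefficient in its univariate fit with the one in the multivariate fit gives a first hint of whether its apparent effect is shared with other variables.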
To work through our material, we have chosen the R package survival (https://cran.r-project.org/web/packages/survival/index.html), which at the time of writing requires R 2.13.0 or later. It only imports a small number of other packages (graphics, Matrix, methods, splines) and includes definitions of Surv objects, Kaplan-Meier and Aalen-Johansen curves as well as Cox models.
We begin by creating a survival object using each customer’s time in the system, i.e. the time they spent as subscribers of the service, together with their churn status. When this is done, one needs to fit a survival curve to the data and, of course, plot it. Since we chose above to concentrate on three aspects, namely the overall survival of any customer in the system and that of individuals using their services at different times of the day, we create a Surv object for each of the three.
    library(survival)

    # Surv(time, event): the event indicator is TRUE for customers who churned
    TeleComData$Accountsurvival = Surv(TeleComData$Account.Length, TeleComData$Churn. == "True.")
    TeleComData$Daysurvival = Surv(TeleComData$Day.Charge, TeleComData$Churn. == "True.")
    TeleComData$Nightsurvival = Surv(TeleComData$Night.Charge, TeleComData$Churn. == "True.")
The fitted curves are then given by:
    fit = survfit(Accountsurvival ~ 1, data = TeleComData)
    fit2 = survfit(Daysurvival ~ 1, data = TeleComData)
    fit3 = survfit(Nightsurvival ~ 1, data = TeleComData)
The plots are easily done with the following code:
    SurvivalPlot = plot(fit, lty = 1:2, mark.time = FALSE, ylim = c(.05, 1),
                        xlab = 'Days since Subscribing', ylab = 'Percent Surviving')
    legend(20, .8, c('yes', 'no'), lty = 1:2, bty = 'n', ncol = 1)
    title(main = "Telecom Survival Curves")

    # For the charge-based curves the x-axis is the total charge, not time
    SurvivalDayCharge = plot(fit2, lty = 1:2, mark.time = FALSE, ylim = c(.05, 1),
                             xlab = 'Total day charge', ylab = 'Percent Surviving')
    legend(20, .8, c('yes', 'no'), lty = 1:2, bty = 'n', ncol = 1)
    title(main = "Survival Curves-Total day charge")

    SurvivalNightCharge = plot(fit3, lty = 1:2, mark.time = FALSE, ylim = c(.05, 1),
                               xlab = 'Total night charge', ylab = 'Percent Surviving')
    legend(10, .8, c('yes', 'no'), lty = 1:2, bty = 'n', ncol = 1)
    title(main = "Survival Curves-Total night charge")
which gives the following plots:
We can observe that about 50 percent of our customers can be expected to survive the first 200 days of their subscription to our services. Also note the widening of the confidence interval (here set to 95%).
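Figures like these can also be read off a fitted curve programmatically rather than from the plot. The sketch below uses a made-up sample of ten customers (the data frame `toy` and its `churned` column are assumptions) and computes the probability that a customer who is still subscribed at day 10 churns by day 30, which equals one minus the ratio of the two survival estimates.

```r
library(survival)

# Made-up sample: account length in days and churn indicator (1 = churned).
toy <- data.frame(
  Account.Length = c(5, 10, 10, 20, 25, 30, 30, 40, 50, 60),
  churned        = c(1, 0, 1, 1, 0, 1, 0, 1, 0, 0)
)

fit_toy <- survfit(Surv(Account.Length, churned) ~ 1, data = toy)

# Estimated survival probabilities at day 10 and day 30.
s <- summary(fit_toy, times = c(10, 30))$surv

# Probability of churning in (10, 30] given survival up to day 10.
p_churn_10_to_30 <- 1 - s[2] / s[1]
print(p_churn_10_to_30)
```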
Compare the graphs for individuals using their subscription mostly during the day and those using it mostly at night (graph below). Note the difference in survival probability.
A final step, which in this example we shall only perform for the overall account length, is a log-rank test. The log-rank test compares estimates of the hazard functions of two samples at each observed event time. It computes the observed and expected number of events in one of the samples at each observed event time and then adds these up to obtain an overall summary across all time points where there is an event.
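The observed-versus-expected bookkeeping is easiest to see on a small two-group example. The sketch below uses `survdiff` from the survival package on made-up data. The data frame `toy` and the grouping column `plan` are assumptions chosen for illustration, not columns of the telecom dataset.

```r
library(survival)

# Made-up data: subscription time in days, churn indicator and a two-level group.
toy <- data.frame(
  days    = c(6, 7, 10, 15, 19, 25, 12, 20, 28, 32, 34, 40),
  churned = c(1, 1, 1,  0,  1,  1,  0,  1,  0,  1,  0,  0),
  plan    = rep(c("yes", "no"), each = 6)
)

# Log-rank test: observed and expected churn counts per group,
# accumulated over all observed churn times.
lr <- survdiff(Surv(days, churned) ~ plan, data = toy)
print(lr)

# Observed and expected number of churn events in each group.
counts <- data.frame(group = names(lr$n), observed = lr$obs, expected = lr$exp)
print(counts)

# p-value from the chi-square statistic (1 degree of freedom for two groups).
p_value <- pchisq(lr$chisq, df = 1, lower.tail = FALSE)
print(p_value)
```

The expected counts always sum to the total number of observed events, so a group with more observed than expected churns is the one churning faster than the pooled sample.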
    # Log-rank test across the groups given on the right-hand side of the formula
    survdiff(Accountsurvival ~ Account.Length, data = TeleComData)

    # Survival probabilities and confidence bounds at each observed account length
    DataSurv = data.frame(Account.Length = fit$time, surv.prob = fit$surv,
                          surv.prob.upper = fit$upper, surv.prob.lower = fit$lower)

    # Attach the estimates to each customer and keep those who have not churned
    TelecomSurvTable = merge(TeleComData, DataSurv, by = "Account.Length")
    TeleComDataSurv = TelecomSurvTable[TelecomSurvTable$Churn. == "False.", ]
which gives us the probability that each customer continues to subscribe to our services:
        State Phone    Accountsurvival surv.prob surv.prob.upper surv.prob.lower
    1   SC    336-1043 1+              0.9997    1               0.9991122
    2   AK    373-1028 1+              0.9997    1               0.9991122
    3   SC    356-8621 1+              0.9997    1               0.9991122
    4   NJ    420-6780 1+              0.9997    1               0.9991122
    5   IA    331-2144 1+              0.9997    1               0.9991122
    6   TN    335-5591 1+              0.9997    1               0.9991122
    178 SC    359-5091 36+             0.9932518 0.9960663       0.9904451
    179 IA    385-3540 36+             0.9932518 0.9960663       0.9904451
    180 ME    335-3110 36+             0.9932518 0.9960663       0.9904451
    181 MI    400-3637 36+             0.9932518 0.9960663       0.9904451
    182 AK    341-9764 36+             0.9932518 0.9960663       0.9904451
    184 MI    386-1131 37+             0.9926213 0.9955672       0.9896842
    186 CO    408-1513 37+             0.9926213 0.9955672       0.9896842
    187 CT    347-7675 37+             0.9926213 0.9955672       0.9896842
    188 NH    341-7332 37+             0.9926213 0.9955672       0.9896842
    189 NE    393-7892 37+             0.9926213 0.9955672       0.9896842
    190 MD    420-2000 37+             0.9926213 0.9955672       0.9896842
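A table like this can be turned into a simple priority list for retention work. The sketch below copies six rows from the output above into a small data frame. The cut-off `threshold` and the column `flag_for_offer` are hypothetical choices, not part of the original analysis.

```r
# Six rows copied from the table above.
surviving <- data.frame(
  State     = c("SC", "AK", "SC", "IA", "MI", "CO"),
  Phone     = c("336-1043", "373-1028", "359-5091", "385-3540", "386-1131", "408-1513"),
  surv.prob = c(0.9997, 0.9997, 0.9932518, 0.9932518, 0.9926213, 0.9926213)
)

# Customers with the lowest estimated survival probability come first.
ranked <- surviving[order(surviving$surv.prob), ]

# Hypothetical cut-off: flag customers below it for a retention offer.
threshold <- 0.993
ranked$flag_for_offer <- ranked$surv.prob < threshold

print(ranked)
```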
The analysis done here is quite basic, as we only looked at some factors that might influence churning, but it gives a rather intuitive idea of the procedures commonly used in this context. There are also numerous extensions to this analysis. For instance, one could design customer services so that they take into account the fact that a customer has a high probability of churning, e.g. by moving such customers forward in queues when services are solicited or by sending advantageous offers that might influence their decisions. The end game is, after all, to keep existing customers and avoid the costs associated with recruiting new ones to replace them.