One of the biggest challenges in handling data is the presence of measurements that make it difficult to construct a model with a satisfying *goodness of fit* or, more importantly, a reliable *goodness of prediction*. Such measurements can appear anywhere, from production data to individuals responding to surveys. A question that immediately poses itself is “what is an outlier?”. Technically, an outlier is defined as a point that deviates substantially from the rest of the data, but it can be quite a challenge to determine whether a point is “bad” or “good”, whether it should be rejected or kept. In any case, ignoring outliers can lead to bad data modelling and to erroneous conclusions. In other cases, it is precisely the outliers that are of interest, and for this reason they cannot be eliminated: developing an algorithm for detecting fraud in banking or social welfare is exactly the detection of such outliers, transactions that do not fit normal patterns. The moral of all this is that one needs to understand the task at hand, define what is meant by an outlier in the specific setting, and finally choose an adequate way of handling it.

We start by giving two examples: one in which it is unclear whether the outliers should be dropped or not, and one in which the outliers are exactly what needs to be detected and studied.

**Example 1: Tall individuals have higher salaries than shorter ones**

This claim sounds provocative, but we reassure the reader that it is only made for the sake of this example. Some might argue that “yes! Tall people are more confident than others and therefore succeed better in life”, but as far as we know there is no real research on the subject. So, let us assume that we have randomly picked a number of individuals, measured their height (stored as Length) and asked them what their monthly earnings are.

LengthSalary = data.frame(
  Length = c(93.5, 180.5, 180.0, 190.0, 200.5, 216.5, 171.5, 190.5, 191.5,
             173.0, 188.0, 175.0, 179.0, 209.5, 189.0, 185.5, 186.5, 189.0),
  Salary = c(48700, 39300, 90000, 80000, 80000, 64600, 33300, 43800, 47000,
             35000, 45000, 35800, 36400, 59000, 44300, 43800, 44200, 42500)
)

A plot of these reveals the following:
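The plot can be recreated along the following lines (the axis labels and title are our own choices, not part of the original code):

```r
# Scatter plot of salary against height; the three deviating points
# should be clearly visible to the eye
plot(Salary ~ Length, data = LengthSalary,
     xlab = "Length (cm)", ylab = "Monthly salary",
     main = "Salary vs. length")
```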

Now, the plot above seems to corroborate the assumption that there is a linear relationship between the height of an individual and her salary… except for three points. How do we treat these points? This is a tricky question: the result of our study could be flawed if we eliminate these points simply to fit our assumption, but we could reach erroneous conclusions about the strength of the relationship if we keep them. Let us see what the results of a linear regression look like in each of these cases.

We start by leaving the three “outliers” in and performing a linear regression, which gives us the following plot and regression results:

> LengthSalary.mod1 = lm(Salary ~ Length, data = LengthSalary)
> abline(LengthSalary.mod1, col = "red")
> summary(LengthSalary.mod1)

Call:
lm(formula = Salary ~ Length, data = LengthSalary)

Residuals:
   Min     1Q Median     3Q    Max
 -8691  -6164  -5695  -5137  44860

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -75875.2    60294.3  -1.258   0.2263
Length         672.3      319.7   2.103   0.0516 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15530 on 16 degrees of freedom
Multiple R-squared:  0.2166,	Adjusted R-squared:  0.1676
F-statistic: 4.424 on 1 and 16 DF,  p-value: 0.05162

We immediately see that, according to the model, there is no statistically significant relationship between salary and height. But is this really true? Let us see what happens when these points are removed. A plot reveals another story:
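The removal step itself is not shown in the original code; a minimal sketch, assuming the three points have been identified by eye from the plot (the row indices below are placeholders to be replaced by the rows you actually flag):

```r
# Hypothetical row indices of the three deviating points (placeholders only)
outlier.rows = c(1, 3, 4)
LengthSalary.clean = LengthSalary[-outlier.rows, ]
LengthSalary.mod2 = lm(Salary ~ Length, data = LengthSalary.clean)
summary(LengthSalary.mod2)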

> LengthSalary.mod1 = lm(Salary ~ Length, data = LengthSalary)
> abline(LengthSalary.mod1, col = "red")
> summary(LengthSalary.mod1)

Call:
lm(formula = Salary ~ Length, data = LengthSalary)

Residuals:
    Min      1Q  Median      3Q     Max
-2434.8  -318.6   374.0   772.8  1266.9

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -84760.67    5007.13  -16.93 3.08e-10 ***
Length         686.22      26.59   25.80 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1229 on 13 degrees of freedom
Multiple R-squared:  0.9808,	Adjusted R-squared:  0.9794
F-statistic: 665.8 on 1 and 13 DF,  p-value: 1.495e-12

What are we to do in cases like this one? If the data we possess is reduced to information about height and salary only, then scientific honesty requires one to report both results, together with a discussion of the poorness of the survey. Indeed, we omitted to ask the individuals about their age and occupation.

**Example 2: Government fraud**

In the previous example, the presence of outliers made it impossible to reach a conclusion about a simple relationship observed in the data. In other cases, the deviant data is just what one is looking for. Imagine a governmental allowance meant to subsidize the population’s dental care. Dentistry is notoriously expensive, and people avoid seeking care because they feel they cannot afford it. To remedy this problem the state pays, through taxes, for 90 percent of the care performed by dentists. The system functions as follows: patients meet their doctor, the doctor performs the necessary care and registers what has been done, and the state reimburses the dentist for all care costing less than a certain amount of money. Thus, the patient only needs to pay 10 percent of the actual cost of care.

While this seems to be a wonderful thing, it opens the door to many fraudulent activities, such as performing care that was not needed, demanding reimbursement for patients that were never there, or systematically choosing expensive operations… among many others.

To illustrate how deviant behavior is detected, we have created a fictitious set of 10 000 dentists who, over one week, have registered the care they performed. To simplify the task, we have selected 8 different operations, labelled Job1 through Job8. Now, different tasks take different amounts of time and, given that there is a limited number of hours during which an individual can work, it seems reasonable to check whether the total number of hours reported by each dentist makes sense. This, among many other indicators, may be an indication of fraud. In reality, one would design a number of indicators and construct a score-card model in which every indicator has a given weight, and the aggregated score would determine the probability that a given individual is engaging in fraudulent activities.

The table below shows the first ten IDs and the number of times each job has been registered by each dentist.

> head(DataFull, 10)

   ID Job1 Job2 Job3 Job4 Job5 Job6 Job7 Job8
1   1   41   17   14   40   36   56   88   12
2   2   40   23    7   39   37   56   92    8
3   3   36   20   11   40   32   54   87   11
4   4   34   19   17   40   35   56   87   12
5   5   38   22   18   40   36   56   82   10
6   6   42   19   16   40   31   56   90   11
7   7   43   17   12   41   33   55   90   10
8   8   37   19   15   39   35   57   91   12
9   9   41   19   13   41   36   56   89   11
10 10   42   20   14   39   35   55   97    9

To get some sense of the extent to which each individual dentist works, we need to know the duration of each job. We have the following table:

Job   Job1   Job2   Job3   Job4   Job5   Job6   Job7    Job8
Time  15min  30min  45min  75min  35min  10min  120min  30min

The resulting table, with every count converted to minutes, is:

   ID Job1 Job2 Job3 Job4 Job5 Job6  Job7 Job8 Sum (min)
1   1  615  510  630 3000 1260  560 10560  360     34992
2   2  600  690  315 2925 1295  560 11040  240     35334
3   3  540  600  495 3000 1120  540 10440  330     34136
4   4  510  570  765 3000 1225  560 10440  360     34868
5   5  570  660  810 3000 1260  560  9840  300     34010
6   6  630  570  720 3000 1085  560 10800  330     35402
7   7  645  510  540 3075 1155  550 10800  300     35164
8   8  555  570  675 2925 1225  570 10920  360     35616
9   9  615  570  585 3075 1260  560 10680  330     35368
10 10  630  600  630 2925 1225  550 11640  270     36960
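The conversion from counts to minutes can be sketched in a few lines, assuming the counts sit in DataFull and the durations are those of the table above; Full.Time.work is the name the later regression expects:

```r
# Duration of each job in minutes, taken from the table above
job.minutes = c(Job1 = 15, Job2 = 30, Job3 = 45, Job4 = 75,
                Job5 = 35, Job6 = 10, Job7 = 120, Job8 = 30)
# Multiply each job column by its duration and add a row-wise total
Full.Time.work = DataFull
Full.Time.work[, names(job.minutes)] =
  sweep(DataFull[, names(job.minutes)], 2, job.minutes, `*`)
Full.Time.work$Sum = rowSums(Full.Time.work[, names(job.minutes)])
```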

Though we only show the first 10 rows of the data, nothing seems particularly striking about it; that is, no cumulated time for any particular job and dentist stands out as being too large. The same is true for most of the data in the dataset. So, if any of them is cheating the system, how do we find out? There are as many methods to detect unusual patterns as there are ways to fool the system. R offers a number of packages such as “**outliers**” or “**mvoutlier**”, but in this particular case I believe that using Cook’s distance is a more appropriate approach. Now, how does that work? In the previous example we tried to fit a linear regression model to a set of points, and we observed that data points with large residuals (what we here call outliers) may influence the outcome and accuracy of a regression in a negative way. Cook’s distance measures the effect of deleting a given observation on the result of the regression, and the points with a large Cook’s distance are the points that we should consider abnormal given the general behavior of our dentists. I encourage the reader to pick up almost any book on statistics to understand the mathematical workings of the method. We shall here concentrate on the practical way of using it to quickly obtain a list of potential fraudsters. To simplify our example we focus on one of the tasks performed by the 10 000 professionals, Job8. We do this because we have, for instance, been given indications by patients and experts that it is an operation that is both costly and hard to prove was never performed. We therefore decide to investigate how Job8 influences the total sum of worked minutes per dentist, that is, whether there is a relation between the total number of minutes worked and Job8.
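For the curious, Cook’s distance for observation i can be computed by brute force: refit the model without observation i and measure how much all fitted values move, scaled by the number of parameters p and the estimated residual variance s². The sketch below, on made-up data (none of the names come from our dentist example), checks that this leave-one-out computation agrees with R’s built-in cooks.distance():

```r
# Made-up data for illustration
set.seed(1)
x = 1:20
y = 2 * x + rnorm(20)
fit = lm(y ~ x)
p  = length(coef(fit))          # number of model parameters
s2 = summary(fit)$sigma^2       # estimated residual variance
manual = sapply(seq_along(x), function(i) {
  fit.i  = lm(y[-i] ~ x[-i])                 # refit without observation i
  yhat.i = cbind(1, x) %*% coef(fit.i)       # fitted values from the reduced model
  sum((fitted(fit) - yhat.i)^2) / (p * s2)   # Cook's distance for observation i
})
all.equal(as.numeric(cooks.distance(fit)), manual)  # should be TRUE
```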

mod = lm(Sum ~ Job8, data = Full.Time.work)
cooksd = cooks.distance(mod)
CooksDist = as.data.frame(cooksd)
CooksDist.id = cbind(ID, CooksDist)

The last line attaches the Cook’s distance of every dentist to his or her ID. It can be quite a challenge to determine which IDs are worth considering for further analysis. We therefore need a little more information, or rules, to focus on the relevant data.

plot(cooksd, pch = "+", cex = 1, main = "Influential observations by Cook's distance")
abline(h = 4*mean(cooksd, na.rm = TRUE), col = "red")
text(x = 1:length(cooksd) + 1, y = cooksd,
     labels = ifelse(cooksd > 4*mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red")

As in many other areas of statistical analysis, there are bounds that are given as general rules of thumb. When it comes to the influence of data points on a linear model, one usually considers points with a Cook’s distance greater than 4 times the mean distance as points influencing the regression. We adopt that rule here, but can change the parameters after having investigated these particular individuals.

The above code gives the following plot, where the points having a Cook’s distance greater than 4 times the mean are marked in red. To keep the shortlist manageable, we extract below the IDs of the points exceeding a stricter cutoff of 8 times the mean.

Outlier.ids = as.vector(which(CooksDist.id$cooksd > 8*mean(cooksd)))
Outlier.ids
6888 6918 6931 6951 6998 7014 7034 7055 7056 7086 7087 7127 7128 7174 7190
7266 7269 7437 7529 7552 7577 7584 7610 7616 7638 7746 7772 7788 8095 8170

The same procedure can be repeated for any of the jobs (i = 1, 2, …, 8) or for combinations of them, depending on the setting. Cook’s distance gives an indication of which data points may need to be considered. It is, however, no proof of fraud; that step demands further, more investigative, work. But in organisations where millions of transactions are performed daily, designing statistical methods to isolate “unusual” events is essential and cheap: investigating every transaction is too costly, and randomly picking transactions most often gives too few satisfactory results.
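Repeating the screen for every job is a short loop. The sketch below reuses the 4-times-the-mean rule and collects the flagged IDs per job; it assumes the Full.Time.work frame from above, with an ID column and columns Job1 through Job8:

```r
# For each job, regress the total minutes on that job's minutes,
# compute Cook's distances and keep the IDs above the cutoff
flagged = lapply(paste0("Job", 1:8), function(job) {
  f = as.formula(paste("Sum ~", job))
  d = cooks.distance(lm(f, data = Full.Time.work))
  Full.Time.work$ID[d > 4 * mean(d, na.rm = TRUE)]
})
names(flagged) = paste0("Job", 1:8)
```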

In conclusion, the detection of outliers is essential to a good understanding of data, and determining the aim of an analysis also determines how the outliers should be dealt with.