On the importance of outlier detection

One of the biggest challenges in handling data is the presence of measurements that make it difficult to construct a model with a satisfying goodness of fit or, more importantly, a reliable goodness of prediction. Such measurements can be anything from readings in production data to individuals responding to surveys. A question that immediately arises is “what is an outlier?”. Technically, an outlier is a point that deviates substantially from the rest of the data, but it can be quite a challenge to determine whether a point is “bad” or “good”, and whether it should be rejected or not. In any case, ignoring outliers can lead to poor models and erroneous conclusions. In other cases, the outliers themselves are what is of interest, and for that reason they cannot be eliminated. Developing an algorithm for detecting fraud in banking or social welfare amounts to detecting exactly these outliers: transactions that do not fit normal patterns. The moral is that one needs to understand the task at hand, define what is meant by an outlier in the specific setting, and finally choose an adequate way of handling them.

We start by giving two examples: one in which it is unclear whether the outliers should be dropped, and one in which the outliers are precisely what needs to be detected and studied.

Example 1: Tall individuals have higher salaries than shorter ones

This claim sounds provocative, but we reassure the reader that it is only made for the sake of this example. Some might argue that “yes! Tall people are more confident than others and therefore succeed better in life” but, as far as we know, no real research has been done on the subject. So, let’s assume that we have randomly picked a number of individuals, measured their height (called Length below) and asked them what their monthly earnings are.

LengthSalary = data.frame(
Length = c(193.5, 180.5, 180.0, 190.0, 200.5, 216.5, 171.5, 190.5, 191.5, 173.0, 188.0, 175.0, 179.0, 209.5, 189.0, 185.5,
186.5, 189.0),
Salary = c(48700, 39300, 90000, 80000, 80000, 64600, 33300, 43800, 47000, 35000, 45000, 35800, 36400, 59000, 44300, 43800, 44200, 42500)
)

A plot of these reveals the following:

[Figure: earnings plotted against length]

Now, the plot above seems to corroborate the assumption that there is a linear relationship between the length of an individual and her salary, except for three points. How do we treat these points? This is a tricky question: if we eliminate these points simply to fit our assumption, the result of our study could be flawed, and if we keep them, we could reach erroneous conclusions about the strength of the relationship. Let us see what the results of a linear regression look like in each of these cases.

We start by keeping the three “outliers” and performing a linear regression, which gives us the following plot and regression results:

[Figure: the data with the fitted regression line, all points included]

> LengthSalary.mod1 = lm(Salary ~ Length, data = LengthSalary)
> abline(LengthSalary.mod1, col="red")
> summary(LengthSalary.mod1)
Call:
lm(formula = Salary ~ Length, data = LengthSalary)
Residuals:
   Min     1Q Median     3Q    Max
 -8691  -6164  -5695  -5137  44860
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -75875.2    60294.3  -1.258   0.2263
Length         672.3      319.7   2.103   0.0516 .
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15530 on 16 degrees of freedom
Multiple R-squared: 0.2166, Adjusted R-squared: 0.1676
F-statistic: 4.424 on 1 and 16 DF, p-value: 0.05162

We immediately see that, according to the model, the relationship between salary and length is not statistically significant at the 5% level (p = 0.052). But is this really true? Let us see what happens when these three points are removed. A plot reveals another story:

[Figure: the data with the fitted regression line, outliers removed]

> LengthSalary.mod1 = lm(Salary ~ Length, data = LengthSalary)
> abline(LengthSalary.mod1, col="red")
> summary(LengthSalary.mod1)
Call:
lm(formula = Salary ~ Length, data = LengthSalary)
Residuals:
    Min      1Q  Median      3Q     Max
-2434.8  -318.6   374.0   772.8  1266.9
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -84760.67    5007.13  -16.93 3.08e-10 ***
Length         686.22      26.59   25.80 1.49e-12 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1229 on 13 degrees of freedom
Multiple R-squared: 0.9808, Adjusted R-squared: 0.9794
F-statistic: 665.8 on 1 and 13 DF, p-value: 1.495e-12

What are we to do in cases like this one? If the data we have is limited to information about length and salary, then scientific honesty requires us to report both results, together with a discussion of the shortcomings of the survey. Indeed, we neglected to ask the individuals about their age and occupation.
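Reporting both results side by side could look as follows. This is only a sketch: the data frame is retyped from the listing above, with the first Length taken as 193.5 (the value the printed regression output implies), and the three “outliers” are assumed to be the points with a salary of 80 000 or more. The names mod.all, mod.trim and describe are ours.

```r
# Data retyped from Example 1; first Length taken as 193.5 (assumption)
LengthSalary <- data.frame(
  Length = c(193.5, 180.5, 180.0, 190.0, 200.5, 216.5, 171.5, 190.5, 191.5,
             173.0, 188.0, 175.0, 179.0, 209.5, 189.0, 185.5, 186.5, 189.0),
  Salary = c(48700, 39300, 90000, 80000, 80000, 64600, 33300, 43800, 47000,
             35000, 45000, 35800, 36400, 59000, 44300, 43800, 44200, 42500)
)

# Assumption: the three "outliers" are the points with a salary >= 80 000
is.outlier <- LengthSalary$Salary >= 80000

mod.all  <- lm(Salary ~ Length, data = LengthSalary)
mod.trim <- lm(Salary ~ Length, data = LengthSalary[!is.outlier, ])

# Number of points, slope, its p-value and R-squared for one fit
describe <- function(mod) {
  s <- summary(mod)
  c(n         = nobs(mod),
    slope     = unname(coef(mod)["Length"]),
    p.value   = s$coefficients["Length", "Pr(>|t|)"],
    r.squared = s$r.squared)
}

both <- rbind(all.points       = describe(mod.all),
              without.outliers = describe(mod.trim))
print(signif(both, 4))
```

Presenting the two rows together lets the reader see how much the conclusion hinges on three observations.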

Example 2: Government fraud

In the previous example, the presence of outliers made it impossible to reach a conclusion about a simple relationship observed in the data. In other cases, the deviant data is just what one is looking for. Imagine a government allowance meant to subsidise the population’s dental care. Dentistry is notoriously expensive and people avoid seeking care because they feel they cannot afford it. To remedy this problem the state pays, through taxes, for 90 percent of the care performed by dentists. The system works as follows: patients see their dentist, the dentist performs the necessary care and registers what has been done, and the state reimburses the dentist for all care costing less than a certain amount of money. Thus, the patient only needs to pay 10 percent of the actual cost of care.

Although this seems to be a wonderful thing, it opens the door to many fraudulent activities, such as performing care that was not needed, demanding reimbursement for patients who were never there, or consistently choosing expensive operations, among many others.

[Figure: scheme of the reimbursement system]

To illustrate how deviant behavior is detected, we have created a fictitious set of 10 000 dentists who have registered the care they performed over one week. To simplify the task, we have selected 8 different operations labelled Job1 through Job8. Now, different tasks take different amounts of time and, given that there is a limited number of hours during which an individual can work, it seems reasonable to check whether the total number of hours reported by each dentist makes sense. This, among many other indicators, may be an indication of fraud. In reality, one would design a number of indicators and construct a scorecard model in which every indicator has a given weight, and the aggregated score would determine the probability of a given individual engaging in fraudulent activities.

The table below shows the first ten IDs and the number of times each job has been registered by each dentist.

> head(DataFull, 10)

    ID Job1 Job2 Job3 Job4 Job5 Job6 Job7 Job8
1   1   41   17   14   40   36   56   88   12
2   2   40   23    7   39   37   56   92    8
3   3   36   20   11   40   32   54   87   11
4   4   34   19   17   40   35   56   87   12
5   5   38   22   18   40   36   56   82   10
6   6   42   19   16   40   31   56   90   11
7   7   43   17   12   41   33   55   90   10
8   8   37   19   15   39   35   57   91   12
9   9   41   19   13   41   36   56   89   11
10 10   42   20   14   39   35   55   97    9

To get a sense of how much each individual dentist works, we need to know the duration of each job. We have the following table:

Job      Job1     Job2     Job3     Job4     Job5     Job6     Job7     Job8
Time     15min    30min    45min    75min    35min    10min    120min   30min

The resulting table, with all times in minutes, is:

   ID Job1 Job2 Job3 Job4 Job5 Job6  Job7 Job8 Sum (min)
1   1  615  510  630 3000 1260  560 10560  360     17495
2   2  600  690  315 2925 1295  560 11040  240     17665
3   3  540  600  495 3000 1120  540 10440  330     17065
4   4  510  570  765 3000 1225  560 10440  360     17430
5   5  570  660  810 3000 1260  560  9840  300     17000
6   6  630  570  720 3000 1085  560 10800  330     17695
7   7  645  510  540 3075 1155  550 10800  300     17575
8   8  555  570  675 2925 1225  570 10920  360     17800
9   9  615  570  585 3075 1260  560 10680  330     17675
10 10  630  600  630 2925 1225  550 11640  270     18470
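The conversion from registered counts to minutes can be sketched as follows. The counts are retyped from the first ten rows shown earlier and the durations from the Job/Time table; the object names counts, durations and minutes are ours, and Full.Time.work is assumed to have this layout. The total is taken over the eight job columns only, so that the ID column does not leak into the sum.

```r
# First ten dentists, retyped from the table of registered counts
counts <- data.frame(
  ID   = 1:10,
  Job1 = c(41, 40, 36, 34, 38, 42, 43, 37, 41, 42),
  Job2 = c(17, 23, 20, 19, 22, 19, 17, 19, 19, 20),
  Job3 = c(14,  7, 11, 17, 18, 16, 12, 15, 13, 14),
  Job4 = c(40, 39, 40, 40, 40, 40, 41, 39, 41, 39),
  Job5 = c(36, 37, 32, 35, 36, 31, 33, 35, 36, 35),
  Job6 = c(56, 56, 54, 56, 56, 56, 55, 57, 56, 55),
  Job7 = c(88, 92, 87, 87, 82, 90, 90, 91, 89, 97),
  Job8 = c(12,  8, 11, 12, 10, 11, 10, 12, 11,  9)
)

# Duration of each job in minutes
durations <- c(Job1 = 15, Job2 = 30, Job3 = 45, Job4 = 75,
               Job5 = 35, Job6 = 10, Job7 = 120, Job8 = 30)
jobs <- names(durations)

# Multiply every job column by its duration
minutes <- sweep(as.matrix(counts[, jobs]), 2, durations, `*`)

# Total minutes per dentist, summed over the job columns only
Full.Time.work <- data.frame(ID = counts$ID, minutes, Sum = rowSums(minutes))
print(Full.Time.work)
```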

Though we only show the first 10 rows of the data, there doesn’t seem to be anything particularly striking about them: no cumulated time for any particular job and dentist sticks out as being too large. The same is true of most of the data in the dataset. So, if any of the dentists is cheating the system, how do we find them? There are as many methods to detect unusual patterns as there are ways to fool the system. R offers a number of packages such as “outliers” or “mvoutlier”, but in this particular case, I believe that using Cook’s distance is a more appropriate approach.

Now, how does that work? If we recall the previous example, we tried to fit a linear regression model to a set of points. As we observed, data points with large residuals (what we here call outliers) may harm the outcome and accuracy of a regression. Cook’s distance measures the effect of deleting a given observation on the result of the regression, and the points with a large Cook’s distance are the points that we should consider abnormal given the general behavior of our dentists. I encourage readers to pick up almost any book on statistics to understand the mathematical workings of the method. We shall here concentrate on the practical ways to use this to quickly obtain a list of potential fraudsters.

To simplify our example we focus on one of the tasks performed by the 10 000 professionals, Job8. We do this because we have, for instance, been given indications by patients and experts that it is an operation that is both costly and difficult to prove has not been performed. We therefore decide to investigate how Job8 influences the total number of minutes worked per dentist, that is, whether there is a relation between the total number of minutes worked and Job8.
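To make “the effect of deleting a given observation” concrete before turning to the dentists, the following sketch recomputes Cook’s distance by hand on the data of Example 1 and compares it with R’s built-in cooks.distance(). It assumes the usual textbook definition: for observation i, the squared differences between the fitted values of the full model and those of the model refitted without i are summed and divided by p times s², where p is the number of coefficients and s² the residual variance of the full fit. The first Length is taken as 193.5, the value the printed regression output implies, and the name cook.by.hand is ours.

```r
# Data retyped from Example 1; first Length taken as 193.5 (assumption)
LengthSalary <- data.frame(
  Length = c(193.5, 180.5, 180.0, 190.0, 200.5, 216.5, 171.5, 190.5, 191.5,
             173.0, 188.0, 175.0, 179.0, 209.5, 189.0, 185.5, 186.5, 189.0),
  Salary = c(48700, 39300, 90000, 80000, 80000, 64600, 33300, 43800, 47000,
             35000, 45000, 35800, 36400, 59000, 44300, 43800, 44200, 42500)
)

mod <- lm(Salary ~ Length, data = LengthSalary)
p   <- length(coef(mod))        # number of estimated coefficients
s2  <- summary(mod)$sigma^2     # residual variance of the full fit
fitted.all <- fitted(mod)

# Refit without observation i and measure how far all fitted values move
cook.by.hand <- sapply(seq_len(nrow(LengthSalary)), function(i) {
  mod.i    <- lm(Salary ~ Length, data = LengthSalary[-i, ])
  fitted.i <- predict(mod.i, newdata = LengthSalary)
  sum((fitted.all - fitted.i)^2) / (p * s2)
})

# The hand-made values should agree with the built-in function
print(all.equal(unname(cooks.distance(mod)), cook.by.hand))

# Rows whose distance exceeds 4 times the mean
print(which(cook.by.hand > 4 * mean(cook.by.hand)))
```

Refitting the model once per observation is only practical for small data sets; cooks.distance() obtains the same quantity from the residuals and leverages of a single fit.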

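# Regress total minutes on Job8 minutes and compute Cook's distance per dentist
mod = lm(Sum ~ Job8, data = Full.Time.work)
cooksd = cooks.distance(mod)
CooksDist = as.data.frame(cooksd)
# Attach the dentist IDs (assumes Full.Time.work has an ID column)
CooksDist.id = cbind(ID = Full.Time.work$ID, CooksDist)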

The last line gives us the Cook’s distance for every dentist. It can be quite a challenge to determine which IDs are worth considering for further analysis. We therefore need a little more information, or rules, to focus on relevant data.

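# Plot Cook's distance per observation, with a cutoff line at 4 times the mean
plot(cooksd, pch="+", cex=1, main="Influential observations by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=TRUE), col="red")
# Label the observations above the cutoff
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd > 4*mean(cooksd, na.rm=TRUE), names(cooksd),""), col="red")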

As in many other areas of statistical analysis, there are bounds that serve as general rules of thumb. When it comes to the influence of data points on a linear model, one usually considers points with a Cook’s distance greater than 4 times the mean to be influential. We adopt that rule, but may change the parameters after having investigated these particular individuals.

The above code gives the following plot, where the points with a Cook’s distance greater than 4 times the mean are labelled in red. The IDs listed below the graph were obtained with a stricter cutoff of 8 times the mean.

[Figure: influential observations by Cook’s distance, influence of Job8 on Sum]

Outlier.ids= as.vector(which(CooksDist.id$cooksd >8*mean(cooksd)))
Outlier.ids
6888 6918 6931 6951 6998 7014 7034 7055 7056 7086 7087 7127 7128 7174 7190 7266 7269 7437 7529 7552 7577 7584 7610 7616 7638 7746 7772 7788 8095 8170

The same procedure can be repeated for any of the jobs (i = 1, 2, …, 8) or for combinations of them, depending on the setting. Cook’s distance gives an indication of which data points may need to be considered. It is, however, no proof of fraud; that step demands further, more investigative, work. But in organisations where millions of transactions are performed daily, designing statistical methods to isolate “unusual” events is essential and cheap. Investigating every transaction is too costly, and picking transactions at random most often gives too few satisfactory results.
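Repeating the procedure for every job can be sketched with a simple loop. To keep the example self-contained it uses only the ten dentists shown in the tables, so it illustrates the mechanics rather than a realistic screening; the layout assumed for Full.Time.work, the 4 times the mean cutoff and the name flagged are our choices.

```r
# Registered counts for the first ten dentists, one row per dentist
counts <- matrix(c(41, 17, 14, 40, 36, 56, 88, 12,
                   40, 23,  7, 39, 37, 56, 92,  8,
                   36, 20, 11, 40, 32, 54, 87, 11,
                   34, 19, 17, 40, 35, 56, 87, 12,
                   38, 22, 18, 40, 36, 56, 82, 10,
                   42, 19, 16, 40, 31, 56, 90, 11,
                   43, 17, 12, 41, 33, 55, 90, 10,
                   37, 19, 15, 39, 35, 57, 91, 12,
                   41, 19, 13, 41, 36, 56, 89, 11,
                   42, 20, 14, 39, 35, 55, 97,  9),
                 nrow = 10, byrow = TRUE,
                 dimnames = list(NULL, paste0("Job", 1:8)))

# Minutes per job and total minutes per dentist
durations <- c(15, 30, 45, 75, 35, 10, 120, 30)
minutes <- sweep(counts, 2, durations, `*`)
Full.Time.work <- data.frame(ID = 1:10, minutes, Sum = rowSums(minutes))

# For each job: regress Sum on that job and keep the IDs above the cutoff
jobs <- colnames(counts)
flagged <- lapply(setNames(jobs, jobs), function(job) {
  mod <- lm(reformulate(job, response = "Sum"), data = Full.Time.work)
  cooksd <- cooks.distance(mod)
  Full.Time.work$ID[which(cooksd > 4 * mean(cooksd, na.rm = TRUE))]
})
print(flagged)
```

Each element of flagged holds the IDs that stand out for one job; a dentist who appears under several jobs would be a natural candidate for a closer look.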

In conclusion, the detection of outliers is essential to a good understanding of data, and the aim of an analysis determines how outliers should be handled and dealt with.

 
