An easy way to deal with Missing Data – Imputation by Regression

Introduction

I was recently asked to give a talk for junior data scientists about analytics and machine learning. As much as I like to speak publicly, I was scratching my head about what I could offer these young minds, at least in terms of novel knowledge. Let's face it: these people are fresh out of school and everything has recently entered their minds. So, talking about random forest models and neural networks with some examples would just feel like another boring lecture or textbook demo, a total waste of both their time and mine. That's when it suddenly hit me! Who has ever worked on a real-life project in which everything was a textbook example going smoothly from start to finish? I have very seldom worked with data so clean that I could develop a model by following best-case-scenario textbook instructions without encountering a myriad of problems that needed to be solved on the fly. In the cases where it actually did work like a charm, the data had been prepared beforehand. This experience can be daunting, and everyone should be shown cases where things can go really bad, and be given the skills to deal with them. At least the experience has the merit of being humbling and inspiring at the same time.

One of the classic issues encountered is data, or rather the lack thereof. In a previous post, On the importance of outlier detection, I discussed the problems that some data might cause and how they should be handled. But what happens when the real issue is the lack of data? Are there textbook examples of how to handle missing data? Should one completely ignore features or variables with missing data? Can it be imputed, and if so, how should it be done?

This blog post is an attempt at shedding some light on a number of ways to deal with missing data and at giving the reader tools that alleviate at least some of the anguish associated with these problems. To avoid a far too lengthy post (which I already suspect it is), I will mainly concentrate on one method, namely, imputation by regression.

As for my talk to my younger colleagues, I decided to show them two examples in which the data either did not exist or was so compromised that the methods in this blog post would not have solved the problem. In the latter case, data cleansing had to be done before even starting the real work.

There is no data – getting it


Obviously, this is an extreme case, but not an uncommon one either. There are projects that actually require the painstaking task of acquiring the necessary data. This can happen for a number of reasons. A while ago, I was involved in a project in which we wanted to segment a population into groups of individuals with specific needs in health care. This had never been done previously, and the data needed to complete the task included a collection of characteristics that had to be obtained as blocks (i.e. surveys might have been done on particular features, but not in connection with many seemingly unrelated dimensions). I discussed this particular project in a previous post, Artificial Neural Network and Patient Segmentation.

Machine Learning and AI are two hot topics, and many businesses want to jump on the train and become early adopters. This is quite understandable, but they often have little or no knowledge of what is required to implement solutions that will give them an edge in a competitive market. They want to be given AI but do not understand that it isn't a magic wand that will solve all their problems, nor that AI and ML in general require large amounts of data, an ore that they sometimes lack. There is no shortcut and no magical solution to this.

Fortunately, these are extreme cases, although it can be a blessing for an analyst to know that the data gathered is as it should be from the start, if given the opportunity to design the entire project from data collection to the application of a model.

Reality

Most of the time, the situation lies somewhere between "no data" and "perfect data", with all the possible shades of crappy in between. Most often, the level of crappiness has to do with missing data and/or erroneous values. Both the data scientist's goals and the amount of missing data determine which methods to use to remedy the problem. An important observation is that complete datasets (by which I mean datasets that have not been modified) are rare. Somehow, things go wrong and data goes missing. Nevertheless, most published articles or analyses exhibit datasets with full data. Indeed, the best way to handle missing data is not to have it to begin with. This is very unlikely, and authors very seldom give any indication of how they have dealt with their missing observations. This raises a long list of questions about the validity of the conclusions drawn in some studies. Although it might be understandable that deleting missing data (and omitting to admit having done so) is a tempting quick fix, it casts a shadow on whatever is done from that point on.

Types of missing data

Missing data can broadly be classified into three types:

MCAR (Missing Completely At Random) means that there is nothing systematic about why some data is missing. That is, there is no relationship between the fact that data is missing and either the observed or unobserved covariates. This simply means that the observed non-missing data is a random sample of the entire population, and that any analysis made on the existing data is unbiased. A fairly easy example of MCAR data would be measures taken by battery-driven instruments. If the batteries run out of juice, measurements will be missing until they are replaced. It has nothing to do with what is measured or who operates the instrument.

MAR (Missing At Random) resembles MCAR because there is still an element of randomness. However, it is a slightly weaker condition than MCAR (meaning that MCAR implies MAR, but not the inverse). The missingness is still random but can have some relationship with other variables in the data. Two common examples are the unwillingness of some people (often those with higher socioeconomic status) to give information about their earnings, and the propensity of women not to give their weight in surveys. The element of randomness lies in the fact that socioeconomic status or gender does not determine who answers: within a class or gender, people randomly choose not to answer.

MNAR (Missing Not At Random) implies that the fact that data is missing is directly correlated with the value of the missing data. This can happen for multiple reasons. Instruments, for example, can have a limited range of observation and anything measured falling out of this range will be recorded as missing.
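To make the three mechanisms concrete, here is a small simulation sketch using entirely made-up data (the variable names `income` and `status` are my own illustration, not from any dataset used in this post):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50, 10, n)   # the variable we will punch holes in
status = rng.normal(0, 1, n)     # a fully observed covariate

# MCAR: every value has the same 20% chance of being missing,
# independent of everything else
mcar = income.copy()
mcar[rng.random(n) < 0.2] = np.nan

# MAR: missingness depends only on the observed covariate (high status),
# not on the income value itself
mar = income.copy()
mar[(status > 1) & (rng.random(n) < 0.6)] = np.nan

# MNAR: high incomes themselves are more likely to be withheld,
# so missingness depends on the unobserved value
mnar = income.copy()
mnar[(income > 60) & (rng.random(n) < 0.6)] = np.nan
```

Under MCAR the observed values remain a random sample, so their mean stays close to the true mean; under MNAR the observed mean is biased downwards, because it is precisely the large values that are missing.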

Now, a world in which data would be perfect, or at least of MCAR type, would be a wonderful place, but MCAR data is a rare thing. Given a dataset, are there ways to determine the nature of the missing data? Are there methods to test the data so as to make an intelligent judgement call and classify the type of missingness? The answer is a "well, not really, but..."-type of answer. There are things you can do to get hints, but they are in no way a clear-cut answer. For instance, to test whether missing data is MCAR, one would usually perform Little's test. It tests the null hypothesis that the missing data is MCAR. A p-value of less than 0.05 is usually interpreted as meaning that the missing data is not MCAR, i.e. that it is either MAR or MNAR. An example of how it works is provided in the R package BaylorEdPsych on the dataset EndersTable1_1.


The dataset contains information about individuals' IQ, job performance (JP) and psychological well-being (WB). The code to test whether the missing data is MCAR is simple:

library(BaylorEdPsych)
library(mvnmle)
data(EndersTable1_1)
LittleMCAR(EndersTable1_1)

and the interesting output is

this could take a while
$chi.square
[1] 14.63166

$df
[1] 5

$p.value
[1] 0.01205778

$missing.patterns
[1] 4

$amount.missing
IQ JP WB
Number Missing 0 10.0 3.00
Percent Missing 0 0.5 0.15

As can be seen, the p-value is indeed less than 0.05, so Little's test gives an indication that the missing data is not MCAR, i.e. that it is either MAR or MNAR. However, it should be stressed that this is an INDICATION, not a proof. The best tip, really, is to educate yourself by reading on all possible ways to identify (or at least get some feeling for) which type of missing data you are dealing with.
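Before reaching for a formal test, it is often useful just to tabulate the missingness patterns. As a sketch (on a small hypothetical dataset of my own, echoing the IQ/JP/WB columns above), pandas can do this in two lines:

```python
import numpy as np
import pandas as pd

# A toy dataset with holes in two columns (values are made up)
df = pd.DataFrame({
    "IQ": [100, 110, np.nan, 95, 120],
    "JP": [8, np.nan, 7, np.nan, 9],
    "WB": [6, 7, 5, np.nan, 8],
})

# Count missing values per column
print(df.isnull().sum())

# Tabulate the distinct row-wise missingness patterns (True = missing);
# many patterns or patterns tied to specific columns hint at MAR/MNAR
patterns = df.isnull().value_counts()
print(patterns)
```

This gives the same kind of "missing patterns" and "amount missing" summary that LittleMCAR prints, without the hypothesis test.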

How to deal with it

The obvious way – Just delete missing entries

As we pointed out above, the temptation to simply delete missing values is strong. Why? Because it simplifies life and we are lazy by nature. And also because it sometimes is the right thing to do, if you are honest and careful. There are of course two choices, depending on how the data is missing. The first is to delete rows (i.e. remove observations) with missing data and the other is to delete entire columns (i.e. remove variables). In the first case, if the number of rows containing missing values is large compared to the size of the dataset, it could mean trouble for the analysis to be performed. In the same way, eliminating an entire column doesn't come without issues. Indeed, there might be reasons for which the values are missing, and the deletion of a variable might introduce biases. Do check these things before simply deleting rows or columns. We can summarize this method in the following way.

PROS: The complete removal of data with missing values results in a very accurate model... for the data it has actually been presented.

Deleting individual rows has no effect on the results of models IF it can be shown that the rows containing missing data do not share characteristics (i.e. there is no systematic reason for which these rows are missing data), that is, if it can be shown or reasonably believed that the missing data is MCAR. The same goes for entire columns (variables).

CONS: As pointed out above, the risk of huge losses of information is substantial. The results of any analysis done on, or any model built from, the remainder of the dataset after missing-data removal cannot be guaranteed to reflect the truth, unless it is proven that the missing data is MCAR (which it seldom is).
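Both flavours of deletion are one-liners in pandas. A minimal sketch, on a made-up DataFrame, showing how much data each approach sacrifices:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": [np.nan, np.nan, np.nan, 0.5],
})

# Listwise deletion: drop every row containing at least one NaN
rows_dropped = df.dropna()

# Column deletion: drop every column containing at least one NaN
cols_dropped = df.dropna(axis=1)

# Compare the shapes to see what each approach costs
print(df.shape, rows_dropped.shape, cols_dropped.shape)
```

On this toy example, row deletion keeps a single observation out of four, and column deletion keeps a single variable out of three, which illustrates the CONS point above: the information loss can be drastic.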

Replacing missing values with the mean or median

There is always the possibility to replace missing data with the mean (or median) of a variable, or even with other functions of it. The evident problem with these kinds of approaches is that the distribution of values is not taken into account, even less the relationship of that particular feature with all other features in the dataset. So, unless you are very confident about the viability of this kind of imputation, beware.
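As a sketch, on a toy column with made-up horsepower values, both variants are a single fillna call in pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"horsepower": [130.0, 165.0, np.nan, 150.0, np.nan]})

# Mean imputation: fill the holes with the column mean
mean_filled = df["horsepower"].fillna(df["horsepower"].mean())

# Median imputation: more robust if the column is skewed or has outliers
median_filled = df["horsepower"].fillna(df["horsepower"].median())

print(mean_filled.tolist())
```

Note that every missing entry receives the same value, which shrinks the variance of the feature and flattens its distribution; this is exactly the drawback described above.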

Imputation by Regression

While the examples above are easy to perform, they do come with an array of problems and questions. There is, however, a more elegant way to handle a situation in which data is missing for several features and the relationship between the variables is not evident. The end-game is to replace the missing values with predicted values, the predictions being made using a linear regression model built from the non-missing part of the dataset. This approach cannot, however, be used directly if missing data occurs in several features. Although this is a problem, it can be solved in a neat way, as will be seen in the example we are going to work through now.

To illustrate the method, I simply downloaded the auto mpg (miles per gallon) dataset from the UCI Machine Learning Repository.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as mno
from sklearn import linear_model
%matplotlib inline

df = pd.read_csv("....../MissingData/auto_mpg_datasetk4.csv")
df.head(10)

datinfo

To get a better view of the distribution of values of each variable we produce a description of the dataset:

df.describe()

datdescribe

We can see here that the variable "cylinders" has a maximum value of 100 (there are no vehicles with 100 cylinders), the variable "horsepower" has a maximum of 100000 (which is just as unlikely) and, finally, the "weight" of a car cannot be 0. Obviously, these correspond to missing values (an unconventional way to mark missing values; usually -1, 99999 or NULL is used). The first step is to replace all these missing values with NaN and count the number of instances of NaN for each variable.

df.loc[df["horsepower"]==100000.0,"horsepower"]=np.NAN
df.loc[df["cylinders"]==100.000000,"cylinders"]=np.NAN
df.loc[df["weight"]==0.000000,"weight"]=np.NAN

df.isnull().sum()[0:9]

cylinders       25
displacement     0
horsepower      37
weight          42
acceleration     0
model_year       0
origin           0
car_name         0
mpg              0
dtype: int64

A neat way to visualize the extent to which values are missing is to use the missingno Python package and its mno.matrix function:

mno.matrix(df, figsize = (20, 8))

datmnomatrix

Having done this, we can proceed with the imputation of data. Regression imputation, however, is a tricky thing, and it should NEVER be used directly when several variables are missing data. The problem is that variables might be correlated, and if one attempts to impute one variable using another (correlated) variable that also lacks data, problems will just add up. So, how do we deal with this in a good and mathematically correct manner? Is there a way around it?

A reasonable approach is a two-step method: simple random imputation (filling each hole with a randomly chosen observed value of that variable) followed by imputation by regression of each variable. We start by creating a list of the columns lacking data:

missing_columns = ["cylinders", "horsepower", "weight"]

and create a function for the random imputation:

def rimputation(df, feature):
    number_missing = df[feature].isnull().sum()
    observed_values = df.loc[df[feature].notnull(), feature]
    df.loc[df[feature].isnull(), feature + '_imputed'] = np.random.choice(observed_values, number_missing, replace=True)
    return df

which gives us

for feature in missing_columns:
    df[feature + '_imputed'] = df[feature]
    df = rimputation(df, feature)


Remember that these values are randomly chosen from the non-missing data in each column. The next step is where we actually attempt to predict what the values would have been had they been measured correctly. To do so, we build a linear regression model, first dropping variables that cannot be used. In this case, the name of the car is irrelevant to the regression and can therefore be omitted. There could be other such variables, but I haven't bothered checking.

df = df.drop(['car_name'], axis=1)
deter_data = pd.DataFrame(columns=["Deterministic" + name for name in missing_columns])

for feature in missing_columns:
    deter_data["Deterministic" + feature] = df[feature + "_imputed"]
    parameters = list(set(df.columns) - set(missing_columns) - {feature + '_imputed'})

    # Linear regression model to predict missing data
    model = linear_model.LinearRegression()
    model.fit(X=df[parameters], y=df[feature + '_imputed'])
    deter_data.loc[df[feature].isnull(), "Deterministic" + feature] = model.predict(df[parameters])[df[feature].isnull()]

deter_data


Check that no data is missing by using mno.matrix(deter_data, figsize=(20, 5)) as above. When dealing with a set of data, often the first thing you'll want to do is get a sense of how the variables are distributed. A good way to do this is with the seaborn (sns) package. And while you're at it, a box plot to check that nothing weird has happened doesn't hurt.

sns.set()
fig, axes = plt.subplots(nrows=3, ncols=2)
fig.set_size_inches(8, 8)

for index, variable in enumerate(["cylinders", "weight", "horsepower"]):
    sns.distplot(df[variable].dropna(), kde=False, ax=axes[index, 0])
    sns.distplot(deter_data["Deterministic" + variable], kde=False, ax=axes[index, 0], color='red')
    sns.boxplot(data=pd.concat([df[variable], deter_data["Deterministic" + variable]], axis=1), ax=axes[index, 1])

plt.tight_layout()


As you can see, it worked like a charm. Now, one might rightly argue that the dataset was not the most untidy I could have chosen. But I did not cherry-pick values to remove. Another argument against the method might be that the variance of each feature wasn't very high, and that the method may not be as robust if the values had a wide spread. However, that goes for any method, and the use of a regression model to predict values has a lot of merits, one being that regression models are well understood and reliable.

This method, like any other, fails to deliver on its promise if too much data is missing. But, in our business, an honest dialogue needs to be held with customers about their data and about the tools we need to work our magic. You know what they say (somewhat paraphrased): "Crap in, crap out!"

Until next time, test this!
