Dealing with missing values

Author

Alape Aniruddha

Introduction

Dealing with missing data is often the first step in data preprocessing. However, not all missing data are the same, which requires us to develop a good understanding of the types of missingness. Missingness is primarily classified into missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Bias

When data is missing, the remaining data may not accurately reflect the true population. The type of missingness, and the way we deal with it, determines how much bias is introduced into our results.

For example, in a survey recording the income of individuals, those with low income may choose not to respond. Using the average income to impute the missing values can then lead to bias, because the average of the available data may not represent that of the population.

Missing Completely At Random (MCAR)

In situations characterized by Missing Completely At Random (MCAR), the absence of data occurs randomly and is unrelated to any variable in the dataset or to the missing values themselves. The probability of missingness is the same for all observations, there is no underlying pattern to the missingness, and the missing values are completely independent of the other data. Statistical analyses performed on such a dataset remain unbiased.

For example, if during data collection some responses were not recorded due to a technical error, then the data is missing completely at random.
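To make this concrete, here is a minimal sketch of how MCAR missingness can be simulated, using an entirely synthetic dataset with hypothetical column names: every value has the same chance of being masked, independent of everything else.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
survey = pd.DataFrame({"income": rng.normal(50_000, 10_000, 1_000),
                       "age": rng.integers(18, 80, 1_000)})

# MCAR: every row has the same 10% chance of losing its income value,
# independent of income, age, or anything else in the data.
mcar_mask = rng.random(len(survey)) < 0.10
survey.loc[mcar_mask, "income"] = np.nan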

Missing At Random (MAR)

In instances of Missing At Random (MAR), the absence of data can be entirely explained by the values of other observed variables in the dataset. There is therefore some form of pattern in the missing values.

For example, in a survey, women might be unwilling to disclose their age. Here the missingness of the variable age can be explained by another observed variable, “gender”.
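As an illustration (again with synthetic data and a hypothetical gender column), MAR missingness can be simulated by letting the probability that age is missing depend only on the observed gender variable, not on the age value itself.

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
survey = pd.DataFrame({"gender": rng.choice(["female", "male"], 1_000),
                       "age": rng.integers(18, 80, 1_000).astype(float)})

# MAR: age is more likely to be missing for women, so the missingness is
# fully explained by the observed variable "gender".
p_missing = np.where(survey["gender"] == "female", 0.30, 0.05)
survey.loc[rng.random(len(survey)) < p_missing, "age"] = np.nan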

Missing Not At Random (MNAR)

In this case, the missingness is neither MCAR nor MAR: whether a data point is missing depends on the value of that data point. To correct for the resulting bias, we would have to make modelling assumptions about the nature of the missingness.

For example, in a social survey where individuals are asked about their income, respondents may not disclose it if it is too high or too low. Thus the missingness in the feature income is directly related to the values of that feature.

For example, if a patient’s measurement was not taken because the doctor felt the patient was too sick, that observation is neither MAR nor MCAR. In this case the missing-data mechanism causes the observed training data to give a distorted picture of the true population, and data imputation is dangerous.
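A sketch of MNAR missingness for the income example above (synthetic values): the probability of non-response depends on the income value itself, which is exactly what makes this mechanism hard to correct for.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
income = pd.Series(rng.lognormal(mean=10.8, sigma=0.5, size=1_000), name="income")

# MNAR: very low and very high earners are more likely to withhold their income,
# so the missingness depends on the (unobserved) value itself.
low, high = income.quantile(0.10), income.quantile(0.90)
p_missing = np.where((income < low) | (income > high), 0.50, 0.05)
income_observed = income.mask(rng.random(len(income)) < p_missing)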

Identifying the type of missingness

  • Understanding the data collection process, the features involved, and the research domain can help identify possible patterns as to why data is missing
  • Graphical representations of the missing data, such as heatmaps, can help identify relationships between features that can be used to make better imputations of the missing data (see the sketch after this list)
  • We can impute the data using different techniques while making different assumptions about the nature of the missingness, and then analyze the results to see which assumptions lead to consistent results
  • For categorical features, we can code missing as an additional class of the feature. We can then fit our model on the training data and see if the class “Missing” is predictive of the response (label)
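A minimal sketch of the graphical approach from the list above, assuming df is a pandas DataFrame containing NaN values (seaborn and matplotlib are assumed to be available):

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the boolean missingness matrix: each highlighted cell marks a
# missing value, and visible structure hints at relationships between features.
sns.heatmap(df.isna(), cbar=False)
plt.show()

# Correlation between the missingness indicators of the columns: a high value
# for a pair of columns suggests they tend to be missing together
# (evidence against MCAR).
print(df.isna().astype(int).corr())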

Dealing with missing values

If we assume that the features are missing completely at random, there are three approaches we can follow:

  • Discard the observations with any missing values (a short sketch follows this list)
  • Choose an algorithm that inherently deals with missing values
  • Impute the missing values before training
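For the first approach, a one-line sketch with pandas, assuming the data is held in a DataFrame df:

# Drop every row that contains at least one missing value. This is only
# defensible under MCAR, and can discard a lot of data if many rows are affected.
df_complete = df.dropna(axis=0, how="any")
print(f"Kept {len(df_complete)} of {len(df)} rows")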

Missing Value Imputation

In many cases it might not be feasible to discard observations that contain missing values, either because only a small number of samples is available or because each observation is important. In such cases we can use imputation, which replaces the missing values with predicted values. We will look at two simple imputation techniques that can be implemented using the sklearn package.

  • Simple Imputer
  • K Nearest neighbours Imputer

Simple Imputer

Simple imputation is a univariate imputation technique where the missing values of a feature are imputed using the non-missing values of the same feature. It offers a basic approach to filling in missing data: we replace the missing data with a constant value or with a statistic such as the mean, mode, or median of the available values.

The technique is useful for its simplicity and can serve as a reference baseline. It is worth noting, however, that it can lead to biased or unrealistic results.

Simple Imputer - Mean

import numpy as np
import pandas as pd

Consider the following matrix, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & nan & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}

We can fill the missing value using the mean strategy. The mean of the available values of that feature is \frac{1+1+2+4}{4} = 2

The updated matrix would be, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & 2 & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}

data = {"feature_1":[1,2,3,4,5],
        "feature_2":[1,1,np.nan,2,4],
        "feature_3":[3,3,3,4,5]}
df = pd.DataFrame(data)
df
feature_1 feature_2 feature_3
0 1 1.0 3
1 2 1.0 3
2 3 NaN 3
3 4 2.0 4
4 5 4.0 5
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)
imputed_df = pd.DataFrame(imputed_data, columns=['feature_1','feature_2','feature_3'])
imputed_df
feature_1 feature_2 feature_3
0 1.0 1.0 3.0
1 2.0 1.0 3.0
2 3.0 2.0 3.0
3 4.0 2.0 4.0
4 5.0 4.0 5.0

Simple Imputer - Mode

Consider the following matrix, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & nan & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}

We can fill the missing value using the mode (most frequent) strategy. The value that appears most often in that feature is 1.

The updated matrix would be, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & 1 & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}

mode_imputer = SimpleImputer(strategy='most_frequent')
imputed_data = mode_imputer.fit_transform(df)
mode_imputed_df = pd.DataFrame(imputed_data, columns=['feature_1','feature_2','feature_3'])
mode_imputed_df
feature_1 feature_2 feature_3
0 1.0 1.0 3.0
1 2.0 1.0 3.0
2 3.0 1.0 3.0
3 4.0 2.0 4.0
4 5.0 4.0 5.0
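SimpleImputer also supports the median and constant strategies mentioned earlier; as a brief sketch on the same df:

median_imputer = SimpleImputer(strategy='median')
median_imputer.fit_transform(df)      # fills the missing value of feature_2 with its median, 1.5

constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
constant_imputer.fit_transform(df)    # fills every missing entry with the constant 0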

K Nearest Neighbours

This technique operates on the principle of finding the K most similar data points to the one with missing values and using their known values to estimate and impute the missing values.

KNN imputation is especially useful when there is a pattern or structure in the data that can be captured by considering the relationships between neighboring data points. It’s a flexible method that can be applied to various types of data, but the choice of K and the distance metric are critical parameters that can impact imputation accuracy.

Consider the following matrix, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & nan & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}
We can fill the missing value using the KNN Imputer with k = 2. First, we need to find the distance between the entry (row) containing the missing value and the other entries (rows).

The distance of \begin{bmatrix} 3 & nan & 3 \end{bmatrix} from \begin{bmatrix} 1 & 1 & 3 \end{bmatrix} is given by
\displaystyle \sqrt{\frac{3}{2}\left((3-1)^{2} +(3-3)^{2}\right)} = \sqrt{6},
where the factor \frac{3}{2} rescales the squared distance, computed over the 2 coordinates present in both rows, to the full 3 coordinates.
In this example the two nearest neighbors based on this distance are the points \begin{bmatrix} 2 & 1 & 3 \end{bmatrix} and \begin{bmatrix} 4 & 2 & 4 \end{bmatrix}. Now, to fill the missing value, we take the average of the values of the feature from the 2 nearest neighbors identified above.


\begin{align*} \frac{\text{Sum of values of the feature from the 2 nearest neighbors}}{\text{Number of values}} & =\frac{1+2}{2} \ =\ 1.5 \end{align*}


Thus the updated matrix would be, A=\begin{bmatrix} 1 & 1 & 3\\ 2 & 1 & 3\\ 3 & 1.5 & 3\\ 4 & 2 & 4\\ 5 & 4 & 5 \end{bmatrix}

from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
knn_imputed_data = knn_imputer.fit_transform(df)
knn_imputed_df = pd.DataFrame(knn_imputed_data, columns=['feature_1','feature_2','feature_3'])
knn_imputed_df
feature_1 feature_2 feature_3
0 1.0 1.0 3.0
1 2.0 1.0 3.0
2 3.0 1.5 3.0
3 4.0 2.0 4.0
4 5.0 4.0 5.0
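The distances used in the worked example can be verified with scikit-learn's nan_euclidean_distances, which applies the same rescaling for coordinates that are missing:

from sklearn.metrics.pairwise import nan_euclidean_distances

# Distances from the row with the missing value (index 2) to every row of df;
# the entry for index 0 is sqrt(6) ≈ 2.449, matching the calculation above.
print(nan_euclidean_distances(df.iloc[[2]], df))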

Exercises

Use the links provided below to access the datasets. Treat each dataset as an unsupervised learning problem, with the only “constraint” being that your findings/results should pertain to missingness. Share your findings/results in an ipynb file.

  1. Link to dataset 1: https://drive.google.com/file/d/1Fk5V6GNNfdgm8qB4sA1EinB9gQhNdg4Z/view?usp=drive_link

  2. Link to dataset 2: https://drive.google.com/file/d/15HYh3p6SZjic_Ytef84b1vsgYcWoqh4L/view?usp=drive_link

Additional Info

Dealing with missingness in Tree-Based Methods

For tree-based models we can try a couple of different approaches.

Approach 1: The first is applicable to categorical predictors: we simply make a new category for “missing.” From this we might discover that observations with missing values for some measurement behave differently than those with nonmissing values.
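A minimal sketch of Approach 1 with pandas (hypothetical column name): the missing values of a categorical predictor are recoded as their own category before the tree is fit.

import pandas as pd

X = pd.DataFrame({"colour": ["red", "blue", None, "red", None]})

# Recode missing values as an explicit "Missing" category; a tree can then
# split on this category, revealing whether missingness itself is predictive.
X["colour"] = X["colour"].fillna("Missing")
X_encoded = pd.get_dummies(X, columns=["colour"])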

Approach 2: The second more general approach is the construction of surrogate variables. Surrogate splits use alternative variables when the primary variable has missing values. The idea is to leverage the correlation or similarity between variables, allowing them to substitute for each other in the splitting process. This enables the inclusion of records with missing values in the training data, ensuring they are split based on the best available variable. The effectiveness of this technique is dependent on the quality and availability of surrogate variables.

Missingness Indicator

The idea of a missingness indicator is to use a binary indicator that flags the missing values in your dataset. This is useful when we want the ML model to capture information about the missingness itself.
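In scikit-learn this can be done with MissingIndicator, or by passing add_indicator=True to an imputer; a brief sketch on the df used above:

from sklearn.impute import SimpleImputer, MissingIndicator

# add_indicator=True appends one binary column per feature that contained
# missing values, so the model can also learn from the missingness pattern.
imputer = SimpleImputer(strategy='mean', add_indicator=True)
print(imputer.fit_transform(df))

# MissingIndicator on its own returns only the binary missingness mask.
print(MissingIndicator().fit_transform(df))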

Estimators that handle NaN values

CART, MARS, PRIM, GBM
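In scikit-learn, the histogram-based gradient boosting estimators are one example of models that accept NaN inputs natively. A minimal sketch reusing the toy df from above, with feature_3 treated as a stand-in target purely for illustration:

from sklearn.ensemble import HistGradientBoostingRegressor

# At each split, samples with missing values are routed to whichever child
# yields the larger gain, so no prior imputation is required.
model = HistGradientBoostingRegressor()
model.fit(df[['feature_1', 'feature_2']], df['feature_3'])
print(model.predict(df[['feature_1', 'feature_2']]))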