Data Pre-processing in Machine Learning

7 minute read

Data preprocessing is an integral step in Machine Learning, as the quality of the data and the useful information that can be derived from it directly affect our model's ability to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model. The concepts I will cover in this series of articles are:

  1. Handling Null Values
  2. Standardization
  3. Handling Categorical Variables
  4. Discretization
  5. Dimensionality Reduction
  6. Feature Selection
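Several of these steps can eventually be chained together. As a rough sketch of how a few of them might sit in a scikit-learn Pipeline (the ordering and component choices here are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative ordering: impute -> standardize -> reduce dimensionality
preprocess = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),   # 1. handling null values
    ('scale', StandardScaler()),                  # 2. standardization
    ('reduce', PCA(n_components=2)),              # 5. dimensionality reduction
])

# Toy data with one missing entry
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 5.0]])
X_t = preprocess.fit_transform(X)   # shape (4, 2), no NaNs left
```

Each stage's output feeds the next, which is why imputation comes first: the scaler and PCA cannot handle NaNs.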

Let’s go through a quick example to get some insight into handling null values!

import numpy as np
import pandas as pd
diabetes_df=pd.read_csv('datasets/diabetes.csv')
diabetes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
diabetes_df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
class CustomImputer:
    """
    Helper class that maps column indices (e.g. the output of
    MissingIndicator.features_) back to their column names.
    """
    def __init__(self, df_columns, input_arr):
        self.df_columns = df_columns
        self.input_arr = input_arr

    def get_column_names(self):
        # Look up the name for each column index
        return [self.df_columns[idx] for idx in self.input_arr]
diabetes_df.describe().transpose()
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 115.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.24375 0.3725 0.62625 2.42
Age 768.0 33.240885 11.760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00

Findings:

  1. This dataset contains various medical measurements of patients screened for diabetes.
  2. Looking at the dataset, it is an ideal candidate for a classification model.
  3. We can use it to predict whether or not a patient has diabetes.
  4. The dataset contains 768 rows and 9 columns.
  5. Apart from the ‘Outcome’ column, a few columns have zero as a numerical value, which is practically impossible: a person’s blood pressure or BMI can’t be zero. This means our dataset has missing values that are represented by zeros, so let’s impute the zeros in these columns with np.nan.

Before imputing missing values, we can check at which positions the missing values are present in our features by using MissingIndicator.

from sklearn.impute import MissingIndicator
indicator = MissingIndicator(missing_values=0)
indicator.fit_transform(diabetes_df)
indicator.features_
array([0, 1, 2, 3, 4, 5, 8], dtype=int32)

Now let’s instantiate an object of the helper class we built above.

df_cols=(diabetes_df.columns).tolist()
cols_with_missing_values=(indicator.features_).tolist()
imputer_obj = CustomImputer(df_cols,cols_with_missing_values)
cols=imputer_obj.get_column_names()
cols
['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'Outcome']
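The helper class above is essentially an index lookup; with pandas, the same mapping can be done in one line. A sketch using the column names and indices from this dataset:

```python
import pandas as pd

df_columns = pd.Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                       'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
features_with_missing = [0, 1, 2, 3, 4, 5, 8]   # as reported by MissingIndicator.features_

# A pandas Index supports positional indexing with a list of integer positions
cols = df_columns[features_with_missing].tolist()
# → ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Outcome']
```

Writing the class is still a useful exercise, but the one-liner is handy when you just need the names.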

Now let’s replace the zeros in these columns with NaN, except for Pregnancies and Outcome, as those two columns can legitimately contain zeros.

cols_to_nan = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df[cols_to_nan] = diabetes_df[cols_to_nan].replace(0, np.nan)

We can build a boolean mask over our data to locate the exact missing data points.

from sklearn.impute import MissingIndicator
indicator=MissingIndicator(missing_values=np.nan)
mask_missing_values=indicator.fit_transform(diabetes_df)
mask_missing_values
array([[False, False, False,  True, False],
       [False, False, False,  True, False],
       [False, False,  True,  True, False],
       ...,
       [False, False, False, False, False],
       [False, False,  True,  True, False],
       [False, False, False,  True, False]])
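Summing such a boolean mask column-wise counts the missing entries per flagged column, which is essentially what `isnull().sum()` reports. A minimal numpy sketch on a toy mask:

```python
import numpy as np

# Toy mask: rows are samples, columns are features flagged by MissingIndicator
mask = np.array([[False, True,  False],
                 [True,  True,  False],
                 [False, False, False]])

missing_per_column = mask.sum(axis=0)   # each True counts as 1
# → array([1, 2, 0])
```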

The output below tells us which columns have missing values, along with their counts.

diabetes_df.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

So now we can see that only these five columns have null values. Next we will impute these columns with different techniques. You can choose any one of them as per your use case, or discuss it with the SME.

1. Using Mode

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="most_frequent")
diabetes_df['Glucose'] = imp.fit_transform(diabetes_df['Glucose'].values.reshape(-1,1))
diabetes_df['Glucose']
0      148.0
1       85.0
2      183.0
3       89.0
4      137.0
       ...  
763    101.0
764    122.0
765    121.0
766    126.0
767     93.0
Name: Glucose, Length: 768, dtype: float64

2. Using Mean

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
imp.fit(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure'] = imp.transform(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure']
0      72.0
1      66.0
2      64.0
3      66.0
4      40.0
       ... 
763    76.0
764    70.0
765    72.0
766    60.0
767    70.0
Name: BloodPressure, Length: 768, dtype: float64

3. Using Median

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="median")
imp.fit(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness'] = imp.transform(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness']
0      35.0
1      29.0
2      29.0
3      23.0
4      35.0
       ... 
763    48.0
764    27.0
765    23.0
766    29.0
767    31.0
Name: SkinThickness, Length: 768, dtype: float64

4. Using a Constant Value

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="constant",fill_value=22)
imp.fit(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI'] = imp.transform(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI']
0      33.6
1      26.6
2      23.3
3      28.1
4      43.1
       ... 
763    32.9
764    36.8
765    26.2
766    30.1
767    30.4
Name: BMI, Length: 768, dtype: float64
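For reference, all four univariate strategies above have pandas-native equivalents via `fillna`. A sketch on a toy Series (values chosen purely for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 20.0, 10.0, np.nan, 30.0])

by_mode = s.fillna(s.mode()[0])     # most frequent observed value: 10.0
by_mean = s.fillna(s.mean())        # mean of observed values: 17.5
by_median = s.fillna(s.median())    # median of observed values: 15.0
by_constant = s.fillna(22)          # fixed fill value
```

SimpleImputer is still preferable inside a Pipeline, since it learns the statistic on the training set in `fit` and reuses it in `transform`.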

Up to this point we have treated the missing values with univariate imputation. Now let’s use a multivariate imputer for the Insulin column.

5. Multivariate Imputation

With this technique, a value is predicted for each missing entry based on the other features.
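Before applying it to our dataset, here is a minimal sketch of the idea on a toy array with two correlated features (values are purely illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The second column is roughly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
# The missing entry is estimated from the correlated first column,
# rather than from the column mean alone
```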

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp=IterativeImputer(max_iter=10000,random_state=0)

Once the imputer object is instantiated, we drop the target column so that the imputation is not biased by the target variable.

diabetes_features=diabetes_df.drop('Outcome',axis=1)
diabetes_features
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148.0 72.0 35.0 NaN 33.6 0.627 50
1 1 85.0 66.0 29.0 NaN 26.6 0.351 31
2 8 183.0 64.0 29.0 NaN 23.3 0.672 32
3 1 89.0 66.0 23.0 94.0 28.1 0.167 21
4 0 137.0 40.0 35.0 168.0 43.1 2.288 33
... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.0 180.0 32.9 0.171 63
764 2 122.0 70.0 27.0 NaN 36.8 0.340 27
765 5 121.0 72.0 23.0 112.0 26.2 0.245 30
766 1 126.0 60.0 29.0 NaN 30.1 0.349 47
767 1 93.0 70.0 31.0 NaN 30.4 0.315 23

768 rows × 8 columns

diabetes_label=diabetes_df['Outcome']
diabetes_label
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Next, fit the imputer on our features dataset and transform it.

imp.fit(diabetes_features)
IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10000, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, tol=0.001, verbose=0)
diabetes_features_arr=imp.transform(diabetes_features)
diabetes_features_arr
array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])
diabetes_features=pd.DataFrame(diabetes_features_arr,columns = diabetes_features.columns)
diabetes_features
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0
... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0

768 rows × 8 columns

Now check if we have any missing values left.

diabetes_features.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64

Voilà! We have imputed all the missing data in our dataset.

diabetes_features.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0

Now let’s concatenate our features dataset and label dataset to create the final cleaned dataframe.

cleaned_diabetes_df=pd.concat([diabetes_features,diabetes_label],axis=1)
cleaned_diabetes_df
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
... ... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0 0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0 0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0 0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0 1
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0 0

768 rows × 9 columns

cleaned_diabetes_df.to_csv('datasets/diabetes_cleaned', index=False)

You can get the notebook used in this tutorial here and the dataset here.

Thanks for reading!
