Data Pre-processing in Machine Learning

7 minute read

Data preprocessing is an integral step in Machine Learning, as the quality of the data and the useful information that can be derived from it directly affect our model's ability to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model. The concepts I will cover in this series of articles are:

  1. Handling Null Values
  2. Standardization
  3. Handling Categorical Variables
  4. Discretization
  5. Dimensionality Reduction
  6. Feature Selection
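Several of these steps can eventually be chained together. As a rough sketch of how a few of them might sit in a scikit-learn Pipeline (the ordering and component choices here are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative ordering: impute -> standardize -> reduce dimensionality
preprocess = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),   # 1. handling null values
    ('scale', StandardScaler()),                  # 2. standardization
    ('reduce', PCA(n_components=2)),              # 5. dimensionality reduction
])

# Toy data with one missing entry
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 3.0, 5.0]])
X_t = preprocess.fit_transform(X)   # shape (4, 2), no NaNs left
```

Each stage's output feeds the next, which is why imputation comes first: the scaler and PCA cannot handle NaNs.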

Let’s go through a quick example to get some insight into handling null values!

import numpy as np
import pandas as pd
diabetes_df=pd.read_csv('datasets/diabetes.csv')
diabetes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
diabetes_df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
class CustomImputer:
    """
    Helper class that maps column indices (e.g. the output of
    MissingIndicator.features_) back to their column names.
    """
    def __init__(self, df_columns, input_arr):
        self.df_columns = df_columns
        self.input_arr = input_arr

    def get_column_names(self):
        # Look up the name for each column index
        return [self.df_columns[idx] for idx in self.input_arr]
diabetes_df.describe().transpose()
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 115.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.24375 0.3725 0.62625 2.42
Age 768.0 33.240885 11.760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00

Findings:

  1. This dataset contains various medical measurements of patients screened for diabetes.
  2. Looking at the dataset, it is an ideal candidate for a classification model.
  3. We can use it to predict whether or not a patient has diabetes.
  4. The dataset contains 768 rows and 9 columns.
  5. Apart from the ‘Outcome’ column, a few columns have zero as a numerical value, which is practically impossible: a person’s blood pressure or BMI can’t be zero. This means our dataset has missing values that are represented by zeros, so let’s impute the zeros in these columns with np.nan.

Before imputing missing values, we can check at which positions the missing values are present in our features by using MissingIndicator.

from sklearn.impute import MissingIndicator
indicator = MissingIndicator(missing_values=0)
indicator.fit_transform(diabetes_df)
indicator.features_
array([0, 1, 2, 3, 4, 5, 8], dtype=int32)

Now let’s instantiate an object of the helper class we built above.

df_cols=(diabetes_df.columns).tolist()
cols_with_missing_values=(indicator.features_).tolist()
imputer_obj = CustomImputer(df_cols,cols_with_missing_values)
cols=imputer_obj.get_column_names()
cols
['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'Outcome']
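The helper class above is essentially an index lookup; with pandas, the same mapping can be done in one line. A sketch using the column names and indices from this dataset:

```python
import pandas as pd

df_columns = pd.Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                       'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
features_with_missing = [0, 1, 2, 3, 4, 5, 8]   # as reported by MissingIndicator.features_

# A pandas Index supports positional indexing with a list of integer positions
cols = df_columns[features_with_missing].tolist()
# → ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Outcome']
```

Writing the class is still a useful exercise, but the one-liner is handy when you just need the names.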

Now let’s replace the zeros in these columns with NaN, except for Pregnancies and Outcome, as those two columns can legitimately contain zeros.

cols_to_nan = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df[cols_to_nan] = diabetes_df[cols_to_nan].replace(0, np.nan)

We can build a boolean mask over our data to locate the exact missing data points.

from sklearn.impute import MissingIndicator
indicator=MissingIndicator(missing_values=np.nan)
mask_missing_values=indicator.fit_transform(diabetes_df)
mask_missing_values
array([[False, False, False,  True, False],
       [False, False, False,  True, False],
       [False, False,  True,  True, False],
       ...,
       [False, False, False, False, False],
       [False, False,  True,  True, False],
       [False, False, False,  True, False]])
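Summing such a boolean mask column-wise counts the missing entries per flagged column, which is essentially what `isnull().sum()` reports. A minimal numpy sketch on a toy mask:

```python
import numpy as np

# Toy mask: rows are samples, columns are features flagged by MissingIndicator
mask = np.array([[False, True,  False],
                 [True,  True,  False],
                 [False, False, False]])

missing_per_column = mask.sum(axis=0)   # each True counts as 1
# → array([1, 2, 0])
```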

The output below tells us which columns have missing values, along with their counts.

diabetes_df.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

So now we can see that only these five columns have null values. Next we will impute these columns with different techniques. You can choose any one of them as per your use case, or discuss it with the SME.

1. Using Mode

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="most_frequent")
diabetes_df['Glucose'] = imp.fit_transform(diabetes_df['Glucose'].values.reshape(-1,1))
diabetes_df['Glucose']
0      148.0
1       85.0
2      183.0
3       89.0
4      137.0
       ...  
763    101.0
764    122.0
765    121.0
766    126.0
767     93.0
Name: Glucose, Length: 768, dtype: float64

2. Using Mean

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
imp.fit(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure'] = imp.transform(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure']
0      72.0
1      66.0
2      64.0
3      66.0
4      40.0
       ... 
763    76.0
764    70.0
765    72.0
766    60.0
767    70.0
Name: BloodPressure, Length: 768, dtype: float64

3. Using Median

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="median")
imp.fit(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness'] = imp.transform(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness']
0      35.0
1      29.0
2      29.0
3      23.0
4      35.0
       ... 
763    48.0
764    27.0
765    23.0
766    29.0
767    31.0
Name: SkinThickness, Length: 768, dtype: float64

4. Using a Constant Value

from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="constant",fill_value=22)
imp.fit(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI'] = imp.transform(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI']
0      33.6
1      26.6
2      23.3
3      28.1
4      43.1
       ... 
763    32.9
764    36.8
765    26.2
766    30.1
767    30.4
Name: BMI, Length: 768, dtype: float64
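For reference, all four univariate strategies above have pandas-native equivalents via `fillna`. A sketch on a toy Series (values chosen purely for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 20.0, 10.0, np.nan, 30.0])

by_mode = s.fillna(s.mode()[0])     # most frequent observed value: 10.0
by_mean = s.fillna(s.mean())        # mean of observed values: 17.5
by_median = s.fillna(s.median())    # median of observed values: 15.0
by_constant = s.fillna(22)          # fixed fill value
```

SimpleImputer is still preferable inside a Pipeline, since it learns the statistic on the training set in `fit` and reuses it in `transform`.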

Up to this point we have treated the missing values with univariate imputation. Now let’s use a multivariate imputer for the Insulin column.

5. Multivariate Imputation

With this technique, a value is predicted for each missing entry based on the other features.
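Before applying it to our dataset, here is a minimal sketch of the idea on a toy array with two correlated features (values are purely illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# The second column is roughly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],
              [4.0, 8.0]])

imp = IterativeImputer(random_state=0)
X_filled = imp.fit_transform(X)
# The missing entry is estimated from the correlated first column,
# rather than from the column mean alone
```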

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp=IterativeImputer(max_iter=10000,random_state=0)

Once the imputer object is instantiated, we drop the target column so that the imputation is not biased by the target variable.

diabetes_features=diabetes_df.drop('Outcome',axis=1)
diabetes_features
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148.0 72.0 35.0 NaN 33.6 0.627 50
1 1 85.0 66.0 29.0 NaN 26.6 0.351 31
2 8 183.0 64.0 29.0 NaN 23.3 0.672 32
3 1 89.0 66.0 23.0 94.0 28.1 0.167 21
4 0 137.0 40.0 35.0 168.0 43.1 2.288 33
... ... ... ... ... ... ... ... ...
763 10 101.0 76.0 48.0 180.0 32.9 0.171 63
764 2 122.0 70.0 27.0 NaN 36.8 0.340 27
765 5 121.0 72.0 23.0 112.0 26.2 0.245 30
766 1 126.0 60.0 29.0 NaN 30.1 0.349 47
767 1 93.0 70.0 31.0 NaN 30.4 0.315 23

768 rows × 8 columns

diabetes_label=diabetes_df['Outcome']
diabetes_label
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

Next, fit the imputer on our features dataset and transform it.

imp.fit(diabetes_features)
IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10000, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=0,
                 sample_posterior=False, tol=0.001, verbose=0)
diabetes_features_arr=imp.transform(diabetes_features)
diabetes_features_arr
array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])
diabetes_features=pd.DataFrame(diabetes_features_arr,columns = diabetes_features.columns)
diabetes_features
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0
... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0

768 rows × 8 columns

Now check if we have any missing values left.

diabetes_features.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64

Voilà! We have imputed all the missing data in our dataset.

diabetes_features.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0

Now let’s concatenate our features dataset and label dataset to create the final cleaned dataframe.

cleaned_diabetes_df=pd.concat([diabetes_features,diabetes_label],axis=1)
cleaned_diabetes_df
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6.0 148.0 72.0 35.0 218.937760 33.6 0.627 50.0 1
1 1.0 85.0 66.0 29.0 70.189298 26.6 0.351 31.0 0
2 8.0 183.0 64.0 29.0 269.968908 23.3 0.672 32.0 1
3 1.0 89.0 66.0 23.0 94.000000 28.1 0.167 21.0 0
4 0.0 137.0 40.0 35.0 168.000000 43.1 2.288 33.0 1
... ... ... ... ... ... ... ... ... ...
763 10.0 101.0 76.0 48.0 180.000000 32.9 0.171 63.0 0
764 2.0 122.0 70.0 27.0 158.815881 36.8 0.340 27.0 0
765 5.0 121.0 72.0 23.0 112.000000 26.2 0.245 30.0 0
766 1.0 126.0 60.0 29.0 173.820363 30.1 0.349 47.0 1
767 1.0 93.0 70.0 31.0 87.196731 30.4 0.315 23.0 0

768 rows × 9 columns

cleaned_diabetes_df.to_csv('datasets/diabetes_cleaned', index=False)

You can get the notebook used in this tutorial here and the dataset here.

Thanks for reading!
