Data Pre-processing in Machine Learning
Data preprocessing is an integral step in machine learning: the quality of the data, and the useful information that can be derived from it, directly affects our model's ability to learn. It is therefore extremely important that we preprocess our data before feeding it into the model. The concepts I will cover in this series of articles are:
- Handling Null Values
- Standardization
- Handling Categorical Variables
- Discretization
- Dimensionality Reduction
- Feature Selection
Let’s go through a quick example to get some insight into handling null values!
import numpy as np
import pandas as pd
diabetes_df=pd.read_csv('datasets/diabetes.csv')
diabetes_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
diabetes_df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
class CustomImputer:
    """
    Helper class that maps column indices (e.g. the output of
    MissingIndicator.features_) back to column names.
    """
    def __init__(self, df_columns, input_arr):
        self.df_columns = df_columns
        self.input_arr = input_arr

    def get_column_names(self):
        # Look up each column index and return the matching column names
        column_names = []
        for idx in self.input_arr:
            column_names.append(self.df_columns[idx])
        return column_names
diabetes_df.describe().transpose()
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.000 | 1.00000 | 3.0000 | 6.00000 | 17.00 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.000 | 99.00000 | 117.0000 | 140.25000 | 199.00 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.000 | 62.00000 | 72.0000 | 80.00000 | 122.00 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.000 | 0.00000 | 23.0000 | 32.00000 | 99.00 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.000 | 0.00000 | 30.5000 | 127.25000 | 846.00 |
| BMI | 768.0 | 31.992578 | 7.884160 | 0.000 | 27.30000 | 32.0000 | 36.60000 | 67.10 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.000 | 24.00000 | 29.0000 | 41.00000 | 81.00 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.000 | 0.00000 | 0.0000 | 1.00000 | 1.00 |
Findings:
- This dataset contains various clinical measurements for patients, along with whether each patient has diabetes (the Outcome column).
- Looking at the dataset, we can see it is a natural candidate for a classification model.
- We can use it to predict whether or not a patient has diabetes.
- The dataset contains 768 rows and 9 columns.
- Apart from the ‘Outcome’ column, several columns contain zeros, which is practically impossible: a person’s blood pressure or BMI can’t be zero. This means our dataset has missing values that are represented by zeros, so let’s replace the zeros in these columns with np.nan.
Before imputing missing values, we can check at which positions missing values are present in our features by using MissingIndicator.
from sklearn.impute import MissingIndicator
indicator = MissingIndicator(missing_values=0)
indicator.fit_transform(diabetes_df)
indicator.features_
array([0, 1, 2, 3, 4, 5, 8], dtype=int32)
Now let’s instantiate an object from the helper class we have built above.
df_cols=(diabetes_df.columns).tolist()
cols_with_missing_values=(indicator.features_).tolist()
imputer_obj = CustomImputer(df_cols,cols_with_missing_values)
cols=imputer_obj.get_column_names()
cols
['Pregnancies',
'Glucose',
'BloodPressure',
'SkinThickness',
'Insulin',
'BMI',
'Outcome']
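As a side note, the same index-to-name lookup can be done without the helper class, either with a list comprehension or by indexing the columns directly; a minimal sketch that should produce the same list (using df_cols and indicator.features_ from above):
# Equivalent one-liners, no helper class needed
cols_alt = [df_cols[i] for i in indicator.features_]
# or, using the pandas Index directly
cols_alt = diabetes_df.columns[indicator.features_].tolist()
cols_alt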
Now let’s replace the zeros in these columns with NaN, except in Pregnancies and Outcome, since those two columns can legitimately contain zeros.
diabetes_df['Glucose'].replace(0,np.nan,inplace=True)
diabetes_df['BloodPressure'].replace(0,np.nan,inplace=True)
diabetes_df['SkinThickness'].replace(0,np.nan,inplace=True)
diabetes_df['Insulin'].replace(0,np.nan,inplace=True)
diabetes_df['BMI'].replace(0,np.nan,inplace=True)
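The same replacement can also be written in one step by passing the list of affected columns to DataFrame.replace; a minimal, equivalent sketch:
# Equivalent one-step version of the five replace() calls above
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df[zero_as_missing] = diabetes_df[zero_as_missing].replace(0, np.nan)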
We can build a boolean mask over our data to locate the exact positions of the missing values.
from sklearn.impute import MissingIndicator
indicator=MissingIndicator(missing_values=np.nan)
mask_missing_values=indicator.fit_transform(diabetes_df)
mask_missing_values
array([[False, False, False, True, False],
[False, False, False, True, False],
[False, False, True, True, False],
...,
[False, False, False, False, False],
[False, False, True, True, False],
[False, False, False, True, False]])
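The mask only covers the columns that actually contain missing values (those listed in indicator.features_). If you want the exact (row, column) positions rather than a boolean array, you can combine the two; a small sketch:
# Translate the boolean mask into (row index, column name) pairs
masked_cols = diabetes_df.columns[indicator.features_]
rows, col_pos = np.where(mask_missing_values)
missing_positions = list(zip(rows, masked_cols[col_pos]))
missing_positions[:5]   # first few missing cells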
The output below tells us which columns have missing values, along with their counts.
diabetes_df.isnull().sum()
Pregnancies 0
Glucose 5
BloodPressure 35
SkinThickness 227
Insulin 374
BMI 11
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
So now we can see that only those five columns have null values. Next we will impute these columns with different techniques. You can choose whichever one fits your use case, or discuss the choice with a subject-matter expert (SME).
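Before going column by column, note that the per-column strategies shown below can also be bundled into a single ColumnTransformer and applied in one fit/transform. A minimal sketch of that pattern (the strategy-to-column assignments here are only illustrative, not recommendations):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# One imputation strategy per group of columns, applied in a single step.
# Note: the transformed columns come first in the output array, followed by
# the passthrough columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('mode',   SimpleImputer(strategy='most_frequent'), ['Glucose']),
        ('mean',   SimpleImputer(strategy='mean'),          ['BloodPressure']),
        ('median', SimpleImputer(strategy='median'),        ['SkinThickness', 'BMI']),
    ],
    remainder='passthrough')
imputed_arr = preprocessor.fit_transform(diabetes_df)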
1. Using Mode
from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="most_frequent")
diabetes_df['Glucose'] = imp.fit_transform(diabetes_df['Glucose'].values.reshape(-1,1))
diabetes_df['Glucose']
0 148.0
1 85.0
2 183.0
3 89.0
4 137.0
...
763 101.0
764 122.0
765 121.0
766 126.0
767 93.0
Name: Glucose, Length: 768, dtype: float64
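For a single pandas column, a roughly equivalent one-liner is fillna with the column mode; shown here only as an alternative to the SimpleImputer call above:
# pandas equivalent: fill NaN with the most frequent value of the column
diabetes_df['Glucose'] = diabetes_df['Glucose'].fillna(diabetes_df['Glucose'].mode()[0])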
2. Using Mean
from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="mean")
imp.fit(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure'] = imp.transform(diabetes_df['BloodPressure'].values.reshape(-1,1))
diabetes_df['BloodPressure']
0 72.0
1 66.0
2 64.0
3 66.0
4 40.0
...
763 76.0
764 70.0
765 72.0
766 60.0
767 70.0
Name: BloodPressure, Length: 768, dtype: float64
3. Using Median
from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="median")
imp.fit(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness'] = imp.transform(diabetes_df['SkinThickness'].values.reshape(-1,1))
diabetes_df['SkinThickness']
0 35.0
1 29.0
2 29.0
3 23.0
4 35.0
...
763 48.0
764 27.0
765 23.0
766 29.0
767 31.0
Name: SkinThickness, Length: 768, dtype: float64
4. Using a Constant Value
from sklearn.impute import SimpleImputer
imp=SimpleImputer(missing_values=np.nan,strategy="constant",fill_value=22)
imp.fit(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI'] = imp.transform(diabetes_df['BMI'].values.reshape(-1,1))
diabetes_df['BMI']
0 33.6
1 26.6
2 23.3
3 28.1
4 43.1
...
763 32.9
764 36.8
765 26.2
766 30.1
767 30.4
Name: BMI, Length: 768, dtype: float64
Up to this point we have treated the missing values with univariate imputation. Now let’s use a multivariate imputer for the Insulin column.
5. Multivariate Imputation
With this technique, each missing value is predicted from the values of the other features.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp=IterativeImputer(max_iter=10000,random_state=0)
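IterativeImputer models each feature that has missing values as a function of the remaining features and cycles through them in round-robin fashion; by default it fits a BayesianRidge regressor for each feature. You can also plug in a different estimator; a purely illustrative sketch (the default estimator is what the rest of this article uses):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (already imported above)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Same imputer, but with an explicit estimator instead of the default BayesianRidge
imp_rf = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                          max_iter=10, random_state=0)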
Once the imputer object is instantiated, we drop the target column so that the imputation is not biased by the target variable.
diabetes_features=diabetes_df.drop('Outcome',axis=1)
diabetes_features
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | NaN | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76.0 | 48.0 | 180.0 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122.0 | 70.0 | 27.0 | NaN | 36.8 | 0.340 | 27 |
| 765 | 5 | 121.0 | 72.0 | 23.0 | 112.0 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126.0 | 60.0 | 29.0 | NaN | 30.1 | 0.349 | 47 |
| 767 | 1 | 93.0 | 70.0 | 31.0 | NaN | 30.4 | 0.315 | 23 |
768 rows × 8 columns
diabetes_label=diabetes_df['Outcome']
diabetes_label
0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64
Next, we fit the imputer on our features and then transform them.
imp.fit(diabetes_features)
IterativeImputer(add_indicator=False, estimator=None,
imputation_order='ascending', initial_strategy='mean',
max_iter=10000, max_value=None, min_value=None,
missing_values=nan, n_nearest_features=None, random_state=0,
sample_posterior=False, tol=0.001, verbose=0)
diabetes_features_arr=imp.transform(diabetes_features)
diabetes_features_arr
array([[ 6. , 148. , 72. , ..., 33.6 , 0.627, 50. ],
[ 1. , 85. , 66. , ..., 26.6 , 0.351, 31. ],
[ 8. , 183. , 64. , ..., 23.3 , 0.672, 32. ],
...,
[ 5. , 121. , 72. , ..., 26.2 , 0.245, 30. ],
[ 1. , 126. , 60. , ..., 30.1 , 0.349, 47. ],
[ 1. , 93. , 70. , ..., 30.4 , 0.315, 23. ]])
diabetes_features=pd.DataFrame(diabetes_features_arr,columns = diabetes_features.columns)
diabetes_features
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 218.937760 | 33.6 | 0.627 | 50.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.189298 | 26.6 | 0.351 | 31.0 |
| 2 | 8.0 | 183.0 | 64.0 | 29.0 | 269.968908 | 23.3 | 0.672 | 32.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.000000 | 28.1 | 0.167 | 21.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.000000 | 43.1 | 2.288 | 33.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10.0 | 101.0 | 76.0 | 48.0 | 180.000000 | 32.9 | 0.171 | 63.0 |
| 764 | 2.0 | 122.0 | 70.0 | 27.0 | 158.815881 | 36.8 | 0.340 | 27.0 |
| 765 | 5.0 | 121.0 | 72.0 | 23.0 | 112.000000 | 26.2 | 0.245 | 30.0 |
| 766 | 1.0 | 126.0 | 60.0 | 29.0 | 173.820363 | 30.1 | 0.349 | 47.0 |
| 767 | 1.0 | 93.0 | 70.0 | 31.0 | 87.196731 | 30.4 | 0.315 | 23.0 |
768 rows × 8 columns
Now check if we have any missing values left.
diabetes_features.isnull().sum()
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
dtype: int64
Voila!! We have imputed all the missing data in our dataset.
diabetes_features.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 218.937760 | 33.6 | 0.627 | 50.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.189298 | 26.6 | 0.351 | 31.0 |
| 2 | 8.0 | 183.0 | 64.0 | 29.0 | 269.968908 | 23.3 | 0.672 | 32.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.000000 | 28.1 | 0.167 | 21.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.000000 | 43.1 | 2.288 | 33.0 |
Now let’s concatenate our features and labels to create the final cleaned dataset.
cleaned_diabetes_df=pd.concat([diabetes_features,diabetes_label],axis=1)
cleaned_diabetes_df
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 218.937760 | 33.6 | 0.627 | 50.0 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 70.189298 | 26.6 | 0.351 | 31.0 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 29.0 | 269.968908 | 23.3 | 0.672 | 32.0 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.000000 | 28.1 | 0.167 | 21.0 | 0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.000000 | 43.1 | 2.288 | 33.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10.0 | 101.0 | 76.0 | 48.0 | 180.000000 | 32.9 | 0.171 | 63.0 | 0 |
| 764 | 2.0 | 122.0 | 70.0 | 27.0 | 158.815881 | 36.8 | 0.340 | 27.0 | 0 |
| 765 | 5.0 | 121.0 | 72.0 | 23.0 | 112.000000 | 26.2 | 0.245 | 30.0 | 0 |
| 766 | 1.0 | 126.0 | 60.0 | 29.0 | 173.820363 | 30.1 | 0.349 | 47.0 | 1 |
| 767 | 1.0 | 93.0 | 70.0 | 31.0 | 87.196731 | 30.4 | 0.315 | 23.0 | 0 |
768 rows × 9 columns
cleaned_diabetes_df.to_csv('datasets/diabetes_cleaned.csv', index=False)
You can find the notebook used in this tutorial here, and the dataset here.
Thanks for reading!