Mean Imputer using raw_df or train_input

We did mean imputation by fitting on raw_df. Shouldn't the imputer be fitted only on train_inputs, with test_inputs then imputed using the values learned from train_inputs? Otherwise it can lead to overfitting, because the values in test_inputs also influence the fit.

Can you simplify your query? :slight_smile:

I had the same question too. The imputer was fit on raw_df and then used to transform train_df, val_df and test_df. Shouldn't it be fit on the corresponding df rather than on the whole data? @hemanth
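To make the question concrete, here is a minimal sketch of the "fit on train only" variant. It assumes the notebook uses scikit-learn's `SimpleImputer`; the arrays are made-up toy values, not the course data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data (hypothetical values for illustration only)
train_inputs = np.array([[1.0], [2.0], [np.nan], [3.0]])
test_inputs = np.array([[np.nan], [100.0]])

# Fit on the training rows only: the learned mean is (1 + 2 + 3) / 3 = 2.0
imputer = SimpleImputer(strategy="mean")
imputer.fit(train_inputs)

# Both sets are filled with the mean learned from train_inputs
train_filled = imputer.transform(train_inputs)
test_filled = imputer.transform(test_inputs)

print(imputer.statistics_)  # learned per-column means
print(test_filled)          # the NaN in test is replaced by the train mean
```

Here the value 100.0 in test_inputs has no effect on the fill value, which is the point of fitting on the training data alone.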

I think fitting on the entire dataset (which is what we did in the notebook) will result in “train-test contamination”. Just check the link below for more on train-test contamination.

Here we are discussing the imputer. We fit the imputer on the entire dataset. This only ensures that the nulls are replaced with mean values.

“Train-test contamination” happens when we do not distinguish training data from validation data while training the machine learning model. This is not applicable to the imputer implementation.

Also, the imputer is applicable to numerical columns only.

While the discussion above is about imputation, the same thing happens with scaling.

At 1:27:02 in the video, Aakash did not properly explain why the max for Evaporation is 0.568276 rather than 1.0 like the other columns. The reason is that the scaler was computed on raw_df[numeric_cols], i.e., scaler.fit(raw_df[numeric_cols]), but was then used to scale train_inputs, with train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols]). The largest Evaporation value must therefore sit outside the training rows. If you were to scale raw_df itself, the max for every column, including Evaporation, would be 1.0.
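A small sketch of that effect, assuming the scaler is scikit-learn's `MinMaxScaler`. The column values are invented for illustration and are not the Evaporation data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single column; the largest value (10.0) is in the last row
raw = np.array([[0.0], [2.0], [4.0], [10.0]])
train = raw[:3]  # training rows do not contain the overall maximum

# Fit on the full data, as described for raw_df[numeric_cols]
scaler = MinMaxScaler().fit(raw)

train_scaled = scaler.transform(train)
raw_scaled = scaler.transform(raw)

print(train_scaled.max())  # below 1.0, because 10.0 is not in train
print(raw_scaled.max())    # 1.0 on the data the scaler was fit on
```

With these numbers the training maximum of 4.0 maps to about 0.4, the same kind of sub-1.0 maximum seen for Evaporation.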

In my opinion, the split into training, validation, and test sets should come at the beginning, i.e. before any preprocessing such as imputation or scaling. The imputation and scaling should be fitted on the training and validation sets as a whole, not separately, and without the test set. When testing the model, the test set should go through the same preprocessing steps using the parameters obtained from the training and validation sets.
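The order suggested above could be sketched roughly like this. It is only an illustration under assumptions: random toy data, scikit-learn's `make_pipeline`, and hypothetical variable names such as `X_train_val`; it is not the notebook's code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Random toy data with roughly 10% missing values (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.1] = np.nan

# 1. Split first, before any preprocessing
X_train_val, X_test = train_test_split(X, test_size=0.2, random_state=42)

# 2. Fit imputation and scaling on train + validation only
preprocess = make_pipeline(SimpleImputer(strategy="mean"), MinMaxScaler())
X_train_val_prepared = preprocess.fit_transform(X_train_val)

# 3. Apply the already-fitted parameters to the test set (no refitting)
X_test_prepared = preprocess.transform(X_test)
```

Because the scaler never saw the test rows, values in `X_test_prepared` can fall slightly outside the 0 to 1 range, which is expected with this approach.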