Weights of Features

I scaled the numerical columns (age, BMI, children) and followed the code to combine the scaled data with the encoded categorical columns to create and train the linear regression model. However, when I display the list of features and their weights, I only see the correct weights applied for the scaled columns. What could be the issue here for the weights of the categorical columns?

Welcome to the community @justina.
Can you elaborate on your question with some images/examples?

Hi @birajde, I am comparing the weights of features from encoded categorical fields vs. scaled numeric fields + encoded categorical fields.

I scaled the numeric fields and combined them with the encoded categorical fields before training the model.

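    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    numeric_cols = ['age', 'bmi', 'children']
    scaler = StandardScaler()
    # fit the scaler first so it learns each column's mean and standard deviation
    scaler.fit(medical_df[numeric_cols])

    # transform to scaled values
    scaled_inputs = scaler.transform(medical_df[numeric_cols])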

    #combine with categorical fields
    cat_cols = ['smoker_code', 'sex_code', 'northeast', 'northwest', 'southeast', 'southwest']
    categorical_data = medical_df[cat_cols].values

    inputs = np.concatenate((scaled_inputs, categorical_data), axis=1)
    targets = medical_df.charges

    # Create and train the model
    scaled_model = LinearRegression().fit(inputs, targets)

    # Generate predictions
    predictions = scaled_model.predict(inputs)

    # Compute loss to evaluate the model
    loss = rmse(targets, predictions)

    # List each feature next to its weight; the last row (labelled 1) is the intercept
    scaled_weights_df = pd.DataFrame({
        'feature': np.append(numeric_cols + cat_cols, 1),
        'weight': np.append(scaled_model.coef_, scaled_model.intercept_)
    })
    scaled_weights_df.sort_values('weight', ascending=False)

These are the weights for each feature. The left side shows the features and weights from medical_df after encoding the region column. The right side shows the results for the code above, which includes both the scaled and the encoded fields.

The weights for the categorical columns are the same as in the left image. I saw in the video that the weights for the region including scaled and encoded fields were all the same and can’t figure out why that is not the case here.

So you are basically asking why the weights of the categorical columns do not change while the weights of the numerical columns change after scaling the data?
To answer this, we need to know why we scale the data.
When one feature/column has values that go up to, say, 60k, its weight is not comparable with the weights of the other columns, and a column with a large range can dominate the optimization. So we scale the numeric data, here with StandardScaler to mean 0 and standard deviation 1 (not to a fixed (0, 1) range), so that the weights do not depend on the range of any one feature. The categorical columns, after one-hot encoding, already hold only the values 0 and 1, so they are left unscaled and their weights stay the same, whereas the weights of the scaled columns change because those columns are now measured in units of standard deviations.
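To see this concretely, here is a minimal sketch on synthetic data (the `age`, `smoker` and `charges` arrays below are made up for illustration and are not the medical dataset). It fits one model on the raw numeric column and one on the standardized column, then compares the weights. Assuming an ordinary least-squares `LinearRegression` with an intercept, the weight of the 0/1 column should come out the same in both models, and the numeric weight should equal the raw weight times the column's standard deviation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the medical dataset)
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 65, size=n)       # numeric feature with a large range
smoker = rng.integers(0, 2, size=n)     # binary 0/1 categorical feature
charges = 250 * age + 20000 * smoker + rng.normal(0, 1000, size=n)

# Model 1: raw numeric column + binary column
raw_inputs = np.column_stack([age, smoker])
raw_model = LinearRegression().fit(raw_inputs, charges)

# Model 2: standardized numeric column + the same binary column
scaler = StandardScaler().fit(age.reshape(-1, 1))
scaled_age = scaler.transform(age.reshape(-1, 1))
scaled_inputs = np.column_stack([scaled_age, smoker])
scaled_model = LinearRegression().fit(scaled_inputs, charges)

print('raw weights   :', raw_model.coef_, raw_model.intercept_)
print('scaled weights:', scaled_model.coef_, scaled_model.intercept_)

# The binary column's weight is unchanged by scaling the numeric column
smoker_weight_same = np.isclose(raw_model.coef_[1], scaled_model.coef_[1])
# The numeric weight is the raw weight multiplied by the column's standard deviation
age_weight_rescaled = np.isclose(
    scaled_model.coef_[0], raw_model.coef_[0] * scaler.scale_[0]
)
# The intercept absorbs the shift by the column's mean
intercept_shifted = np.isclose(
    scaled_model.intercept_,
    raw_model.intercept_ + raw_model.coef_[0] * scaler.mean_[0],
)
print(smoker_weight_same, age_weight_rescaled, intercept_shifted)
```

So unchanged weights for the encoded columns are what you would expect; only the scaled columns and the intercept move.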
Hope this helps!

Does that mean that the output on the right above is correct?

I was confused why after scaling the data, the weights of the categorical data also changed in the video whereas my results did not change.

These are the results of the weights of each feature after scaling and combining with encoded fields as shown in the video.

When I compare my results to this, only the weights of numerical features are the same. I am not sure if I am missing a step somewhere to get the same results as shown above.

If you look at the lecture notebook, its output resembles your notebook's output. If your model gives reasonable output and a low loss, then it can be considered a correct/good model.
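One way to sanity-check the loss is to compare it with and without scaling. The sketch below uses synthetic data again (the `bmi`, `smoker` and `charges` arrays are hypothetical), and the `rmse` function is an assumed definition of the notebook's helper, which is not shown in this thread. For a plain least-squares linear regression, standardizing a column should leave the predictions, and therefore the RMSE, essentially unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def rmse(targets, predictions):
    """Root mean squared error (assumed definition of the notebook's helper)."""
    return np.sqrt(np.mean(np.square(targets - predictions)))

# Synthetic stand-in data (hypothetical, not the medical dataset)
rng = np.random.default_rng(1)
n = 300
bmi = rng.normal(30, 6, size=n)
smoker = rng.integers(0, 2, size=n)
charges = 400 * bmi + 20000 * smoker + rng.normal(0, 2000, size=n)

# Same features, once raw and once with the numeric column standardized
raw_inputs = np.column_stack([bmi, smoker])
scaled_bmi = StandardScaler().fit_transform(bmi.reshape(-1, 1))
scaled_inputs = np.column_stack([scaled_bmi, smoker])

raw_model = LinearRegression().fit(raw_inputs, charges)
scaled_model = LinearRegression().fit(scaled_inputs, charges)

raw_loss = rmse(charges, raw_model.predict(raw_inputs))
scaled_loss = rmse(charges, scaled_model.predict(scaled_inputs))
print(raw_loss, scaled_loss)

# Scaling changes the weights but not the quality of the fit
losses_match = np.isclose(raw_loss, scaled_loss)
print(losses_match)
```

If the two losses agree on your own data as well, the scaled model is fine and the weights are simply expressed in different units.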
