Correlations of encoded variables with the target variable

Hello, let's consider a case where the dataset has numerical and categorical columns. To fit a model on the categorical data, we encode it.

Now, after encoding, suppose we evaluate correlations and one particular encoded categorical feature has the highest correlation with the target variable. What exactly should be considered the most highly correlated feature: the sub-feature (encoding splits a categorical feature into discrete sub-features, which are not part of the original dataset's feature matrix) or the parent feature itself?

e.g. (smoker → smoker_no, smoker_yes, where smoker_yes is the feature most highly correlated with the target, insurance premium)
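To make the setup concrete, here is a minimal sketch of the encode-then-correlate step using pandas. The data and the column names `age` and `premium` are made up purely for illustration; only `smoker`, `smoker_no` and `smoker_yes` come from the example above.

```python
import pandas as pd

# Toy data, invented for illustration only (not from any real dataset).
df = pd.DataFrame({
    "age": [25, 40, 33, 58, 47, 29],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "premium": [2100.0, 9800.0, 2600.0, 14500.0, 3900.0, 8700.0],
})

# One-hot encode the categorical column: smoker -> smoker_no, smoker_yes.
encoded = pd.get_dummies(df, columns=["smoker"], dtype=float)

# Pearson correlation of every encoded column with the target.
corr_with_target = encoded.corr()["premium"].drop("premium")

# Rank features by absolute correlation, strongest first.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

For a two-level category such as `smoker`, the two dummy columns are mirror images of each other (`smoker_no = 1 - smoker_yes`), so their correlations with the target have the same magnitude and opposite signs. In that case the "sub-feature" and the "parent feature" carry the same information.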

Even if one sub-feature has the highest correlation, it would still be a good idea to keep the other sub-features around.

Hi @birajde, I think I have not conveyed my query correctly.

I am not just talking about the encoded (categorical) variables, but about all features in general, when trying to decide (interpret) which component is the most impactful.

E.g., in assignment 1, after encoding, we have the following features with the most impact on house prices:

In order of impact, highest first:

125 RoofMatl_ClyTile (If house roof material is made of clay)

102 Condition2_PosN

275 PoolQC_Ex (If pool quality is excellent)

15 GrLivArea (Numeric feature)

277 PoolQC_Gd (If pool quality is good)

Now, if my business team were to ask me for the top x features that impact house prices, how should I answer?

A) The most impactful feature is whether the roof is made of clay tile
-----> One concern with mentioning this is that the feature RoofMatl_ClyTile is not part of the original feature matrix! It is a feature that I created so that the data could be processed by my model, so can I still mention it?

B) Or should I just say that the most impactful feature is the roof material (RoofMatl)?
--------> The concern with this is that the other roof materials do not correlate with the house price nearly as much as the encoded sub-feature mentioned above, RoofMatl_ClyTile

My query is how to interpret correlations of encoded variables with respect to the original feature matrix. Is it okay to create features of our own that are not part of the original dataset's feature matrix?
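One possible way to look at both levels side by side, offered only as a sketch and not as the method used in the assignment, is to compute the per-level correlations and also a single parent-level number such as the correlation ratio (eta squared: the share of target variance explained by the category's group means). The helper `correlation_ratio`, the column name `price`, and all the values below are hypothetical.

```python
import pandas as pd

def correlation_ratio(categories: pd.Series, target: pd.Series) -> float:
    """Eta squared: share of target variance explained by group means."""
    overall_mean = target.mean()
    grouped = target.groupby(categories)
    # Between-group sum of squares, weighted by group size.
    ss_between = (grouped.size() * (grouped.mean() - overall_mean) ** 2).sum()
    ss_total = ((target - overall_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "RoofMatl": ["CompShg", "CompShg", "ClyTile", "WdShngl", "CompShg", "WdShngl"],
    "price": [180000.0, 210000.0, 160000.0, 390000.0, 195000.0, 350000.0],
})

# Sub-feature view: correlation of each one-hot level with the target.
dummies = pd.get_dummies(df["RoofMatl"], prefix="RoofMatl", dtype=float)
per_level = dummies.corrwith(df["price"])

# Parent-feature view: one number for RoofMatl as a whole.
eta_squared = correlation_ratio(df["RoofMatl"], df["price"])

# How many rows back each level; a strong correlation on very few rows is fragile.
level_counts = df["RoofMatl"].value_counts()

print(per_level, eta_squared, level_counts, sep="\n\n")
```

The parent-level eta squared can never be smaller than the squared correlation of any single level, so option B is not contradicted by one level dominating. Reporting the parent feature together with the level that drives it (and how many rows that level covers) lets the business team see both.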