Hi guys, sorry to bother again, but I got myself into a tricky situation as follows:
- total dataset rows: 180K
- total target rows: 180K with value between 0 and 100
- I have one feature with 50K values and 130K NaNs. This feature has some explanatory power on the target, so dropping it would be a loss for the model. The 50K values also range from 0 to 100, since they come from a question asking whether the user likes the artist, so NaN means the user didn't answer the question. I am considering imputing the mean value of 48/50 into the 130K NaNs, but I'm not sure about this operation, since NaN could also mean the user doesn't really like the artist that much. What should I do? I don't want to skew the dataset.
That’s tough, since over 70% of your values are NaN. If you wish to use the mean, you could first plot the distribution of the observed values and check whether they are roughly normally distributed, in which case you may be able to get away with it. Otherwise you could try the median (and also check how many modes the distribution actually has - multiple modes may be present). A different approach would be regression-based imputation. That is to say: model this feature as a function of other features in your dataset and use that relationship to impute the missing values. For instance, if you find that your feature (call it x_2) is linearly correlated with another feature (call it x_3), then something along the lines of x_2 = m*x_3 + c would do a better job of imputing the missing values than a single-point estimate (e.g. mean, median).
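To make the two options concrete, here is a minimal sketch in pandas/scikit-learn using synthetic data (the column names `x2`/`x3`, the correlation strength, and the 70% missingness are all assumptions for illustration, not your actual dataset). It compares median imputation against fitting x_2 = m*x_3 + c on the observed rows:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the dataset: x3 is fully observed, x2 (the
# 0-100 "do you like the artist?" score) will be missing for ~70% of rows.
n = 1000
x3 = rng.uniform(0, 100, n)
x2_true = np.clip(0.8 * x3 + rng.normal(0, 5, n), 0, 100)  # x2 correlates with x3

df = pd.DataFrame({"x2": x2_true, "x3": x3})
missing = rng.random(n) < 0.7          # knock out ~70% of x2
df.loc[missing, "x2"] = np.nan

# Option 1: single-point imputation with the median of observed values.
median_filled = df["x2"].fillna(df["x2"].median())

# Option 2: regression-based imputation - fit x2 = m*x3 + c on the
# observed rows, then predict x2 where it is missing.
observed = df["x2"].notna()
model = LinearRegression().fit(df.loc[observed, ["x3"]], df.loc[observed, "x2"])
reg_filled = df["x2"].copy()
reg_filled[~observed] = model.predict(df.loc[~observed, ["x3"]])

# Compare imputed values against the held-out truth on the missing rows.
mae_median = np.abs(median_filled[missing] - x2_true[missing]).mean()
mae_reg = np.abs(reg_filled[missing] - x2_true[missing]).mean()
print(f"median MAE: {mae_median:.1f}, regression MAE: {mae_reg:.1f}")
```

On correlated data like this, the regression imputation recovers the missing values far more closely than the median does; if x_2 turns out to be only weakly related to the other features, the gap shrinks and a simple median (or a separate "did not answer" indicator column) may be the safer choice.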