- What causes dummy variable trap?
- How should one deal with it or avoid it?
Using categorical data in multiple regression models is a powerful way to incorporate non-numeric data into a regression model. Categorical data refers to values that represent categories: a fixed, typically unordered set of possible values, for instance gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables: variables containing values such as 1 or 0 that indicate the presence or absence of a categorical value.
Categorical variables can be divided into two subcategories based on the kind of elements they group:
Nominal variables are those whose categories do not have a natural order or ranking. Examples that fit in this category are gender, postal codes, hair colour, etc.
Ordinal variables have an inherent order which is significant. An example would be tracking student grades, where Grade 1 > Grade 2 > Grade 3. Another example would be the socio-economic status of people, where “high income” > “low income”.
Now that we know what categorical variables are, they have to be converted into meaningful numerical representations. This process is called encoding.
Let us look at the encoding technique provided by the pandas library. The `pandas.get_dummies()` function converts categorical variables into dummy or indicator variables. Let us look at an example here:
We can see that there are two categorical columns in the above dataset, i.e. `Gender` and `EducationField`. Let’s encode them into numerical quantities using `pandas.get_dummies()`, which returns a dummy-encoded data frame.
`Gender` gets converted into two columns, `Gender_Female` and `Gender_Male`, each having values of either zero or one. For instance, `Gender_Female` has a value of 1 at places where the concerned employee is female and a value of 0 when not. The same is true for the column `Gender_Male`.
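As a minimal sketch of this step (the employee values below are made up for illustration, since the original dataset isn’t reproduced here), `pandas.get_dummies()` applied to the whole frame encodes every non-numeric column:

```python
import pandas as pd

# A small illustrative frame standing in for the employee dataset above
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "EducationField": ["Medical", "Life Sciences", "Medical"],
    "MonthlyIncome": [5000, 6000, 4500],
})

# Numeric columns pass through untouched; each categorical column
# is expanded into one indicator column per category
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

Note that `Gender` becomes two columns and `EducationField` becomes one column per field, while `MonthlyIncome` is left as-is.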
Let’s say we want to use the given data to build a machine learning model that can predict employees’ monthly salaries. This is a classic example of a regression problem where the target variable is `MonthlyIncome`. If we were to use `pandas.get_dummies()` to encode the categorical variables, the issue of multi-collinearity would arise. One of the assumptions of a regression model is that the observations must be independent of each other.
Check out “What are the effects of multi-collinearity on linear regression models?” to learn more about multi-collinearity and why it is undesirable.
Multi-collinearity occurs when independent variables in a regression model are correlated. If we look closely, the `Gender_Female` and `Gender_Male` columns are collinear: a value of 1 in one column automatically implies a value of 0 in the other.
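We can verify this collinearity directly (again on a made-up `Gender` column): the two dummy columns have a Pearson correlation of exactly -1.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female", "Male"]})
dummies = pd.get_dummies(df["Gender"], prefix="Gender").astype(int)

# One column fully determines the other, so the correlation is -1
corr = dummies["Gender_Female"].corr(dummies["Gender_Male"])
print(corr)
```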
We intended to solve the problem of using categorical variables, but got trapped by the problem of multi-collinearity. This is called the Dummy Variable Trap; put simply, in this case it means:

`Gender_Female = 1 - Gender_Male`
Multi-collinearity is undesirable, and every time we encode variables with `pandas.get_dummies()`, we’ll encounter this issue. One way to overcome it is by dropping one of the generated columns. So, we can drop either `Gender_Female` or `Gender_Male` without losing any information. Fortunately, `pandas.get_dummies()` has a parameter called `drop_first` which, when set to `True`, does precisely that.
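A quick sketch of `drop_first` in action (toy `Gender` column again): with the first category dropped, only n-1 dummy columns remain, which avoids the trap.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Female", "Male", "Male"]})

# drop_first=True drops the first category per column,
# leaving n-1 dummy columns instead of n
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

Here only `Gender_Male` survives; a female employee is simply encoded as `Gender_Male = 0`.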
One-hot encoding is used in regression models following label encoding. It creates new attributes according to the number of classes present in the categorical attribute: if there are n categories in a categorical attribute, n new attributes will be created, again known as dummy variables.
Additionally, one can use `handle_unknown="ignore"` to avoid potential issues due to rare or unseen categories. Note that you can also drop one of the categories per feature in `OneHotEncoder` by setting the parameter `drop='first'`.
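A minimal sketch of scikit-learn's `OneHotEncoder` with `drop="first"` (toy data, same `Gender` example as above):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Gender": ["Female", "Male", "Female"]})

# drop="first" mirrors pandas' drop_first=True: the first category
# (alphabetically, "Female") is dropped, leaving one indicator column
enc = OneHotEncoder(drop="first")
encoded = enc.fit_transform(df[["Gender"]]).toarray()
print(encoded)
```

Unlike `pandas.get_dummies()`, the fitted encoder remembers the categories it saw, so the same mapping can be applied consistently to new data at prediction time.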
- `pandas.get_dummies()`: pandas.get_dummies — pandas 1.3.0 documentation
- `sklearn.preprocessing.OneHotEncoder`: sklearn.preprocessing.OneHotEncoder — scikit-learn 0.24.2 documentation
Nice article @tanyachawla! To add to this, I have a query: how do we encode ordinal variables for regression? How will dummies capture the ranking of ordinal variables?
In ordinal encoding, each unique category value is assigned an integer value. scikit-learn provides `sklearn.preprocessing.OrdinalEncoder`, which does the same. You can check out the following link:
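As a quick sketch of how this preserves ranking (the income column below is made up; the key point is the explicit `categories` argument, which sets the order rather than defaulting to alphabetical):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Income": ["low income", "high income", "low income"]})

# Passing categories explicitly preserves the intended ranking:
# "low income" -> 0.0, "high income" -> 1.0
enc = OrdinalEncoder(categories=[["low income", "high income"]])
df["Income_encoded"] = enc.fit_transform(df[["Income"]])
print(df["Income_encoded"].tolist())
```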
Thanks @tanyachawla! Maybe you can also include, in a future article on ordinal encoding, what @aakashns-6l3 said yesterday about including weights. I am not sure, and I didn’t see it in `OrdinalEncoder`, but adding weights makes better sense if we have to do regression.