 What causes dummy variable trap?
 How should one deal with it or avoid it?
Using categorical data in multiple regression models is a powerful method to include non-numeric data types in a regression model. Categorical data refers to data values that represent categories: values drawn from a fixed, unordered set, for instance, gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables: variables containing values such as 1 or 0 that indicate the presence or absence of the categorical value.
Categorical variables can be divided into two subcategories based on the kind of elements they group:

Nominal variables are those whose categories do not have a natural order or ranking. Examples that fit in this category are gender, postal codes, hair colour, etc.

Ordinal variables have an inherent order which is somehow significant. An example would be tracking student grades, where Grade 1 > Grade 2 > Grade 3. Another example would be the socioeconomic status of people, where "high income" > "low income".
Now that we know what categorical variables are, they have to be converted into meaningful numerical representations. This process is called encoding.
Encoding with pandas.get_dummies()
Let us look at the encoding technique provided by the pandas library. The pandas.get_dummies() function converts categorical variables into dummy or indicator variables. Let us look at an example here:
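As a minimal sketch, here is a small hypothetical employee dataset (the column names and values are assumptions for illustration, not the original dataset) encoded with pandas.get_dummies():

```python
import pandas as pd

# Hypothetical employee data; columns assumed for illustration
df = pd.DataFrame({
    "Age": [34, 28, 45, 39],
    "Gender": ["Female", "Male", "Male", "Female"],
    "EducationField": ["Life Sciences", "Medical", "Marketing", "Medical"],
    "MonthlyIncome": [5993, 2090, 12990, 4750],
})

# get_dummies encodes the non-numeric columns into indicator columns
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

Numeric columns such as Age and MonthlyIncome pass through unchanged; only the categorical columns are expanded into dummy columns.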
We can see that there are two categorical columns in the above dataset, i.e. Gender and EducationField. Let's encode them into numerical quantities using pandas.get_dummies(), which returns a dummy-encoded data frame.
The column Gender gets converted into two columns, Gender_Female and Gender_Male, having values of either zero or one. For instance, Gender_Female has a value of 1 at places where the concerned employee is female and a value of 0 when not. The same is true for the column Gender_Male.
Let's say we want to use the given data to build a machine learning model that can predict employees' monthly salaries. This is a classic example of a regression problem where the target variable is MonthlyIncome.
If we were to use pandas.get_dummies() to encode the categorical variables, the issue of multicollinearity would arise. One of the assumptions of a regression model is that the observations must be independent of each other.
Check out What are the effects of multicollinearity on linear regression models? to learn more about multicollinearity and why it is undesirable.
Multicollinearity occurs when independent variables in a regression model are correlated. If we look closely, the Gender_Female and Gender_Male columns are collinear. This is because a value of 1 in one column automatically implies 0 in the other.
We intended to solve the problem of using categorical variables, but got trapped by the problem of multicollinearity. This is called the Dummy Variable Trap; put simply, in this case it means Gender_Female = 1 - Gender_Male.
Multicollinearity is undesirable, and every time we encode variables with pandas.get_dummies(), we'll encounter this issue. One way to overcome it is by dropping one of the generated columns. So, we can drop either Gender_Female or Gender_Male without losing any information. Fortunately, pandas.get_dummies() has a parameter called drop_first which, when set to True, does precisely that.
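A quick sketch of drop_first in action, again on assumed toy data:

```python
import pandas as pd

# Toy data, columns assumed for illustration
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "EducationField": ["Life Sciences", "Medical", "Marketing", "Medical"],
})

# drop_first=True drops the first dummy of each categorical column,
# which avoids the dummy variable trap
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

Here Gender_Female is dropped and Gender_Male alone carries the information; the same happens for one of the EducationField dummies.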
Encoding with sklearn.preprocessing.OneHotEncoder
One-hot encoding is used in regression models following label encoding. It enables us to create new attributes according to the number of classes present in the categorical attribute, i.e. if there are n categories in a categorical attribute, n new attributes will be created; these are again known as dummy variables.
Additionally, one can use handle_unknown="ignore" to solve the potential issues due to rare categories.
Note, you can also drop one of the categories per feature in OneHotEncoder by setting the parameter drop='if_binary' or drop='first'.
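A short sketch of both parameters on assumed toy data (note that combining drop with handle_unknown="ignore" requires a recent scikit-learn, so they are shown separately here):

```python
from sklearn.preprocessing import OneHotEncoder

# Toy data: [Gender, EducationField], values assumed for illustration
X = [["Female", "Life Sciences"],
     ["Male", "Medical"],
     ["Male", "Marketing"]]

# handle_unknown="ignore": a category unseen during fit becomes
# an all-zero row at transform time instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X)
unseen = enc.transform([["Other", "Medical"]]).toarray()
print(unseen)

# drop="first" removes one dummy per feature, avoiding the trap
enc_drop = OneHotEncoder(drop="first")
dropped = enc_drop.fit_transform(X).toarray()
print(dropped.shape)  # one fewer column per feature than full one-hot
```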
Documentation:
pandas.get_dummies(): pandas.get_dummies — pandas 1.3.0 documentation
sklearn.preprocessing.OneHotEncoder: sklearn.preprocessing.OneHotEncoder — scikit-learn 0.24.2 documentation
Nice article @tanyachawla. To add to this, I have a query: how should one encode ordinal variables for regression? How will dummies capture the ranking of ordinal variables?
In ordinal encoding, each unique category value is assigned an integer value. sklearn.preprocessing.OrdinalEncoder does the same. You can check out the following link:
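A minimal sketch of OrdinalEncoder, with the income categories from the example above; the explicit ordering passed via the categories parameter is what preserves the ranking (the exact category labels are assumptions for illustration):

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit category order (one list per feature), so that
# low < medium < high maps to 0 < 1 < 2
order = [["low income", "medium income", "high income"]]
enc = OrdinalEncoder(categories=order)

X = [["low income"], ["high income"], ["medium income"]]
result = enc.fit_transform(X)
print(result)  # low income -> 0.0, medium -> 1.0, high -> 2.0
```

Without the categories argument the encoder would sort labels alphabetically, which may not match the intended ranking.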
Thanks @tanyachawla! Maybe you can also include, in a future article on ordinal encoding, what @aakashns6l3 said yesterday about including weights. I am not sure, but I didn't see it in OrdinalEncoder, and adding weights makes better sense if we have to do regression.
Thanks @anubratadas for the suggestion!
You can check out the following link:
How to encode ordinal variables?