What is the dummy variable trap?

  • What causes the dummy variable trap?
  • How should one deal with it or avoid it?

Using categorical data in multiple regression models is a powerful way to include non-numeric data in a model. Categorical data consists of values that represent categories, drawn from a fixed set of possible values, for instance gender (male/female) or season (summer/winter/spring/fall). In a regression model, these values can be represented by dummy variables: variables that take the value 1 or 0 to indicate the presence or absence of a given category.

Categorical variables can be divided into two subcategories based on the kind of elements they group:

  • Nominal variables are those whose categories do not have a natural order or ranking. Examples that fit in this category are gender, postal codes, hair colour, etc.

  • Ordinal variables have an inherent, meaningful order. An example would be student grades, where Grade 1 > Grade 2 > Grade 3. Another example would be the socio-economic status of people, where “high income” > “low income”.

Now that we know what categorical variables are, we need to convert them into meaningful numerical representations. This process is called encoding.

Encoding with pandas.get_dummies()

Let us look at the encoding technique provided by the pandas library. The pandas.get_dummies() function converts categorical variables into dummy or indicator variables. Let us look at an example here:

We can see that there are two categorical columns in the above dataset, i.e. Gender and EducationField. Let’s encode them into numerical quantities using pandas.get_dummies(), which returns a dummy-encoded data frame.

The column Gender gets converted into two columns, Gender_Female and Gender_Male, whose values are either zero or one. For instance, Gender_Female is 1 in rows where the employee is female and 0 otherwise. The same is true for the column Gender_Male.
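Since the original dataset is not reproduced here, the following is a minimal sketch of the encoding step on a small, hypothetical data frame. The rows and values are invented for illustration; only the column names Gender, EducationField and MonthlyIncome come from the discussion above.

```python
import pandas as pd

# Hypothetical employee data; the values are made up for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "EducationField": ["Medical", "Life Sciences", "Marketing", "Medical"],
    "MonthlyIncome": [5200, 6100, 4800, 7000],
})

# Encode both categorical columns; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["Gender", "EducationField"], dtype=int)

print(encoded.columns.tolist())
print(encoded[["Gender_Female", "Gender_Male"]])
```

In every row exactly one of Gender_Female and Gender_Male is 1, which is the property the rest of this discussion relies on.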


Let’s say we want to use the given data to build a machine learning model that can predict employees’ monthly salaries. This is a classic example of a regression problem where the target variable is MonthlyIncome. If we were to use pandas.get_dummies() with its default settings to encode the categorical variables, the issue of multi-collinearity would arise. One of the assumptions of a linear regression model is that no independent variable can be written as an exact linear combination of the others (no perfect multi-collinearity).

Check out “What are the effects of multi-collinearity on linear regression models?” to learn more about multi-collinearity and why it is undesirable.

Multi-collinearity occurs when independent variables in a regression model are correlated. If we look closely, the Gender_Female and Gender_Male columns are perfectly collinear, because a value of 1 in one column automatically implies 0 in the other.

We intended to solve the problem of using categorical variables, but got trapped by the problem of multi-collinearity. This is called the dummy variable trap. Put simply, in this case it means that Gender_Female = 1 - Gender_Male, so in a model with an intercept one of the two columns is redundant.
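One way to see the trap numerically is to check the rank of the design matrix. The sketch below uses NumPy and a made-up gender column (the values are hypothetical); it compares a design matrix containing an intercept plus both dummy columns against one with a single dummy column.

```python
import numpy as np

# Hypothetical dummy columns for five employees.
gender_male = np.array([1, 0, 0, 1, 0])
gender_female = 1 - gender_male          # the exact linear dependence
intercept = np.ones_like(gender_male)

# Intercept plus both dummies: three columns, but one is redundant.
X_full = np.column_stack([intercept, gender_female, gender_male])

# Intercept plus a single dummy: two columns, none redundant.
X_reduced = np.column_stack([intercept, gender_male])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("full:   ", X_full.shape[1], "columns, rank", rank_full)
print("reduced:", X_reduced.shape[1], "columns, rank", rank_reduced)
```

The full matrix has fewer independent columns than it has columns, so ordinary least squares has no unique solution for its coefficients, while the reduced matrix has full column rank.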

Multi-collinearity is undesirable, and whenever we encode a variable with pandas.get_dummies() using its defaults and keep every generated column, we’ll run into this issue. One way to overcome it is to drop one of the generated columns. So, we can drop either Gender_Female or Gender_Male without losing any information. Fortunately, pandas.get_dummies() has a parameter called drop_first which, when set to True, does precisely that by dropping the first level of each encoded variable.
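As a minimal sketch of this fix, again on a hypothetical data frame with invented values, drop_first=True keeps one fewer dummy column per categorical variable:

```python
import pandas as pd

# Hypothetical data; only the column names come from the discussion above.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "MonthlyIncome": [5200, 6100, 4800, 7000],
})

# drop_first=True drops the first level (here "Female", by sort order),
# leaving Gender_Male as the only dummy column for Gender.
encoded = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

print(encoded)
```

The dropped level becomes the baseline: a row with Gender_Male = 0 is understood to be female, so no information is lost.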

Encoding with sklearn.preprocessing.OneHotEncoder

One-hot encoding is another way to encode categorical variables for regression models. In older workflows it was applied after label encoding; current versions of OneHotEncoder accept string categories directly. It creates new attributes according to the number of classes present in the categorical attribute, i.e. if there are n categories in a categorical attribute, n new attributes will be created, again known as dummy variables.
Additionally, one can set handle_unknown="ignore" so that categories not seen during fitting (for example, rare ones) are encoded as all zeros at transform time instead of raising an error.

Note that you can also drop one of the categories per feature in OneHotEncoder by setting the parameter drop='first', or drop='if_binary' to do so only for features with exactly two categories.
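The options above can be sketched as follows. The training values and the unseen category "Other" are hypothetical. The two options are shown on separate encoders because some scikit-learn versions do not allow combining handle_unknown="ignore" with drop.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column; scikit-learn expects 2D input.
train = [["Male"], ["Female"], ["Female"]]

# Keep all categories and tolerate categories unseen during fit.
full_encoder = OneHotEncoder(handle_unknown="ignore")
full = full_encoder.fit_transform(train).toarray()
print(full_encoder.categories_)   # learned categories, in sorted order
print(full)

# A category not seen during fit is encoded as a row of zeros.
unseen = full_encoder.transform([["Other"]]).toarray()
print(unseen)

# Drop the first category per feature to avoid the dummy variable trap.
reduced_encoder = OneHotEncoder(drop="first")
reduced = reduced_encoder.fit_transform(train).toarray()
print(reduced)
```

OneHotEncoder returns a sparse matrix by default, which is why the sketch calls toarray() before printing.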



Documentation:
pandas.get_dummies() : pandas.get_dummies — pandas 1.3.0 documentation
sklearn.preprocessing.OneHotEncoder : sklearn.preprocessing.OneHotEncoder — scikit-learn 0.24.2 documentation


Nice article @tanyachawla. To add to this, I have a query: how do we encode ordinal variables for regression? How will dummies capture the ranking of ordinal variables?

In ordinal encoding, each unique category value is assigned an integer value.
scikit-learn provides sklearn.preprocessing.OrdinalEncoder, which does the same. You can check out the following link:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#examples-using-sklearn-preprocessing-ordinalencoder
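As a rough sketch of ordinal encoding with OrdinalEncoder, the example below uses a hypothetical income column. The three levels and their order are assumptions made for illustration; the order is passed explicitly through the categories parameter.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column; scikit-learn expects 2D input.
income = [["low"], ["high"], ["medium"], ["low"]]

# List the categories from lowest to highest so the codes follow the ranking.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = encoder.fit_transform(income)

print(codes.ravel())
```

Without the categories argument the encoder sorts the category names, which for strings means alphabetical order and would not match the intended ranking here. Note also that integer codes treat the gaps between adjacent levels as equal.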

Thanks @tanyachawla! Maybe a future article on ordinal encoding could also include what @aakashns-6l3 said yesterday about including weights. I am not sure, but I didn’t see that in OrdinalEncoder, and adding weights makes better sense if we have to do regression.


Thanks @anubratadas for the suggestion!
You can check out the following link:
How to encode ordinal variables?