Lesson 3 - Decision Trees and Hyperparameters

:arrow_forward: Lecture Video will be available on the course page :point_up_2:

Topics Covered:

  • Downloading a real-world dataset
  • Preparing a dataset for training
  • Training and interpreting decision trees

:spiral_notepad: Notebooks used in this lesson:

:writing_hand: Please provide your valuable feedback on this link to help us improve the course experience.

:computer: Join the Jovian Discord Server to interact with the course team, share resources and attend the study hours :point_right: Jovian Discord Server

:question: Asking/Answering Questions

Reply to this thread to ask questions. Before asking, scroll through the thread and check if your question (or a similar one) is already present. If yes, just like it. We will give priority to the questions with the most likes. The rest will be answered by our mentors or the community. If you see a question you know the answer to, please post your answer as a reply to that question. Let’s help each other learn!


When I run this code: " %%time
model.fit(X_train, train_targets)" I get the error " AttributeError: ‘DecisionTreeClassifier’ object has no attribute ‘_validate_data’ "
I checked the installed version with !pip list, and it shows sklearn 0.0 and
sklearn-pandas 1.8.0.
If I try to update it by running " !pip install scikit-learn==0.1.2", I get " ERROR: Could not find a version that satisfies the requirement scikit-learn==0.1.2 (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2, 0.21.0, 0.21.1, 0.21.2, 0.21.3, 0.22rc2.post1, 0.22rc3, 0.22, 0.22.1, 0.22.2, 0.22.2.post1, 0.23.0rc1, 0.23.0, 0.23.1, 0.23.2, 0.24.dev0, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2)
ERROR: No matching distribution found for scikit-learn==0.1.2 "

Try upgrading scikit-learn to the latest version using pip install scikit-learn --upgrade.
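
The pip list output quoted in the question can be misleading: as far as I know, the sklearn 0.0 entry is only a placeholder distribution, and the library itself is published as scikit-learn. Here is a minimal sketch for checking what is actually installed, using only the standard library. The helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the distribution is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Look up the real distribution name, not the "sklearn" placeholder.
print("scikit-learn:", installed_version("scikit-learn"))
```

If this prints None or an old version, upgrade and then restart the notebook runtime so the new version is the one that gets imported.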

Sir, why is OneHotEncoder raising a ValueError saying that the input contains NaN?

Hey, are you using Colab? Please run !pip install scikit-learn --upgrade and then restart the notebook.
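
If upgrading is not possible, another option is to fill the missing categories with a placeholder before fitting the encoder. (As far as I know, newer scikit-learn releases, 0.24 and later, treat missing values as their own category, which is why upgrading helps.) This is a rough sketch on made-up data. The column name WindDir and the "Unknown" label are assumptions for illustration, not taken from the lesson.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing value
df = pd.DataFrame({"WindDir": ["N", None, "S", "N"]})

# Replace missing values with an explicit placeholder category
filled = df[["WindDir"]].fillna("Unknown")

# Fit the encoder on the filled column; unseen categories are ignored at transform time
encoder = OneHotEncoder(handle_unknown="ignore").fit(filled)

# transform() returns a sparse matrix by default, so convert it to a dense array
encoded = encoder.transform(filled).toarray()
print(encoder.categories_[0])
print(encoded.shape)
```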

I cannot connect to the Discord server! I keep getting the error:
“Invite invalid”
Please kindly advise!

Use this link to join the server → https://discord.gg/8wHMbUeb

Many thanks! My country, Uganda, is not among the countries for phone number verification!

Why are we using raw_df here? Can we use the train_inputs dataframe instead? The numeric cols are derived from train_inputs, so both raw_df and train_inputs would give the same answer here.
@birajde @hemanth

Probably yes. If you do every operation (imputing, scaling, encoding) on raw_df and then split the data, you can do it on raw_df, but if you split first, you have to do all operations on train_inputs.

So train_inputs should be used there instead of raw_df, considering the instructor split the data at the beginning.

In the lecture notebook, the instructor fits the scaler/imputer on raw_df, the main dataframe, but transforms train_inputs, val_inputs, etc. If it were fit on train_inputs, all the dataframes (training, validation, test) would be imputed with respect to the training data only, so we use the raw dataframe to fit the scaler/imputer and then transform the training, validation, and test dataframes respectively.
(Note: we only fit the scaler/imputer on raw_df; we do not transform raw_df itself.)

So in a sense, it's always better for the scaler/imputer to be fitted on the original dataframe (so it can learn from all the data points rather than only a subset such as train or val) and to transform or change only the copied dataframes (train_inputs being created by the .copy() method).

But in this case the answer would be the same whether we use train_inputs or raw_df, as the numeric cols are the same in both.

Yes, the numeric cols are the same, but the data in raw_df and train_inputs are not. raw_df has the properties of the train, validation, and test data combined, while train_inputs has the properties of the training data only.
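
To make the difference concrete, here is a small sketch on made-up numbers (not the lesson's dataset) comparing the fill value an imputer learns from the full data with the one it learns from the training rows only. The arrays and the 4/2 split are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single numeric column with missing values
raw = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])

# Pretend the first four rows are the training set and the rest are validation
train, val = raw[:4], raw[4:]

# Same column, but the imputers see different rows when fitting
imputer_all = SimpleImputer(strategy="mean").fit(raw)
imputer_train = SimpleImputer(strategy="mean").fit(train)

# statistics_ holds the learned fill value for each column
fill_all = imputer_all.statistics_[0]      # mean of 1, 2, 4, 100
fill_train = imputer_train.statistics_[0]  # mean of 1, 2, 4

print(fill_all, fill_train)
```

The column list is identical in both cases, but the learned fill values differ. Fitting on the full dataframe means the fill value is influenced by validation and test rows, while fitting on train_inputs keeps those rows unseen. Which one you want depends on how strictly you need to keep the validation and test sets separate.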

While using OrdinalEncoder, the following line of code is throwing an error

encoded_cols = list(encoder.feature_names_in_(categorical_cols))

Error: TypeError: ‘numpy.ndarray’ object is not callable