One of the project datasets mentioned during the project review was “Home Credit Default Risk”, which contains many CSV files, including:
These CSV files have their own IDs as table primary keys to join on:
Merging these datasets, each with its own set of IDs, into a consolidated dataset gets messy.
Are there any recommendations or hints on merging this many files with different IDs?
You could chain merges:
new_df = df1.merge(df2, left_on="ID1", right_on="ID2") \
            .merge(df3, left_on="ID1", right_on="ID3")
# and so on
For such a small number of files, this should be quick to do.
Thanks for your comment.
Given the following screenshot of the dataset diagram on the Kaggle competition page, I am not sure of the proper order/way to perform such a merge (please note the dashed lines as well) so that it accurately captures the one-to-one relationships shown in the attached diagram, and so that the merge result and analysis can be traced back to the original diagram when necessary.
X → Y && Y → X
I suspect that maybe we need to analyze the impact of different permutations of connected datasets on the TARGET variable separately, rather than consolidating all of them.
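One way to keep a merge traceable back to the diagram's relationship types is pandas' validate and indicator options. A minimal sketch with made-up data (SK_ID_CURR and SK_ID_PREV are assumed key names from the competition tables; the real tables are much wider):

```python
import pandas as pd

# Hypothetical miniature versions of two of the competition tables.
app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
prev = pd.DataFrame({"SK_ID_CURR": [1, 2, 2], "SK_ID_PREV": [10, 20, 21]})

# validate= raises MergeError if the data violates the expected
# relationship ("one_to_one", "one_to_many", "many_to_one", "many_to_many"),
# so each merge stays consistent with the diagram.
merged = app.merge(prev, on="SK_ID_CURR", how="left", validate="one_to_many")

# indicator=True records where each row came from ("left_only",
# "right_only", or "both"), which is useful for auditing the merge.
audit = app.merge(prev, on="SK_ID_CURR", how="left", indicator=True)
print(audit["_merge"].value_counts())
```

If a merge that the diagram says should be one-to-one raises a MergeError, that tells you immediately which table violates the expected relationship.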
The lines (and image) are probably derived from UML notation, specifically an ER (entity-relationship) database diagram.
You can control how the merge happens (more or less) with the how argument to merge.
It seems that the notation shows a many-to-one relation with some tables. The dashed lines probably indicate that such a column could be NULL (NaN, or whatever else), representing no current balance.
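To illustrate, here is a minimal sketch (toy data, hypothetical column names) of how the how argument changes which rows survive a merge, and how unmatched rows show up as NaN:

```python
import pandas as pd

# Toy frames standing in for an applications table and a balance table.
apps = pd.DataFrame({"ID": [1, 2, 3]})
balance = pd.DataFrame(
    {"ID": [2, 3, 3, 4], "AMT_BALANCE": [100.0, 50.0, 75.0, 20.0]}
)

inner = apps.merge(balance, on="ID", how="inner")  # only matching IDs
left = apps.merge(balance, on="ID", how="left")    # keep all apps; ID 1 gets NaN
outer = apps.merge(balance, on="ID", how="outer")  # union of keys from both sides

print(len(inner), len(left), len(outer))
```

With how="left" the applications table keeps all its rows, and applicants with no balance record simply end up with NaN in AMT_BALANCE, matching the dashed-line reading above.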
Even better would be to know what you are trying to achieve here.
Thank you for follow up.
I am trying to construct (by merging all the CSV files required for training) a consolidated dataset (similar to what is provided in the course lessons) that includes all the fields relevant to predicting TARGET, so that I can progress to model training.
In the course lessons, the only merging involves just two or three files with the same IDs. But here, as is evident from the UML diagram, there is a spiderweb of connections between datasets with different IDs and connection types, which makes the data consolidation process somewhat fuzzy.
That way I cannot trace the origin of a particular decision back to a particular CSV file in order to justify the accuracy and veracity of the model predictions.
I think this is kinda trickier than it seems at first look.
I think you should start training with only the current “loan applications” and see how it behaves.
Once you have the basic case covered, you can then focus on incorporating the additional data.
I would personally look first at the previous applications, aggregating the data contained in them to create some sort of “one row per applicant” table.
You could add columns like “average loan value”, “number of loans”, “maximum loan taken”, “number of successful/failed applications” (if present; not sure here, not that pro at banking), and so on.
Not sure what the “behavioral data” is, but it seems this could be replicated values, the same for each applicant.
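A minimal sketch of that aggregation idea with groupby (toy data; SK_ID_CURR and AMT_CREDIT are assumed column names from the competition tables, and the derived column names are just illustrative):

```python
import pandas as pd

# Hypothetical "previous applications" table: several rows per applicant.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT": [1000.0, 3000.0, 500.0, 1500.0, 2500.0],
})

# Collapse to one row per applicant with named summary columns.
agg = prev.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(
    avg_loan_value="mean",
    number_of_loans="count",
    maximum_loan_taken="max",
).reset_index()

print(agg)
```

The resulting one-row-per-applicant frame can then be merged back onto the main applications table on SK_ID_CURR without creating duplicate rows.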
Thank you for clarification.
Sure. I will try to build up the consolidated dataset on an incremental basis.