Working on an ML model with datasets of different sizes and features


I've been working on a bank dataset where the problem statement is to predict the probability of a loan default. I've been given several datasets, but each one has different features and a different number of rows than the main file, which contains around 5.7 million records. I want to understand how to approach this: if I simply merge all the datasets, the merge will create a lot of unnecessary null values. My current idea is to randomly sample rows from the main file so it matches the size of the other datasets, merge them, and then proceed with model building.
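To make the idea concrete, here is a minimal sketch of the sampling-then-merging step in pandas. All column names (`customer_id`, `default_flag`, `avg_balance`) are hypothetical placeholders, and the tiny in-memory frames just stand in for the real files; `indicator=True` is used to count how many rows still fail to match and would therefore produce nulls.

```python
import pandas as pd

# Hypothetical stand-ins for the real files; column names are assumptions.
main = pd.DataFrame({
    "customer_id": range(10),
    "default_flag": [0, 1] * 5,
})
extra = pd.DataFrame({
    "customer_id": [1, 3, 5, 7],
    "avg_balance": [1000.0, 2500.0, 300.0, 4200.0],
})

# Randomly sample the main file down to the size of the smaller dataset.
sampled = main.sample(n=len(extra), random_state=42)

# Left-join on the shared key; the `_merge` indicator column shows which
# sampled rows found no match in `extra` (those rows get null features).
merged = sampled.merge(extra, on="customer_id", how="left", indicator=True)
print(merged["_merge"].value_counts())
```

Note that random sampling does not guarantee the sampled rows actually appear in the smaller dataset, so some nulls can still show up; sampling from the intersection of keys (or using an inner join) would avoid that, at the cost of a non-random subset.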
Is this a reasonable approach, or is there a better way to handle it?