6 Different Solutions to the Loan Forecasting Problem
Loan forecasting using probabilistic modeling, regression, and survival analysis gave different outputs
Two months ago, I posed the loan status forecasting problem as a data science challenge in a Medium article. The goal was to get participants to attempt the problem and share their version of the solution.
To date, I’ve had feedback from 5 participants. Before sharing the output from these participants, here is the problem statement for the exercise.
Model for forecasting loan status
Instructions: In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:
- First, the borrower receives the funds. This event is called the origination.
- The borrower then makes regular repayments until one of the following happens:
(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.
(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.
In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:
- The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
- For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.
We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.
The dataset for this problem can be downloaded from this GitHub repository.
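Loans that had not charged off when the data was collected are only partially observed: they may still charge off before their term ends. Survival analysis, one of the methods named in the subtitle, handles this kind of censored data. Below is a minimal sketch of a Kaplan-Meier estimate of the charged-off fraction. It is an illustration under stated assumptions, not any participant's actual solution. The function name, the toy arrays, and the choice of 3 × 365 days for the term length are all assumptions made for the example. Missing charge-off days are represented as NaN.

```python
import numpy as np

TERM_DAYS = 3 * 365  # assumed length of the 3-year term, in days


def km_chargeoff_fraction(days_since_origination, days_to_chargeoff,
                          term_days=TERM_DAYS):
    """Kaplan-Meier estimate of the fraction charged off by `term_days`.

    days_since_origination: days each loan had been observed at collection.
    days_to_chargeoff: days to charge-off, NaN if the loan had not charged off.
    """
    observed = np.asarray(days_since_origination, dtype=float)
    chargeoff = np.asarray(days_to_chargeoff, dtype=float)

    event = ~np.isnan(chargeoff)                     # True if charge-off was seen
    duration = np.where(event, chargeoff, observed)  # censored loans use observed days

    survival = 1.0
    for t in np.unique(duration[event]):             # sorted distinct charge-off days
        if t > term_days:
            break
        at_risk = np.sum(duration >= t)              # loans still under observation at t
        charged = np.sum((duration == t) & event)    # charge-offs exactly at t
        survival *= 1.0 - charged / at_risk
    return 1.0 - survival


# Toy data (made up for illustration): four loans, two already charged off.
toy_observed = [100, 200, 300, 400]
toy_chargeoff = [50, np.nan, 150, np.nan]
toy_fraction = km_chargeoff_fraction(toy_observed, toy_chargeoff)
print(f"Estimated charged-off fraction: {toy_fraction:.2f}")
```

One caveat with this approach: Kaplan-Meier only estimates the curve up to the longest observed duration. If no loan in the dataset has been observed for the full 3 years, the value at the end of the term has to be extrapolated, and that extrapolation is one place where different contributors can reasonably make different choices.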
Contributions from participants
As mentioned earlier, 5 participants have attempted this problem and shared their solutions. The goal of the project is to estimate what fraction of the loans provided in the dataset will have charged off by the time all of their 3-year terms are finished. Table 1 shows the method and the predicted % of charged-off loans for each participant, along with my own solution.

From Table 1, we observe a large variance in the predicted values of charged-off loans. The prediction from Monte-Carlo simulation seems to agree nicely with the results obtained using Survival Analysis. Even though Participant 1 and Participant 2 used the same method (Survival Analysis), they obtained slightly different predictions (14.8% and 17.0%, respectively). Participants 3 and 4 obtained smaller predicted values, 10.0% and 6.8% respectively. Apart from Participant 5, all predicted values are at or below 17.0%. The prediction from Participant 5 acts as an outlier and contributes most of the overall variance. The mean predicted value of the % of charged-off loans from all participants (including the original proposed solution from Benjamin Tayo) is 22.2% when Participant 5 is included, or 12.7% when Participant 5 is excluded.
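The two means quoted above can be used as a quick consistency check on the individual predictions. The sketch below back-calculates the two values that are not stated in this paragraph: the original proposed solution and Participant 5's prediction. It assumes the means are simple averages over 5 and 6 predictions respectively. Because the quoted means are rounded, the implied values are approximate.

```python
# Predictions quoted in the text (% of loans predicted to charge off).
quoted = {
    "Participant 1": 14.8,
    "Participant 2": 17.0,
    "Participant 3": 10.0,
    "Participant 4": 6.8,
}

mean_without_p5 = 12.7  # quoted mean over 5 predictions (Participant 5 excluded)
mean_with_p5 = 22.2     # quoted mean over all 6 predictions

# Sum of the 5 predictions that exclude Participant 5.
sum_without_p5 = 5 * mean_without_p5

# Implied value of the original proposed solution (not quoted directly).
implied_original = sum_without_p5 - sum(quoted.values())

# Implied value of Participant 5's prediction.
implied_p5 = 6 * mean_with_p5 - sum_without_p5

print(f"Implied original solution: {implied_original:.1f}%")
print(f"Implied Participant 5:     {implied_p5:.1f}%")
```

The implied values come out at roughly 14.9% and 69.7%. Both fit the rest of the text: the first sits close to Participant 1's Survival Analysis result of 14.8%, and the second matches the 70% upper end of the range reported in the summary.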
Summary
In summary, we’ve analyzed solutions to the loan forecasting data science project from different contributors. The predicted % of charged-off loans ranges from 6.8% to 70%. Such a wide spread suggests that the solution to a machine learning project is subjective: it depends greatly on the method and assumptions chosen, and on the experience of the person solving it.
I would like to challenge you to solve this problem yourself and let me know what your solution is. It is a very interesting problem. Please email your comments and a version of your solution to benjaminobi@gmail.com
Also, if you are interested in the solutions from different participants, please let me know and I can send you the Jupyter notebook files or Python Scripts.
Additional data science/machine learning resources
Timeline for Data Science Competence
Essential Maths Skills for Machine Learning