Unsolvable Machine Learning Problems
Sometimes, you simply don’t have the tools in your toolbox for tackling a machine learning problem
An unsolvable machine learning problem, as I use the term here, is a problem that you can’t solve with the tools you currently have. This doesn’t mean that no solution exists. It simply means that the necessary tools are missing from your toolbox. As a data scientist or data science aspirant, you can only solve problems that you have the right set of tools for. That is why it is important for a data scientist to stay on top of his or her game and to keep learning new skills.
Platforms to help you stay up-to-date
Because the field of data science changes continuously, a data scientist needs to keep updating his or her knowledge to stay current with the latest advances. Platforms that can help you to remain up-to-date include the following:
A case study of an unsolvable machine learning problem
Two years ago, I had a screening interview with a financial company that uses data science and analytics to predict the creditworthiness of its customers, that is, how likely they are to repay a loan in full. As part of the interview process, I was assigned a take-home challenge problem. The project description and instructions are given below.
Model for forecasting loan status
Instructions: In this problem, you will forecast the outcome of a portfolio of loans. Each loan is scheduled to be repaid over 3 years and is structured as follows:
- First, the borrower receives the funds. This event is called the origination.
- The borrower then makes regular repayments until one of the following happens:
(i) The borrower stops making payments, typically due to financial hardship, before the end of the 3-year term. This event is called charge-off, and the loan is then said to have charged off.
(ii) The borrower continues making repayments until 3 years after the origination date. At this point, the debt has been fully repaid.
In the attached CSV, each row corresponds to a loan, and the columns are defined as follows:
- The column with header days since origination indicates the number of days that elapsed between origination and the date when the data was collected.
- For loans that charged off before the data was collected, the column with header days from origination to charge-off indicates the number of days that elapsed between origination and charge-off. For all other loans, this column is blank.
We would like you to estimate what fraction of these loans will have charged off by the time all of their 3-year terms are finished. Please include a rigorous explanation of how you arrived at your answer, and include any code you used. You may make simplifying assumptions, but please state such assumptions explicitly. Feel free to present your answer in whatever format you prefer; in particular, PDF and Jupyter Notebook are both fine. Also, we expect that this project will not take more than 3–6 hours of your time.
My Version of the Solution
When I interviewed with the company, I didn’t know how to solve this problem. I simply didn’t have the tools for tackling it. So I attempted a solution using probabilistic modeling based on Monte Carlo simulation. The question to answer was: what fraction of these loans will have charged off by the time all of their 3-year terms are finished?
My model estimated that the fraction of loans that will have charged off by the end of their 3-year terms is 14.8% ± 0.2% (95% confidence interval).
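As an illustration only, here is a minimal sketch of how a Monte Carlo estimate of this kind could be set up. It is not the model that produced the numbers above. It assumes a constant daily charge-off hazard (a strong simplification), uses synthetic data generated inside the example rather than the interview CSV, and the names simulate_chargeoff_fraction and TERM_DAYS are hypothetical.

```python
import numpy as np

TERM_DAYS = 3 * 365  # assumed loan term in days


def simulate_chargeoff_fraction(days_since_origination, days_to_chargeoff,
                                n_sims=1000, seed=0):
    """Monte Carlo estimate of the final charged-off fraction.

    Assumes a constant daily hazard. NaN in days_to_chargeoff marks loans
    that had not charged off when the data was collected.
    """
    rng = np.random.default_rng(seed)
    age = np.asarray(days_since_origination, dtype=float)
    chargeoff_day = np.asarray(days_to_chargeoff, dtype=float)
    charged_off = ~np.isnan(chargeoff_day)

    # Days each loan was observed while still at risk of charging off
    exposure = np.where(charged_off, chargeoff_day, np.minimum(age, TERM_DAYS))
    # Maximum-likelihood hazard for exponential times with censoring
    daily_hazard = charged_off.sum() / exposure.sum()

    # Probability that each still-active loan charges off in its remaining term
    remaining = np.clip(TERM_DAYS - age[~charged_off], 0.0, None)
    p_future = 1.0 - np.exp(-daily_hazard * remaining)

    # One Bernoulli draw per active loan per simulation
    draws = rng.random((n_sims, p_future.size)) < p_future
    fractions = (charged_off.sum() + draws.sum(axis=1)) / age.size
    return fractions.mean(), np.percentile(fractions, [2.5, 97.5])


# Synthetic portfolio (illustrative only, not the interview data)
rng = np.random.default_rng(42)
n_loans = 2000
true_hazard = -np.log(0.85) / TERM_DAYS  # about 15% charge off within the term
age = rng.uniform(0, 730, n_loans)
time_to_chargeoff = rng.exponential(1 / true_hazard, n_loans)
days_to_chargeoff = np.where(time_to_chargeoff < age, time_to_chargeoff, np.nan)

mean_frac, interval = simulate_chargeoff_fraction(age, days_to_chargeoff)
print(f"estimated fraction: {mean_frac:.3f}, 95% interval: {interval.round(3)}")
```

Note that the interval in this sketch reflects only the randomness of the simulated outcomes for a fixed fitted hazard. It does not account for uncertainty in the hazard estimate itself, so it is likely to be too narrow.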
Contributions from participants
Two months ago, I proposed this project in a Medium article. The goal was to get participants to attempt the problem and share their version of the solution. To date, I’ve had feedback from 5 participants. Table 1 shows the method and % of charged off loans from the different participants, including my own solution.
From Table 1, we observe a large variance in the predicted values of charged off loans. The prediction from Monte Carlo simulation agrees nicely with the results obtained using Survival Analysis. Even though Participant 1 and Participant 2 used the same method (Survival Analysis), they obtained slightly different predictions (14.8% and 17.0%, respectively). Participants 3 and 4 obtained smaller predicted values, 10.0% and 6.8% respectively. Apart from Participant 5, all predicted values are 17.0% or lower. The prediction from Participant 5 acts as an outlier, contributing significantly to the overall variance of the predicted values. The mean predicted value of the % of charged off loans from all participants (including the original solution proposed by the author, Benjamin Tayo) is 22.2% when Participant 5 is included, or 12.7% when Participant 5 is excluded.
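The two means quoted above can be checked with a few lines of arithmetic. The sketch below uses only the percentages stated in the text (14.8% for the original Monte Carlo solution and the values for Participants 1 to 4). The value it derives for Participant 5 is implied by the two rounded means, so it is approximate and should not be read as the exact figure from Table 1.

```python
# Predictions (in %) stated in the text: original solution, Participants 1-4
predictions = [14.8, 14.8, 17.0, 10.0, 6.8]

# Mean without Participant 5
mean_without_p5 = sum(predictions) / len(predictions)

# Reported mean with Participant 5 included (six predictions in total)
reported_mean_with_p5 = 22.2

# Participant 5's prediction implied by the two means (approximate, because
# the reported means are rounded to one decimal place)
implied_p5 = reported_mean_with_p5 * (len(predictions) + 1) - sum(predictions)

print(f"mean without Participant 5: {mean_without_p5:.1f}%")
print(f"implied Participant 5 prediction: about {implied_p5:.0f}%")
```

An implied prediction of roughly 70% is far above the other five values, which is consistent with treating Participant 5 as an outlier.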
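It seems that Survival Analysis is the most suitable method for solving the loan status problem, since it is designed for data in which the event of interest (here, charge-off) has not yet been observed for every loan.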
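To give a flavor of the Survival Analysis approach used by Participants 1 and 2, here is a minimal sketch of the Kaplan-Meier estimator applied to data shaped like the two CSV columns. It is not any participant's actual solution. The function name kaplan_meier_chargeoff_fraction and the four-loan toy data are hypothetical, and loans that had not charged off are treated as right-censored at their current age.

```python
import numpy as np


def kaplan_meier_chargeoff_fraction(days_since_origination, days_to_chargeoff,
                                    horizon=3 * 365):
    """Kaplan-Meier estimate of P(charge-off within `horizon` days).

    NaN in days_to_chargeoff marks loans that had not charged off when
    the data was collected (right-censored observations).
    """
    age = np.asarray(days_since_origination, dtype=float)
    chargeoff_day = np.asarray(days_to_chargeoff, dtype=float)
    event = ~np.isnan(chargeoff_day)

    # Observed duration: time to charge-off if seen, otherwise the loan's age
    duration = np.where(event, chargeoff_day, age)

    survival = 1.0
    for t in np.unique(duration[event]):  # distinct charge-off times, sorted
        if t > horizon:
            break
        at_risk = np.sum(duration >= t)              # loans still under observation
        failures = np.sum(event & (duration == t))   # charge-offs at time t
        survival *= 1.0 - failures / at_risk
    return 1.0 - survival


# Toy data: four loans, two of which have charged off (days 50 and 150)
days_since_origination = [100, 200, 300, 400]
days_to_chargeoff = [50, np.nan, 150, np.nan]

fraction = kaplan_meier_chargeoff_fraction(days_since_origination,
                                           days_to_chargeoff)
print(f"estimated charged-off fraction: {fraction:.3f}")
```

One caveat: the Kaplan-Meier curve is only defined up to the longest observed duration. If no loan in the data is close to 3 years old, the estimate covers a shorter window and some form of extrapolation (for example, a parametric survival model) is needed to reach the full term.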
I would like to challenge you to try to solve this problem yourself and let me know what your solution is. It is a very interesting problem. Please email your comments and a version of your solution to: firstname.lastname@example.org
Also, if you are interested in the solutions from the different participants, please let me know and I can send you the Jupyter notebook files or Python scripts.