In this 2-part blog post we’ll try to give you a taste of one of our recently completed POCs, demonstrating the advantages of using Machine Learning (read here) to predict the future number of claims in two different health insurance products. The first part includes a quick review of the health insurance field, its unique settings and obstacles, and the predictions required, and describes the data we had and the questions we had to ask ourselves before modeling. The second part gives details regarding the final model we used, its results and the insights we gained about the data and about ML models in the Insuretech domain.
So, without any further ado – let’s dive into part I!
 
Health insurance
Health insurers offer coverage and policies for various products, such as ambulatory, surgery, personal accidents, severe illness, transplants and much more.  
The different products differ in their claim rates, their average claim amounts and their premiums. And, to make things more complicated, each insurance company usually offers multiple insurance plans for each product, or for a combination of products (e.g. an insurance plan that covers all ambulatory needs and emergency surgery only, up to $20,000). Each plan has its own predefined set of covered incidents and, in some cases, its own predefined cap on the amount that can be claimed.
Given the variety of products and plans (and that’s without even mentioning the fact that health claim rates tend to be relatively low, usually ranging between 1% and 10%), it is not surprising that predicting the number of health insurance claims in a specific year can be a complicated task. Actuaries are the ones responsible for performing it, and they usually predict the number of claims of each product individually (not necessarily differentiating between various insurance plans).
The Data
We decided to focus on two products: Surgery insurance and Ambulatory insurance. We treated the two products as completely separate data sets and problems. We found that while they do have many differences and should not be modeled together, they also have enough similarities that the best methodology for the Surgery analysis was also the best for the Ambulatory insurance.
For each of the two products we were given data for 5 consecutive years, and our goal was to predict the number of claims in the 6th year. The full process of preparing the data, understanding it, cleaning it and generating features could easily be yet another blog post, but in this blog we’ll have to give you the short version – after many preparations we were left with the following data sets (absolute numbers were altered by the same factor in order to enhance confidentiality):
 1. Ambulatory insurance – 568,260 records in the train set with a claim rate of 5.26%. Taking a look at the distribution of claims per record:
Number of claims | Number of records | % of the data
0                | 541,362           | 95.26%
1                | 23,864            | 4.19%
2                | 3,034             | 0.5%
 2. Surgery insurance – This train set is larger: 685,818 records. The claim rate, however, is lower, standing at just 3.04%. The distribution of the number of claims is:
Number of claims | Number of records | % of the data
0                | 665,722           | 97.08%
1                | 19,288            | 2.81%
2                | 808               | 0.1%
Both data sets have over 25 potential features. Again, for the sake of not ending up with the longest post ever, we won’t go over all the features, or explain how and why we created each of them, but we can look at two exemplary features which are commonly used among actuaries in the field: 
  1. Age – age is probably the first feature most people would think of in the context of health insurance: we all know that the older we get, the higher the probability of getting sick and requiring medical attention. In the graph below we can see how well this is reflected in the ambulatory insurance data (the same trend was observed for the surgery data). The x-axis represents age groups and the y-axis represents the claim rate in each age group. The increasing trend is very clear, and this is what makes age a good predictive feature.
  2. Policy seniority – This feature may not be as intuitive as the age feature – why would the seniority of the policy be a good predictor of the health state of the insured? And yet, it is. It may be due to its correlation with age (a policy that started 20 years ago probably belongs to an older insured), or because in the past policies covered more incidents than newly issued policies and therefore generate more claims, or maybe because in the first few years of a policy the insured tend to claim less, not wanting to raise the premium or change the conditions of the insurance. Either way, looking at the claim rate as a function of the year in which the policy was opened (which is equivalent to the policy’s seniority), again for the ambulatory product, we clearly see higher claim rates for older policies (earlier start years).
Some of the other features we considered showed possible predictive power, while others seemed to have no signal in them. Take, for example, the “smoker” feature. This feature equals 1 if the insured smokes, 0 if she doesn’t and 999 if we don’t know (which was, unfortunately, the most common category). Now, if we look at the claim rate in each smoking group using this simple two-way frequency table, we see little difference between the groups, which means we can assume that this feature is not going to be a very strong predictor:
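As an aside, this kind of group-level claim rate (the same idea behind the age and seniority plots above) takes only a few lines of pandas. This is just an illustrative sketch – the toy data and the column names (`smoker`, `claim_count`) are made up and are not our actual data:

```python
import pandas as pd

# Toy stand-in for the training set; "smoker" and "claim_count" are hypothetical column names.
df = pd.DataFrame({
    "smoker":      [0, 1, 999, 999, 0, 1, 999, 0],
    "claim_count": [0, 1, 0,   0,   1, 0, 2,   0],
})
df["made_claim"] = (df["claim_count"] > 0).astype(int)

# Two-way frequency table: smoking status vs. claim / no claim (row-normalized).
freq_table = pd.crosstab(df["smoker"], df["made_claim"], normalize="index")

# Claim rate per smoking group; near-identical rates suggest a weak predictor.
claim_rate_by_group = df.groupby("smoker")["made_claim"].mean()
print(freq_table)
print(claim_rate_by_group)
```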
So, we have the data for both products, we created some features, and at least some of them seem promising in their prediction abilities – looks like we are ready to start modeling, right? 
Well, not quite…
 
A simple regression problem? 
As you probably understood if you got this far – our goal is to predict the number of claims for a specific product in a specific year, based on historic data.
If you have some experience in Machine Learning and Data Science you might be asking yourself “Ok, so we need to predict for each policy how many claims it will make. This sounds like a straightforward regression task!”
Well, not exactly. There were a couple of issues we had to address before building any models:
  1. Is this even a regression problem? On the one hand, a record may have 0, 1 or 2 claims per year, so our target is a count variable – order has meaning and the number of claims is always discrete. These settings fit a Poisson regression problem (see here). On the other hand, the maximum number of claims per year is bounded by 2, so we don’t want to predict more than that, and no regression model can give us such a guarantee. In addition, only 0.5% of records in ambulatory and 0.1% of records in surgery had 2 claims. With such a low rate of multiple claims, maybe it is best to use a classification model with a binary outcome: claim / no claim? Maybe we should have two models – first a classifier to predict if any claims are going to be made, and then a second classifier to determine the number of claims (1 or 2)? We explored several options and found that the best one, for our purposes (see item 3 below), was actually a single binary classification model where we predict claim/no claim for each record. We had to make a small adjustment to account for the records with 2 claims, but you’ll have to wait for part II of this blog to read more about that 😉
  2. Imbalanced data – our “yes” labels (positives) are records which made at least one claim, and our “no” labels (negatives) are records without any claims. Given that claim rates for both products are below 5%, we are obviously very far from the ideal situation of a balanced data set where 50% of observations are negative and 50% are positive. Imbalanced data sets are a known problem in ML and can harm the quality of prediction, especially if one is trying to optimize the accuracy of the model. Accuracy is defined as the fraction of correctly predicted outcomes out of the entire prediction vector. So, in a situation like our surgery product, where the claim rate is less than 3%, a classifier can achieve 97% accuracy by simply predicting “no” for all observations! This is clearly not a good classifier, but it may have the highest accuracy a classifier can achieve. There are many techniques to handle imbalanced data sets (see here for example). Luckily for us, using a relatively simple one like under-sampling did the trick and solved our problem (a minimal sketch of the idea appears after this list).
  3. We don’t want just good claim probabilities! – In the field of Machine Learning and Data Science we are used to thinking of a good model as a model that achieves high accuracy or high precision and recall. And those are good metrics to evaluate models with. But, in this case, our goal is not necessarily to correctly identify the people who are going to make a claim, but rather to correctly predict the overall number of claims. This may sound like a semantic difference, but it’s not. We already saw how a “stupid” model can achieve 97% accuracy on our data. Now, let’s understand why adding precision and recall is not necessarily enough:
  • Say we have 100,000 records on which we have to predict. The claim rate is 5%, meaning 5,000 claims. Now, let’s also say that we’ve built a model, and it’s relatively good: it has 80% precision and 90% recall.
  • $$Recall= \frac{True\: positive}{All\: positives} = 0.9 \rightarrow \frac{True\: positive}{5,000} = 0.9 \rightarrow True\: positive = 0.9*5,000=4,500$$
  • We plug it into our precision equation:$$Precision = \frac{True\: positive}{True\: positive\: +\: False\: positive} = 0.8 \rightarrow \frac{4,500}{4,500\:+\:False\: positive} = 0.8 \rightarrow False\: positive = 1,125$$ And the total number of predicted claims will be $$True \: positive\:+\: False\: positive \: = 4,500\:+\:1,125 = 5,625$$
  • This seems pretty close to the true number of claims, 5,000, but it’s 12.5% higher than it, and that’s too much for us! Alternatively, if we were to tune the model to have 80% recall and 90% precision (and again, this is considered a good model), our expected number of claims would be about 4,444, i.e. the true number of claims is 12.5% higher than our estimate (the short helper function after this list reproduces this arithmetic).
  • Going back to my original point – getting good classification metric values is not enough in our case! And it’s not even the main issue. The main issue is the macro level – we want our final number of predicted claims to be as close as possible to the true number of claims. In the next blog we’ll explain how we were able to achieve this goal.
  4. We also don’t want just a final prediction – the last issue we had to solve, and also the last section of this part of the blog, is that even once we had trained the model, gotten individual predictions and obtained the overall claims estimator, it wasn’t enough. We had to have some kind of confidence interval (here), or at least a measure of variance for our estimator, in order to understand the volatility of the model and to make sure that the results we got were not just “lucky”. Bootstrapping our data and repeatedly training models on the different samples enabled us to obtain multiple estimators, and from them to estimate the required confidence interval and variance (sketched after this list).
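To make items 1 and 2 a bit more concrete, here is a minimal sketch of under-sampling the majority class and training a single binary claim/no-claim classifier. The synthetic data, the feature set and the choice of classifier are illustrative assumptions, not our actual pipeline:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the real policy-level features (~5% positives).
X, y = make_classification(n_samples=50_000, n_features=10, weights=[0.95], random_state=0)
df = pd.DataFrame(X)
df["made_claim"] = y

def undersample(data, label_col="made_claim", seed=42):
    """Randomly drop negative records so the training set is roughly 50/50."""
    pos = data[data[label_col] == 1]
    neg = data[data[label_col] == 0].sample(n=len(pos), random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)

balanced = undersample(df)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(balanced.drop(columns=["made_claim"]), balanced["made_claim"])

# Per-record claim / no-claim predictions; summing them gives a macro-level
# estimate of the total number of claims.
predicted_total = clf.predict(df.drop(columns=["made_claim"])).sum()
print(predicted_total)
```

One caveat worth noting: a classifier trained on a balanced sample tends to over-predict positives at the default 0.5 threshold, so in practice the decision threshold or the predicted probabilities need to be adjusted back toward the original class ratio.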
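The over/under-estimation arithmetic from item 3 can also be captured in a tiny helper (again, a sketch, not part of our codebase):

```python
def implied_total_claims(precision, recall, true_claims):
    """Total number of predicted claims implied by a given precision/recall pair."""
    true_pos = recall * true_claims
    false_pos = true_pos * (1 / precision - 1)
    return true_pos + false_pos

print(implied_total_claims(precision=0.8, recall=0.9, true_claims=5_000))  # 5625.0 (over-estimate)
print(implied_total_claims(precision=0.9, recall=0.8, true_claims=5_000))  # ~4444.4 (under-estimate)
```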
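And for item 4, this is roughly what the bootstrap procedure looks like; `fit_model` is a hypothetical callable that wraps the full pipeline (e.g. under-sampling plus classifier training), and the resampling scheme shown is illustrative:

```python
import numpy as np

def bootstrap_total_claims(train_df, next_year_df, fit_model, n_boot=100, seed=0):
    """Refit the pipeline on bootstrap resamples of the training data and collect
    the resulting total-claims estimates to gauge the estimator's variance."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(n_boot):
        resample = train_df.sample(frac=1.0, replace=True,
                                   random_state=int(rng.integers(1_000_000)))
        model = fit_model(resample)
        totals.append(model.predict(next_year_df).sum())
    totals = np.asarray(totals)
    low, high = np.percentile(totals, [2.5, 97.5])  # 95% percentile interval
    return totals.mean(), totals.std(), (low, high)
```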
In the next part of this blog we’ll finally get to the modeling process! And, just as important, to the results and conclusions we got from this POC 🙂