Predicting claims in health insurance – Part II
Insights from our health insurance POC
This is the second part of a two-part blog describing one of the POCs we’ve recently completed at Remitrix. In this POC we aimed to predict the number of claims per year in two health insurance products: Ambulatory insurance and Surgery insurance.
In the previous part of this blog (here), we focused on preparing for the modeling, looking at the data and taking a sneak peek at two example features. We also presented some of the questions we had to answer before we could start training models.
We ended the previous blog with the following:
 Ambulatory training data with 568,260 records (with 29,932 claims) and 28 possible features (number of records altered to enhance confidentiality)
 Surgery training data with 685,818 records (with 20,904 claims) and 30 possible features
 A decision to define the problem as a classification problem where we wish to predict if an observation will make at least one claim in a specific year or not
 An understanding that in order to train a powerful model we’ll need to downsample our negative labels
 An understanding that we can’t just look at high accuracy values, or even high precision and recall, but have to make sure that our overall predicted number of claims is accurate
 A decision to use bootstrap samples to get an estimate of the distribution of estimators and not just a single “lucky” estimator
With those guidelines in hand, we are ready to move on to the next phase.
Binary classification with XGBoost and some other tricks
Once we decided to use a binary classification model, the decision to use XGBoost (here) was almost immediate.
XGBoost is a decision tree boosting system based on the gradient boosting idea used in GBM (gradient boosting machines), but implemented more efficiently and with more regularization options, which help it avoid overfitting and improve its predictive power. It is widely used for various kinds of problems and often outperforms standard GBM and Random Forest.
However, in order to meet all of the above requirements we had to do more than just use all of our training examples to train the XGBoost model and hope for the best. This is the scheme we ended up following:
Perform B times:

 Train: create a bootstrap sample from the training data. Train N XGBoost classifiers on the bootstrap sample, each time randomly down-sampling the negative observations so that 20% of the observations are positive examples (i.e. observations with one or two claims) and 80% are negative examples (i.e. observations without claims). The down-sampling is needed because the claim rate in both products is very low (up to 5%).
 Each classifier produces a claim/no-claim prediction for each observation (after applying a threshold obtained as explained below), so that after N classifiers we have N predictions per observation. Take a majority vote per observation to obtain a single decision per observation. Finally, account for the existence of records with 2 claims by multiplying the number of predicted “claim” observations by a constant factor calculated as $$\frac{\text{number of claims}}{\text{number of unique observations with claims}}$$

 Adjust threshold: create a bootstrap sample from data that was set aside as “validation”. Then, for each of the N classifiers, randomly choose 50% of the records in the bootstrap sample and use them to adjust the “yes” threshold as follows: the model outputs a claim probability for each observation, and the cut-off above which a probability counts as “claim” is chosen so that the predicted claim rate on this subset is as close as possible to its real claim rate. Since we have N classifiers, we’ll have N “adjusting” data sets and N thresholds.
 Validate on the remaining 50% of validation: this data set is used to choose between the various models considered. Evaluate the performance of a model by predicting claim/no claim for the observations in this data set (prediction is done in the same manner as for the train set). Evaluation is done by looking at the difference between the real number of claims and the predicted number (and the resulting claim rate).
 Test on 2014: predict on the test set in the same manner used for the validation and train sets.
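To make the moving parts of this scheme more concrete, here is a minimal Python sketch of three of its steps: down-sampling the negatives to a 20% positive share, picking a threshold that matches the real claim rate, and the majority vote followed by the claims-per-claimant factor. This is not the code used in the POC. The function names (`downsample_negatives`, `adjust_threshold`, `predicted_claims`) are hypothetical, the probabilities are assumed to come from classifiers that were already trained, and only the Python standard library is used.

```python
import random


def downsample_negatives(labels, positive_share=0.2, rng=random):
    """Keep every positive and a random subset of negatives so that
    positives make up roughly `positive_share` of the returned indices."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    wanted_neg = round(len(pos) * (1 - positive_share) / positive_share)
    return pos + rng.sample(neg, min(len(neg), wanted_neg))


def adjust_threshold(probs, labels):
    """Return the cut-off whose predicted claim count (prob >= cut-off)
    is closest to the real number of claimants in this subset."""
    target = sum(labels)
    candidates = sorted(set(probs)) + [float("inf")]  # inf = predict no claims
    return min(candidates,
               key=lambda t: abs(sum(p >= t for p in probs) - target))


def predicted_claims(prob_rows, thresholds, claims_per_claimant):
    """prob_rows[k][i] is classifier k's claim probability for observation i.
    Majority-vote the thresholded predictions, then scale the number of
    flagged observations by the claims-per-claimant factor."""
    n_models = len(prob_rows)
    flagged = 0
    for i in range(len(prob_rows[0])):
        votes = sum(prob_rows[k][i] >= thresholds[k] for k in range(n_models))
        if 2 * votes > n_models:  # strict majority says "claim"
            flagged += 1
    return flagged * claims_per_claimant
```

In the full scheme these pieces would run inside the loop over B bootstrap samples, with one threshold per classifier; the sketch leaves out the model training itself.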
At the end of those B iterations we end up with B estimators of the number of claims for the validation and test sets. We use the mean of the estimators as the final estimator of the number of claims, and construct a 95% confidence interval around that estimator by taking the 2.5th and 97.5th percentiles of the estimators’ distribution.
The results of this scheme were very good, and improved on the classical actuarial approach by more than 7% in both products. Below, we can see the distribution of the B estimators for our test set in the surgery product. The x-axis is the number of predicted claims and the y-axis is the probability of getting the x-axis value according to the estimated distribution. The black dashed line represents the real number of claims, and its location – very close to the mode of the distribution – is evidence of how well the model fits our data. The values on the x-axis were removed to avoid disclosing the true number of claims.
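As a small illustration of that last step, the sketch below turns a list of B bootstrap estimates into a point estimate and a 95% percentile interval. The function name `summarize_estimates` is hypothetical and the numbers in the example are made up; `statistics.quantiles` with `n=40` is used because its first and last cut points are the 2.5th and 97.5th percentiles.

```python
import statistics


def summarize_estimates(estimates):
    """Return (mean, lower, upper), where lower and upper are the
    2.5th and 97.5th percentiles of the bootstrap estimates."""
    mean = statistics.fmean(estimates)
    cuts = statistics.quantiles(estimates, n=40)  # 39 cut points, 2.5% apart
    return mean, cuts[0], cuts[-1]


# Made-up predicted claim counts from B = 8 bootstrap iterations.
example = [1010, 985, 1002, 998, 1021, 990, 1007, 995]
mean, lower, upper = summarize_estimates(example)
```

With a realistic B (hundreds of iterations or more) the percentiles become far more stable than in this toy example.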
Looking inside the “black box”
Good results are great, but we want to understand what the models learned and what was used to achieve those results. The feature importance output can help us here: for each feature available in the training data, the algorithm outputs a number between 0 and 100 which reflects the relative importance of that feature for the modeling process, such that $$\sum_{i=1}^{p}importance_i=100$$, where p is the number of features. In our case, looking at the feature importance revealed that while the age of the insured and the start year of the policy are meaningful, they are not the only important features. In fact, some of the most important features were related to the policy’s claiming history. We saw that policies with a history of claims are more likely to claim again than policies with no past claims.
We can look, for example, at the below boxplots. Boxplots visualize continuous features through their quartiles (the lower edge of the box represents the 1st quartile, and the top edge represents the 3rd quartile). The median (2nd quartile) is indicated by the band inside the box. The lines extending vertically from the boxes (whiskers) indicate variability outside the upper and lower quartiles. Outliers are plotted as individual points.
In the below box plots we see the distribution of the square root of the average number of past claims among policies in the test data with a claim (orange) and without a claim (blue).
We’ll start with the orange box, which actually looks like a box: we can see that 25% of observations with a claim had no past claims (the lower edge of the box is on the 0 line). The median is ~0.5 and the 3rd quartile is ~0.9. The maximum average number of past claims is 4 (2^2, since the plot shows square roots). The blue box, on the other hand, is not really a box. All we can see is a line at 0 and then a couple dozen outlier points reaching a maximum of ~2.7. The “box” does not exist simply because 1st quartile = 2nd quartile = 3rd quartile = 0, meaning that at least 75% of the observations with no claims also didn’t have any claims in the past. Clearly, there is a huge difference between the two boxes, i.e. there is a huge difference in the number of past claims between policies that had a claim in the current year and those that didn’t. So, this feature can be predictive of whether a claim will be made by a policy.
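The “collapsed box” effect is easy to reproduce. The sketch below computes the three quartiles of the square-rooted feature for two groups; the function name `box_stats` and both data sets are made up for illustration and are not the POC data.

```python
import math
import statistics


def box_stats(avg_past_claims):
    """Quartiles (Q1, median, Q3) of the square root of the average
    number of past claims, i.e. the edges and band of the box."""
    roots = [math.sqrt(v) for v in avg_past_claims]
    q1, median, q3 = statistics.quantiles(roots, n=4, method="inclusive")
    return q1, median, q3


# Made-up group without a claim: 90% of policies have no past claims,
# so all three quartiles are 0 and the box collapses into a line.
no_claim = [0] * 18 + [0.25, 1.0]

# Made-up group with a claim: past claims are common, so the box has height.
with_claim = [0, 0, 0.25, 0.25, 0.81, 1.0, 4.0]

no_claim_box = box_stats(no_claim)
with_claim_box = box_stats(with_claim)
```

Whenever at least 75% of a group shares the same value, Q1, the median and Q3 coincide, and everything above that value is drawn as an outlier.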
Just like this feature, we had a few more informative features, and using them helped our model improve on the baseline actuarial solution.
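To rank such features against each other, the 0 to 100 importance scale described at the start of this section is convenient. A minimal sketch of rescaling raw importance scores so that they sum to 100 is shown below; the function name, the feature names and the raw scores are hypothetical and only illustrate the arithmetic.

```python
def normalize_importance(raw_scores):
    """Rescale non-negative raw importance scores so they sum to 100."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("at least one feature must have positive importance")
    return {name: 100 * score / total for name, score in raw_scores.items()}


# Made-up raw scores for three hypothetical features.
raw = {"age": 30, "policy_start_year": 20, "avg_past_claims": 50}
importance = normalize_importance(raw)
ranked = sorted(importance, key=importance.get, reverse=True)
```

Sorting by the rescaled score gives the ranking; the rescaling itself does not change the order, it only makes the numbers comparable across models.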
Lessons learned
There is no point in performing a POC without drawing some conclusions from it. This specific POC is no different, and we’ve learned some valuable lessons, which can be summarized by the following points:
 Know and understand your data – this is true for any ML problem, and the insurance world is no different. Without understanding what’s in the data you can’t understand why a model is working (or not working). In our case, the cleaning and preparation phase (on which we didn’t elaborate here) could not have worked without a deep understanding of the data and the ability to identify errors and mistakes that simply do not make sense.
 Use prior knowledge but not blindly – actuaries have been predicting the number of claims (and also their severity) for many years now, and they’ve been doing a good job at it as well. So why not use it? Don’t just throw away everything done to date and start from scratch. If you have prior knowledge – use it. However, don’t use it blindly. It is possible that features that work in the actuarial model will not work in an ML model, or will need some modifications. The decision when to use past knowledge as is, when to make modifications and when to leave it out completely goes back to point 1 – if you understand what your data measures and how, it will be much easier to decide what makes sense and what doesn’t.
 Don’t be afraid to innovate – as a direct continuation of point 2, just because you’re using prior knowledge and insights obtained from deep domain knowledge does not mean you shouldn’t try to innovate and introduce new methods and new features. In this POC we saw great predictive power in features based on the history of the policy and used them although this is not the common practice in the field. These features were a key part of the improved predictions.
 Always check your results and make sure they make sense and are in line with what you expected. If not, dig deep to understand if your initial assumptions were wrong (and why) or if your model is learning something it is not supposed to (do you have a leakage in the data? are all your features solid?).