Exploration of the crash severity information in CAS data

In this notebook, we will explore the severity of crashes, as it will be the target of our predictive models.

But first, we ensure we have the data or download it if needed

and load it.

The CAS dataset has 4 features that can be associated with the crash severity:

To check the geographical distribution, we will focus on Auckland and replace the discrete levels of crashSeverity with numbers to ease plotting.
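As a minimal sketch of this recoding step, the snippet below maps severity labels to ordinal codes. The label strings and the `encode_severity` helper are assumptions made for illustration; the actual values of crashSeverity in the CAS data may be spelled differently.

```python
# Hypothetical severity labels (assumption): check the real values in the data.
SEVERITY_LEVELS = {
    "Non-Injury Crash": 0,
    "Minor Crash": 1,
    "Serious Crash": 2,
    "Fatal Crash": 3,
}


def encode_severity(labels):
    """Map each severity label to its ordinal code (None for unknown labels)."""
    return [SEVERITY_LEVELS.get(label) for label in labels]


sample = ["Minor Crash", "Non-Injury Crash", "Fatal Crash"]
codes = encode_severity(sample)
print(codes)
```

With a pandas DataFrame, the same dictionary could be passed to `Series.map` on the crashSeverity column.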

Given the data set imbalance, we plot the local maxima to better see the location of more severe car crashes.

A few remarks coming from these plots:

The crash severity is probably a good go-to target, as it's quite interpretable and actionable. The corresponding ML problem is a supervised multi-class prediction problem.

To simplify the problem, we can also just try to predict whether a crash is going to involve an injury (minor, severe or fatal) or none. Here is what it would look like in Auckland.

Interestingly, the major axes do not pop up as saliently here, as we are averaging instead of taking the local maxima.
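The difference between the two aggregations can be illustrated on toy data. The sketch below is not the plotting code of this notebook: the ordinal codes, the two example locations and the helper names are all made up for illustration.

```python
# Hypothetical ordinal codes: 0 = no injury, 1 = minor, 2 = serious, 3 = fatal.
busy_road = [0] * 9 + [3]  # many crashes, a single fatal one
quiet_street = [1, 0]      # two crashes, one with a minor injury


def max_severity(codes):
    """Local maximum: a single severe crash dominates the cell."""
    return max(codes)


def injury_fraction(codes):
    """Local average of the binary target (injury vs. no injury)."""
    return sum(code > 0 for code in codes) / len(codes)


for name, codes in [("busy road", busy_road), ("quiet street", quiet_street)]:
    print(name, max_severity(codes), injury_fraction(codes))
```

In this toy case the busy road has the highest maximum severity but the lowest injury fraction, which is consistent with major axes standing out under the maximum and fading under the average.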

This brings us to another question: is the number of crashes with injuries a constant fraction of the number of crashes in an area? This would imply that a simple binomial model can describe locally binned data.

We first discretize space into 0.01° wide cells and count the total number of crashes in each cell as well as the number of crashes with injuries.
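One possible way to do this binning is sketched below in plain Python. The input format (tuples of longitude, latitude and an injury flag) and the `bin_crashes` helper are assumptions for illustration, and the coordinates are invented points roughly in the Auckland area.

```python
import math
from collections import defaultdict

CELL_SIZE = 0.01  # cell width in degrees, as stated in the text


def bin_crashes(crashes, cell_size=CELL_SIZE):
    """Count crashes and injury crashes per grid cell.

    `crashes` is an iterable of (longitude, latitude, has_injury) tuples.
    Returns {(lon_index, lat_index): [n_crashes, n_injuries]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for lon, lat, has_injury in crashes:
        # Integer cell indices; floor also behaves correctly for negative latitudes.
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        counts[cell][0] += 1
        counts[cell][1] += int(has_injury)
    return dict(counts)


# Made-up coordinates, not taken from the CAS data.
toy_crashes = [
    (174.763, -36.848, True),
    (174.768, -36.842, False),
    (174.781, -36.848, False),
]
cells = bin_crashes(toy_crashes)
print(cells)
```

With pandas, a similar result could be obtained by flooring the scaled coordinates and using `groupby` on the two resulting index columns.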

For each number of crashes in cells, we can check the fraction of crashes with injuries. Here we see that cells with one or a few crashes have a nearly 50/50 chance of injuries, compared to cells with a larger number of accidents, where it goes down to about 20%.

Then we can also check how good a binomial distribution is at modeling binned data, using it to derive a 95% predictive interval.
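As a sketch of how such an interval can be derived, the code below computes a central predictive interval for the number of injury crashes k in a cell with n crashes, assuming k follows a Binomial(n, p) distribution. How p is estimated in this notebook is not shown here; using a single pooled injury fraction is an assumption, and `binom_interval` is a hypothetical helper written with the standard library only.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space (requires 0 < p < 1)."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def binom_interval(n, p, level=0.95):
    """Central interval [lo, hi] from the lower and upper binomial quantiles."""
    alpha = (1 - level) / 2
    cdf, lo, hi = 0.0, None, n
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)
        if lo is None and cdf >= alpha:
            lo = k  # smallest k with CDF >= alpha
        if cdf >= 1 - alpha:
            hi = k  # smallest k with CDF >= 1 - alpha
            break
    return lo, hi


print(binom_interval(10, 0.5))
```

If SciPy is available, `scipy.stats.binom.interval` provides the same kind of quantile-based interval without the explicit loop.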

The predictive interval seems to have poor coverage, overshooting the high-count regions and being too narrow for the regions with hundreds of crashes. We can compute the empirical coverage of these intervals to check this.
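Empirical coverage is the share of cells whose observed number of injury crashes falls inside its predictive interval; for a well-calibrated 95% interval it should be close to 0.95. The sketch below assumes the interval bounds have already been computed for each cell. The `empirical_coverage` helper, the bucketing threshold and the rows are all invented for illustration and are not results from the CAS data.

```python
from collections import defaultdict


def empirical_coverage(rows, bucket=None):
    """Fraction of cells whose injury count lies within [lo, hi].

    `rows` is an iterable of (n_crashes, n_injuries, lo, hi) tuples.
    `bucket` optionally maps n_crashes to a group label, so that
    coverage is reported per group instead of overall.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n_crashes, n_injuries, lo, hi in rows:
        key = bucket(n_crashes) if bucket else "all"
        totals[key] += 1
        hits[key] += int(lo <= n_injuries <= hi)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up rows: (n_crashes, n_injuries, lo, hi).
toy_rows = [
    (1, 0, 0, 1),
    (2, 1, 0, 2),
    (3, 3, 0, 2),
    (200, 20, 30, 52),
    (300, 70, 48, 73),
]
overall = empirical_coverage(toy_rows)
by_size = empirical_coverage(toy_rows, bucket=lambda n: "large" if n >= 100 else "small")
print(overall, by_size)
```

Reporting coverage per group matters because an overall figure can hide poor calibration in the few cells with many crashes.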

So it turns out that on a macro scale, the coverage of this simple model is quite good, but if we split by number of crashes, the coverage isn't so good anymore for the cells with a higher number of crashes.

Hence, including the number of crashes in a vicinity could be a relevant predictor for the probability of a crash with injury.


Original computing environment