Exploration of CAS (Crash Analysis System) data

Data retrieval

First we need to retrieve the dataset from the Open Data portal. Several file formats are available (csv, kml, geojson, ...); the csv file is the most compact.

Next we load the data and take a quick look to check that there are no obvious loading errors.
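A minimal sketch of such a loading check with pandas is given here. The helper name `load_crashes` and the tiny in-memory sample are illustrative assumptions, not the notebook's original code; with the real file, the path of the downloaded csv would be passed instead.

```python
import io

import pandas as pd


def load_crashes(source):
    """Read a CAS csv export (path, URL or file-like object) into a DataFrame."""
    # low_memory=False lets pandas infer each column's dtype from the whole file
    return pd.read_csv(source, low_memory=False)


# Tiny stand-in for the real file, for illustration only (made-up rows).
sample_csv = io.StringIO(
    "X,Y,crashYear,region\n"
    "174.76,-36.85,2018,Auckland Region\n"
    "174.78,-41.29,2019,Wellington Region\n"
)
crashes = load_crashes(sample_csv)

print(crashes.shape)         # (number of rows, number of columns)
print(crashes.dtypes)        # numeric columns should not have been read as text
print(crashes.isna().sum())  # missing values per column
```

On the full dataset, the first thing to compare is the column count against the number of documented fields.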

The dataset contains 72 columns, describing various aspects of the recorded car crashes. The full description of the fields is available online, see https://opendata-nzta.opendata.arcgis.com/pages/cas-data-field-descriptions.

Note that X and Y are geographical coordinates using the WGS84 coordinate system (see EPSG:4326).

Spatial features

First, we look at where the crashes occur. More accidents happen in denser areas, so it would be useful to compare crash locations with population density.

Note: We removed the Chatham Islands data here to ease plotting.
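One possible way to do this filtering is sketched below. It relies on the fact that the Chatham Islands lie east of the antimeridian, so their WGS84 longitude is negative while the rest of the country has a positive longitude. The notebook may have used a different criterion, and the coordinates are toy values.

```python
import pandas as pd

# Toy coordinates for illustration only (not rows from the dataset).
crashes = pd.DataFrame(
    {
        "X": [174.76, 172.64, -176.56],  # longitude, degrees
        "Y": [-36.85, -43.53, -43.95],   # latitude, degrees
    }
)

# Keep only points with a positive longitude, which drops the Chatham Islands.
mainland = crashes[crashes["X"] > 0]

print(len(crashes) - len(mainland), "row(s) dropped")
```

The filtered frame can then be plotted, for example with `mainland.plot.scatter(x="X", y="Y", s=1)` when matplotlib is installed.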

In dense areas such as Auckland, there are enough crash events to trace out the local road network.

At a coarser level, there is also the region information.
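A short sketch of a per-region summary, assuming the column is named `region` (as suggested by the field descriptions) and using made-up rows:

```python
import pandas as pd

# Toy rows for illustration only.
crashes = pd.DataFrame(
    {"region": ["Auckland Region", "Auckland Region", "Otago Region", None]}
)

# dropna=False keeps crashes without a region visible in the summary
region_counts = crashes["region"].value_counts(dropna=False)

print(region_counts)
```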

Temporal features

The dataset contains few temporal features:

So we won't be able to study daily, weekly or within-year patterns with these data.

If we look at the yearly counts, we can see some fluctuations, mostly driven by the Auckland region but still noticeable in other parts of the country. The count for 2020 is much lower because that year was still in progress when the data were retrieved.
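The yearly counts per region can be obtained with a group-by, as in this sketch. The column names `crashYear` and `region` are assumed from the field descriptions, and the rows are toy values.

```python
import pandas as pd

# Toy rows for illustration only.
crashes = pd.DataFrame(
    {
        "crashYear": [2018, 2018, 2019, 2019, 2019, 2020],
        "region": [
            "Auckland Region",
            "Otago Region",
            "Auckland Region",
            "Auckland Region",
            "Otago Region",
            "Auckland Region",
        ],
    }
)

# One row per year, one column per region; missing combinations become 0.
yearly = (
    crashes.groupby(["crashYear", "region"])
    .size()
    .unstack("region", fill_value=0)
)

print(yearly)
```

With matplotlib installed, `yearly.plot()` draws one line per region, which makes it easy to see which regions drive the overall fluctuations.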

We can also explore spatio-temporal patterns. Here we focus on Auckland (excluding 2020).

The other temporal attribute is the holiday period. Christmas is the holiday period with the most accidents. How the periods are computed is not clear, so the larger number of accidents could be partly due to the longer time span: Easter, Queens Birthday and Labour weekend are 3 to 4 day periods, whereas Christmas & New Year is probably a 1 to 2 week period.
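To illustrate how the length of each period could affect the comparison, the sketch below divides the crash counts by a nominal number of days per period. The durations in `assumed_days`, the holiday labels and the rows are all assumptions made for illustration; the dataset does not document the actual period lengths.

```python
import pandas as pd

# Toy rows for illustration only; None stands for "no holiday period".
crashes = pd.DataFrame(
    {
        "holiday": ["Christmas New Year"] * 6
        + ["Easter"] * 4
        + ["Labour Weekend"] * 3
        + [None] * 5
    }
)

# ASSUMPTION: nominal period lengths in days, not taken from the dataset.
assumed_days = pd.Series(
    {
        "Christmas New Year": 10,
        "Easter": 4,
        "Queens Birthday": 3,
        "Labour Weekend": 3,
    }
)

# value_counts drops missing values, so non-holiday crashes are excluded here.
counts = crashes["holiday"].value_counts()

# Division aligns on the holiday label; periods absent from the data become NaN.
per_day = (counts / assumed_days).dropna().sort_values(ascending=False)

print(per_day)
```

In this toy example, Christmas & New Year has the highest raw count but the lowest count per day, which shows why the raw comparison should be read with care.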

Road features

From the dataset fields description, the following features seem specific to the type of road:

Unfortunately, not all fields are actually available in the dataset.
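A sketch of how field availability could be checked: which documented columns are absent from the file, and which are present but entirely empty. The list `expected` and the toy frame are illustrative, not the actual list of road fields.

```python
import pandas as pd

# Toy frame for illustration only: one usable column, one entirely empty.
crashes = pd.DataFrame(
    {
        "speedLimit": [50, 100, 50],
        "roadLane": [None, None, None],
    }
)

# Illustrative list of fields taken from a description page (assumed names).
expected = ["speedLimit", "roadLane", "intersection"]

# Fields that are documented but absent from the file
missing_cols = [c for c in expected if c not in crashes.columns]

# Share of missing values for the fields that are present
present = [c for c in expected if c in crashes.columns]
empty_share = crashes[present].isna().mean()

print("absent:", missing_cols)
print(empty_share)
```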

The urban feature is derived from speedLimit, so we can probably remove it.
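Whether `urban` carries no information beyond `speedLimit` can be checked with a cross-tabulation, as sketched here on toy rows. The labels "Urban" and "Open" are assumed values.

```python
import pandas as pd

# Toy rows for illustration only.
crashes = pd.DataFrame(
    {
        "speedLimit": [50, 50, 100, 80, 30],
        "urban": ["Urban", "Urban", "Open", "Open", "Urban"],
    }
)

# Rows: speed limits, columns: urban labels, cells: number of crashes
table = pd.crosstab(crashes["speedLimit"], crashes["urban"])

# If every speed limit maps to a single label, urban is fully determined
# by speedLimit and can be dropped without losing information.
labels_per_limit = crashes.groupby("speedLimit")["urban"].nunique()
is_redundant = bool((labels_per_limit <= 1).all())

print(table)
print("urban redundant:", is_redundant)
```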

Environmental features

The environmental features are weather and sunshine:

Possible next steps

We have checked the spatial, temporal, road and environmental features related to the accidents.

Although these features tell us under which conditions relatively more accidents occur, we will need additional baseline information if we want to create a predictive model.

For the road features, we could use a LINZ dataset or another NZTA dataset that provides more information about road type and traffic. We would then need to attribute each crash to a road.

Another option would be to regrid the data and, for each cell containing at least one crash event, associate road features derived from those crash events. With this option, we make no predictions for grid cells that contain no crashes, since we have no information about them.
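The regridding idea can be sketched as follows. The cell size of 0.01 degrees and the choice of the median speed limit as the per-cell road feature are arbitrary assumptions, and the coordinates are toy values.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
crashes = pd.DataFrame(
    {
        "X": [174.761, 174.764, 174.792, 174.768],
        "Y": [-36.851, -36.853, -36.871, -36.858],
        "speedLimit": [50, 50, 100, 50],
    }
)

cell_size = 0.01  # degrees; ASSUMPTION, to be tuned to the desired resolution

# Integer cell index of each crash along both axes
cells = crashes.assign(
    cell_x=np.floor(crashes["X"] / cell_size).astype(int),
    cell_y=np.floor(crashes["Y"] / cell_size).astype(int),
)

# One row per non-empty cell: crash count and a summary road feature
grid = cells.groupby(["cell_x", "cell_y"]).agg(
    n_crashes=("X", "size"),
    speed_limit=("speedLimit", "median"),
)

print(grid)
```

Only cells with at least one crash appear in `grid`, which matches the limitation described above: empty cells have no features and get no prediction.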

For the environmental information, we need weather information for all days in a year:

The prediction task can be formulated in different ways:

  1. exclude weather & holiday features, and fit a regression model with count data using year & location features (and accounting for traffic volume to compare the number of crashes per car on the road),

  2. group data by location, time, weather type (e.g. rain vs. no rain), and perform a binomial regression using the total number of days in each category (e.g. number of rain days for a particular location & year),

  3. predict crash severity from the whole dataset, assuming the non-severe crashes are a good proxy for normal conditions (weather, holidays, etc.).
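As an illustration of the data preparation behind the first formulation, the sketch below builds a table of crash counts per year and region and joins a traffic-volume measure to serve as exposure. The `traffic` table and its `vehicle_km` column are hypothetical: the crash dataset contains no such information, so it would have to come from an external source. No model is fitted here.

```python
import numpy as np
import pandas as pd

# Toy crash rows for illustration only.
crashes = pd.DataFrame(
    {
        "crashYear": [2018, 2018, 2018, 2019, 2019],
        "region": ["Auckland Region"] * 3 + ["Otago Region"] * 2,
    }
)

# HYPOTHETICAL exposure table (made-up values, external to the crash data).
traffic = pd.DataFrame(
    {
        "crashYear": [2018, 2019],
        "region": ["Auckland Region", "Otago Region"],
        "vehicle_km": [3000.0, 1000.0],
    }
)

# Response variable: number of crashes per year and region
counts = (
    crashes.groupby(["crashYear", "region"])
    .size()
    .rename("n_crashes")
    .reset_index()
)

# Attach the exposure; its logarithm is the usual offset in a count regression
model_table = counts.merge(traffic, on=["crashYear", "region"], how="left")
model_table["log_exposure"] = np.log(model_table["vehicle_km"])
model_table["crash_rate"] = model_table["n_crashes"] / model_table["vehicle_km"]

print(model_table)
```

A count regression (for example a Poisson model) could then use `n_crashes` as the response and `log_exposure` as an offset, so that the coefficients describe crashes per unit of traffic rather than raw counts.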


Original computing environment