While safety features in vehicles and enhanced traffic safety measures continue to evolve, vehicle collisions continue to be an unfortunate reality. These traffic accidents have many impacts on people, property and government agencies such as emergency services and health care.
According to data compiled by the City of Seattle, in the past year alone, there were over 7,000 collisions involving over 15,000 people and over 8,000 vehicles. The economic costs of these collisions are significant at over $151 million.
While collisions that result in property damage only are more prevalent, the economic cost of collisions that result in injury is much more significant. For example, in the past year, the 4,150 collisions that resulted in property damage only had an economic cost of 18.7 million US dollars, while the 2,169 collisions that resulted in injury had an economic cost of 60.3 million US dollars. Serious injuries and fatalities have an even higher economic cost per collision.
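To make the comparison concrete, the figures quoted above can be turned into an average cost per collision. This is only a back-of-the-envelope sketch that uses the two totals from the text; the variable and function names are illustrative.

```python
# Figures quoted in the text (past year, City of Seattle data).
property_damage = {"collisions": 4150, "cost_usd": 18.7e6}
injury = {"collisions": 2169, "cost_usd": 60.3e6}


def cost_per_collision(group):
    """Average economic cost of a single collision in the group."""
    return group["cost_usd"] / group["collisions"]


pdo_avg = cost_per_collision(property_damage)
injury_avg = cost_per_collision(injury)
ratio = injury_avg / pdo_avg

print(f"Property damage only: ${pdo_avg:,.0f} per collision")
print(f"Injury:               ${injury_avg:,.0f} per collision")
print(f"An injury collision costs about {ratio:.1f}x as much")
```

On these numbers an injury collision costs roughly 27,800 dollars on average against roughly 4,500 dollars for a property damage only collision, or about six times as much.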
The City of Seattle and Seattle health care agencies all have a vested interest in predicting the severity of collisions in order to inform their planning, budgeting and ultimately reduce the economic costs of collisions. Knowing which attributes are most relevant in determining severity and which machine learning model will provide the most accurate prediction of collision severity would be very helpful to these organizations.
The City of Seattle has made data collected around collisions since 2004 publicly available in keeping with government open data practices. While the City of Seattle classifies severity into five different categories, Coursera has used the data and divided the severity into two categories for prediction purposes: injury and property damage only. The data is located in a CSV file that can be found here.
The dataset includes details about 194,673 collisions where there was either property damage only or injury. It includes the following target variable of collision severity and 38 different columns of data associated with each collision.
Examining the dataset
The dataset contained many attributes that would obviously not be helpful in predicting collision severity. The dataset was examined in more detail in order to determine which attributes would be helpful in a machine learning model.
The dataset contained many columns with large numbers of missing values. However, before deciding what to do about those missing values, it would be helpful to know whether those columns are needed at all. For example, while the intersection key had 129,603 missing values, unique identifiers such as keys would not be valuable in a binary classification model, as they couldn’t be binned into a smaller number of categories.
In looking at the data types, the majority of the attributes were objects. Again, before converting attributes into integers for classification purposes, it would be helpful to do some preliminary analysis to see if the attribute itself would be helpful in a classification model.
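The kind of inspection described above can be sketched with pandas. The small dataframe below is a toy stand-in for the collisions file, and the column names (SEVERITYCODE, INTKEY, ADDRTYPE, SPEEDING) are assumptions about how the CSV is labelled rather than something stated in this write-up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the collisions data; column names are assumptions.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "INTKEY": [np.nan, 37475.0, np.nan, np.nan],
    "ADDRTYPE": ["Block", "Intersection", "Block", None],
    "SPEEDING": [None, "Y", None, None],
})

# Count missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Data type of each column; text columns are usually reported as "object".
print(df.dtypes)
```

With the real file, the same two calls would show which columns are mostly empty and which would need converting before classification.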
Removing unneeded attributes
There are obviously some attributes that will not be helpful in predicting collision severity using a classification model. These attributes include unique identifiers or keys, descriptive information or information that is too specific, as well as redundant columns. After removing these columns, there were 17 attributes left along with our target variable for further examination.
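Dropping columns of this kind is a one-line operation in pandas. The sketch below uses a toy dataframe; the column names and the particular columns dropped are assumptions chosen to illustrate an identifier, a key, a free-text field and a redundant description, not the exact list used in the study.

```python
import pandas as pd

# Toy stand-in; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 2],
    "SEVERITYDESC": ["Property Damage Only Collision", "Injury Collision"],
    "OBJECTID": [1, 2],
    "INTKEY": [None, 37475.0],
    "LOCATION": ["example location A", "example location B"],
    "ADDRTYPE": ["Block", "Intersection"],
})

# Identifier, key, overly specific text, and a column that duplicates the target.
columns_to_drop = ["OBJECTID", "INTKEY", "LOCATION", "SEVERITYDESC"]
df_reduced = df.drop(columns=columns_to_drop)

print(df_reduced.columns.tolist())
```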
Exploratory Data Analysis
Each remaining attribute was examined to look for trends, patterns, skewed information and correlation. This was done with the understanding that the purpose is to determine which attributes would be best to use in predicting collision severity.
The first variable to be examined was collision severity where a 2 in the column “SEVERITY CODE” corresponds to “Injury Collision” in the “SEVERITYDESC” column and 1 corresponds to “Property Damage Only.” Around 70% or a little more than two-thirds of the collisions in the dataset are property damage only. This means that we would expect, on average, that collisions with property damage only would be a little more than double those with injury collisions. As we examine our attributes, this needs to be kept in mind to look for instances where this ratio is uneven. This could indicate an attribute that would be helpful in determining severity.
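The baseline split can be checked with a value count. The series below is a ten-row toy example built to mirror the roughly 70/30 split described above; the column name SEVERITYCODE is an assumption.

```python
import pandas as pd

# Toy target column: 1 = property damage only, 2 = injury (as described above).
severity = pd.Series([1, 1, 1, 1, 1, 1, 1, 2, 2, 2], name="SEVERITYCODE")

counts = severity.value_counts()
shares = severity.value_counts(normalize=True)

# How many property-damage-only collisions there are per injury collision.
baseline_ratio = counts.loc[1] / counts.loc[2]

print(shares)
print(f"Baseline ratio: {baseline_ratio:.2f} to 1")
```

Any category of an attribute whose ratio differs clearly from this baseline is a candidate predictor of severity.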
Relationship between Collision Type and Remaining Attributes
Each attribute was examined separately to determine if there were obvious differences in the proportions of severity across the categories in the column. For example, for the attribute Address Type, the proportions show that whether a collision occurs at an intersection or within a block affects the severity of the collision. Thus, Address Type would be a good attribute to use in predicting collision severity.
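One way to compare severity proportions across the categories of an attribute is a row-normalized cross-tabulation. The data below is invented purely to show the mechanics, and the column names are assumptions.

```python
import pandas as pd

# Invented toy data: 6 block collisions, 4 intersection collisions.
df = pd.DataFrame({
    "ADDRTYPE": ["Block"] * 6 + ["Intersection"] * 4,
    "SEVERITYCODE": [1, 1, 1, 1, 1, 2, 1, 1, 2, 2],
})

# Each row sums to 1: share of each severity class within the address type.
proportions = pd.crosstab(
    df["ADDRTYPE"], df["SEVERITYCODE"], normalize="index"
)
print(proportions)
```

If the injury share differs noticeably between rows, as it does in this toy table, the attribute carries information about severity.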
All 17 attributes were examined in a similar manner. While many of the attributes examined would be useful in determining the likelihood (i.e. number) of collisions based on time, place, or environmental conditions, fewer attributes seem to impact collision severity. The following attributes and associated categories have been determined to have an impact on collision severity and will be used in various machine learning models to determine which one is best at predicting collision severity.
- Address Type: Block, Intersection
- Collision Type: Angles, Sideswipe, Parked Car, Other, Cycles, Rear Ended, Head On, Left Turn, Pedestrian, Right Turn
- Pedestrian Right of Way Not Granted
- Speeding
Pre-processing of the Featured Data
Within the chosen attribute columns, there are blank entries that need to be addressed. We also need to convert all categorical features to numerical values and balance the dataset.
For the attributes “Address Type” and “Collision Type” there are very few blank entries as a percentage of the overall dataset, so these rows will simply be eliminated. While there are many blank entries for “Pedestrian Right of Way Not Granted” and “Speeding”, this is because a value is only recorded when it is a factor. Thus all the blank entries can be assumed to be “N” as opposed to the entered “Y”. The categorical values “N” and “Y” are also replaced by the numerical values “0” and “1” for classification purposes.
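The cleaning steps just described might look like the following in pandas. The dataframe is a toy example and the column names (ADDRTYPE, COLLISIONTYPE, PEDROWNOTGRNT, SPEEDING) are assumptions about the CSV headers.

```python
import pandas as pd

# Toy stand-in; column names are assumptions.
df = pd.DataFrame({
    "ADDRTYPE": ["Block", None, "Intersection", "Block"],
    "COLLISIONTYPE": ["Angles", "Parked Car", None, "Rear Ended"],
    "PEDROWNOTGRNT": [None, "Y", None, None],
    "SPEEDING": ["Y", None, None, None],
})

# Drop the few rows with a blank address type or collision type.
df = df.dropna(subset=["ADDRTYPE", "COLLISIONTYPE"]).copy()

# Blank flags mean "not a factor": fill with "N", then encode N/Y as 0/1.
for column in ["PEDROWNOTGRNT", "SPEEDING"]:
    df[column] = df[column].fillna("N").map({"N": 0, "Y": 1})

print(df)
```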
One hot encoding was used to convert all the categorical values in “Address Type” and “Collision Type” to binary variables and then they were concatenated to the dataframe.
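A minimal sketch of the one hot encoding step, assuming pandas `get_dummies` is used and that the columns are named ADDRTYPE and COLLISIONTYPE (both are assumptions; the write-up does not name the function or the headers):

```python
import pandas as pd

# Toy stand-in; column names are assumptions.
df = pd.DataFrame({
    "ADDRTYPE": ["Block", "Intersection", "Block"],
    "COLLISIONTYPE": ["Angles", "Rear Ended", "Parked Car"],
    "SPEEDING": [0, 1, 0],
})

categorical = ["ADDRTYPE", "COLLISIONTYPE"]

# One binary column per category, e.g. ADDRTYPE_Block, COLLISIONTYPE_Angles.
dummies = pd.get_dummies(df[categorical], dtype=int)

# Replace the original text columns with their binary encodings.
features = pd.concat([df.drop(columns=categorical), dummies], axis=1)
print(features.columns.tolist())
```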
Balance target data
There are far more collisions that result in property damage only than in injury. Thus, a classification model trained on the raw data will be biased toward predicting property damage only. As our dataset is so large, we can correct this by undersampling. Undersampling randomly deletes rows from the property damage only observations so that we have equal numbers of property damage only and injury observations.
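Random undersampling can be sketched in a few lines of pandas. The toy data, the column names and the `random_state` value are all assumptions made so the example is reproducible.

```python
import pandas as pd

# Toy data: 7 property-damage-only rows (1) and 3 injury rows (2).
df = pd.DataFrame({
    "SEVERITYCODE": [1] * 7 + [2] * 3,
    "SPEEDING": [0, 0, 1, 0, 0, 0, 1, 1, 0, 1],
})

majority = df[df["SEVERITYCODE"] == 1]
minority = df[df["SEVERITYCODE"] == 2]

# Randomly keep only as many majority rows as there are minority rows.
majority_sampled = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority_sampled, minority])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["SEVERITYCODE"].value_counts())
```

The trade-off is that the discarded majority rows are no longer available for training, which is acceptable here only because the dataset is large.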
Predictive Classification Modeling
The dataset was split into training and testing sets to build an accurate machine learning model. The following classification algorithms were trained and tested:
- K Nearest Neighbor (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression
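The train/test split could be done with scikit-learn as sketched below. The write-up does not state the split ratio, so the 80/20 split, the `random_state` and the use of `stratify` are assumptions, and the feature matrix is random toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random binary features and a balanced toy target (1 or 2).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = np.array([1, 2] * 50)

# Hold out 20% for testing; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
```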
K Nearest Neighbor (KNN)
In KNN classification, a collision is classified as injury or property damage only according to the most common class among its k nearest neighbors in the training data. The KNN algorithm was run in a loop over different values of k to find the most accurate one. Accuracy ranged from 0.63 to 0.70 across the values of k tested, and k=11 gave the highest accuracy.
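A search over k of the kind described could look like this. It runs on synthetic data from `make_classification`, and the range of k (1 to 15) is an assumption, so the best k it reports has no bearing on the k=11 found in the study.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded collision features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Fit one model per candidate k and record its test accuracy.
accuracies = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, model.predict(X_test))

best_k = max(accuracies, key=accuracies.get)
print(f"Best k: {best_k} (accuracy {accuracies[best_k]:.2f})")
```

Choosing k on the test set, as this sketch does for brevity, gives a slightly optimistic accuracy estimate; a separate validation set or cross-validation would be a cleaner choice.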
Decision Tree
A decision tree splits the data on one attribute value at each internal node, and each leaf gives the predicted class (injury or property damage only). As there were four attributes (each with multiple categories), the max depth of the decision tree was limited to 4.
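A depth-limited tree can be fit with scikit-learn as in this sketch. Only `max_depth=4` comes from the text; the synthetic data and the other settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded collision features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Limit the tree to four levels of splits, as described in the text.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"Depth: {tree.get_depth()}, test accuracy: {accuracy:.2f}")
```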
Support Vector Machine
Support Vector Machine is a supervised learning model that is well suited to binary classification. It finds a hyperplane that separates the two target classes (injury and property damage only) with as wide a margin as possible, and then classifies a new collision according to which side of the hyperplane it falls on. The “linear”, “poly”, “rbf”, and “sigmoid” kernels were all tested to determine which would provide the most accurate classification. While they were all very similar, with accuracy scores between 0.696 and 0.699, the “rbf” kernel produced the most accurate model and was used to train the final model.
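The kernel comparison can be sketched as a loop over scikit-learn's `SVC`. The four kernel names come from the text; the synthetic data and default hyperparameters are assumptions, so the winning kernel here need not match the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the encoded collision features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

kernels = ["linear", "poly", "rbf", "sigmoid"]

# Train one SVM per kernel and record its test accuracy.
scores = {}
for kernel in kernels:
    model = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

best_kernel = max(scores, key=scores.get)
print(scores)
print("Best kernel:", best_kernel)
```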
Logistic Regression
Logistic regression calculates the probability of a collision resulting in injury and then classifies collisions whose probability exceeds a threshold (commonly 0.5) as injury collisions and the others as property damage only.
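The two stages, probability then class label, are both exposed by scikit-learn's `LogisticRegression`. The sketch below uses synthetic data with classes 0 and 1, and `max_iter=1000` is an assumption added only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in; class 1 plays the role of "injury".
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Estimated probability of class 1 for each test row.
probabilities = model.predict_proba(X_test)[:, 1]

# Class labels; for a binary problem these follow a 0.5 probability cut-off.
predictions = model.predict(X_test)

print(probabilities[:5].round(2), predictions[:5])
```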
Each model was evaluated using the Jaccard similarity score and the F1 score. The Jaccard similarity score measures the overlap between the labels predicted for the test set and the actual labels. The closer to 1.0, the better the match. The F1 score is the harmonic mean of precision and recall and also has a best value of 1.0.
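Both metrics are available in scikit-learn. The example below scores a small hand-made set of labels, treating class 2 (injury) as the positive class; that choice and the labels themselves are assumptions for illustration, and the write-up does not say which scikit-learn functions or averaging options were actually used.

```python
from sklearn.metrics import f1_score, jaccard_score

# Hand-made labels: 1 = property damage only, 2 = injury.
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 2, 1, 2, 1]

# Jaccard for class 2: TP / (TP + FP + FN) = 3 / (3 + 1 + 1).
jaccard = jaccard_score(y_true, y_pred, pos_label=2)

# F1 for class 2: harmonic mean of precision (3/4) and recall (3/4).
f1 = f1_score(y_true, y_pred, pos_label=2)

print(f"Jaccard: {jaccard:.2f}, F1: {f1:.2f}")
```

For these labels the Jaccard score is 0.60 and the F1 score is 0.75.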
The following is a table of the evaluation results of each model.
The classification models K Nearest Neighbor, Support Vector Machine and Logistic Regression all had the same Jaccard similarity scores and F1 scores of 0.70, while the decision tree model had lower scores. However, while those three models resulted in the same scores, the processing time required to compute each of them varied drastically. Support Vector Machine was very processing-time intensive, followed by K Nearest Neighbor. Logistic regression and decision tree, on the other hand, can be computed almost instantaneously. For this reason, I would recommend that the City of Seattle use the logistic regression model to predict whether a collision will result in injury or property damage only.
The City of Seattle and Seattle health care agencies all have a vested interest in predicting the severity of collisions in order to inform their planning, budgeting and ultimately reduce the economic costs of collisions. Through this study, the attributes most likely to influence collision severity were identified, and it was determined that a logistic regression model offered the best combination of speed and accuracy for predicting collision severity. However, in order to improve the effectiveness of the model, the City of Seattle might consider adding additional data that influences severity. For example, I would anticipate that the type of roadway itself would also influence severity. A collision on a low-speed residential side street would probably be less likely to cause injury than one on a high-speed expressway. This is different from speeding, which only tells you whether or not the driver exceeding the speed limit was a factor in the collision. Additional attributes could add significant value to the model by improving accuracy and precision.