Group 20
Dhruv Samdani, Devashru Patel, Eric Ricci, Thomas Malchodi
Road incidents are one of the leading causes of death worldwide. Automobile companies constantly innovate on their safety mechanisms to minimize the risk of injuries and fatalities. Often these efforts focus on mitigating damage once an accident has already occurred, for example through improved air bags. However, it is also important to analyze the factors that contribute to fatal accidents, such as driver behavior and environment. The goal of this project is to identify the components that are reliable predictors of accidents and use them to predict injury level.
The data set used in this analysis comes from the Fatality Analysis Reporting System (FARS) run by the National Highway Traffic Safety Administration (NHTSA). The data is collected on crashes that occur throughout the US and includes all crashes resulting in injury or fatality in 2018.
The full data set includes a wealth of information on outcomes for each vehicle and person involved in a crash, as well as a number of variables relating to the crash and the emergency response. In total, it includes over 80,000 data points, each representing a person involved in a crash, and 61 input features. The features range from type of road to collision orientation and many others. The outcome for each individual has 5 possible levels of injury: "No Apparent Injury", "Possible Injury", "Suspected Minor Injury", "Suspected Serious Injury", and "Fatal Injury". We hope to analyze the features present in the data and determine which factors are most likely to predict this level of injury.
Factors that occurred after the crash, such as emergency response time and time of death, were ignored. This decision was made in part because it would be trivial to determine the injury level from features such as whether or not the person was taken to the hospital or the time of death. Such features also do not help accomplish the goal of determining how to improve safety precautions in vehicles, so we excluded all features that occurred as a result of the crash. This reduced the number of features to 17, and when incomplete data was trimmed from the set, the final data had approximately 60,000 data points. The final change was grouping injury severity into 3 outcomes, "No Injury", "Injury", and "Fatality", as five categories would lead to poor prediction accuracy.
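The grouping of injury levels can be sketched as a simple lookup. This is a minimal illustration, not the project's actual code: the names `GROUPED_OUTCOME` and `group_injury` are hypothetical, and assigning all three intermediate levels to "Injury" is an assumption, since the exact grouping is not spelled out here.

```python
# Hypothetical mapping from the five injury levels to three grouped outcomes.
# Assumption: every intermediate level is treated as "Injury".
GROUPED_OUTCOME = {
    "No Apparent Injury": "No Injury",
    "Possible Injury": "Injury",
    "Suspected Minor Injury": "Injury",
    "Suspected Serious Injury": "Injury",
    "Fatal Injury": "Fatality",
}


def group_injury(levels):
    """Map each original injury label to its grouped outcome."""
    return [GROUPED_OUTCOME[level] for level in levels]


sample = ["Fatal Injury", "Possible Injury", "No Apparent Injury"]
print(group_injury(sample))  # ['Fatality', 'Injury', 'No Injury']
```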
Preliminary data analysis focused on selecting the relevant factors, both to remove extraneous ones and to improve the runtime of model training. Unsupervised methods included mutual information, which was used to select the most relevant columns of data. Categorical factors were expanded with binary dummy variables to allow for inclusion of all possible factors. Since the majority of features were categorical, this expansion increased the number of columns, so PCA was also applied to reduce dimensionality and speed up model training. Supervised learning methods were then used on the final data set to predict the injury level of people in the test data set. Methods tested include decision trees, neural networks, SVM, and random forests to classify each person by injury level.
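To illustrate why dummy-variable expansion increases dimensionality, here is a minimal sketch of the expansion for a single categorical column. The helper `one_hot` and the `road_type` values are hypothetical and are not taken from the FARS data.

```python
def one_hot(values):
    """Expand a categorical column into binary dummy columns.

    Returns the sorted category names and one row of 0/1 flags per value.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical column: one categorical feature becomes three binary columns.
road_type = ["rural", "urban", "rural", "highway"]
categories, rows = one_hot(road_type)
print(categories)  # ['highway', 'rural', 'urban']
print(rows)        # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

A feature with many categories adds one column per category, which is the growth that a dimensionality reduction step such as PCA is meant to offset.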
The above plot shows the normalized mutual information between each of the dataset features and the injury severity. The top 7 features are pictured out of 16 tested. This shows that the 3 best features for determining the injury level are the first harmful event in a crash, the use of protection by the person, and whether the person was fully or partially ejected from a vehicle. This suggests that, in order to prevent injury in crashes, focus should be placed on making sure that all passengers are using appropriate protection and on redesigning vehicles to ensure that occupants are not ejected.
The above chart shows the relationship between injury severity and a feature that has high mutual information. Outcomes differ markedly between people who were ejected and those who were not.
This chart shows the relationship between injuries and a low mutual information feature. The proportions of each outcome are much more similar across its categories than they are for a high-information feature.
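The normalized mutual information scores discussed above can be sketched for two categorical columns as follows. This is an illustration under stated assumptions: the normalization used in the project is not specified, so the sketch divides by the arithmetic mean of the two entropies (one common convention), and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information (in nats) between two categorical columns."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a, b) / (p(a) * p(b)) simplifies to c * n / (count_x[a] * count_y[b]).
    return sum((c / n) * log(c * n / (count_x[a] * count_y[b]))
               for (a, b), c in count_xy.items())


def normalized_mi(x, y):
    """Mutual information scaled by the mean entropy (an assumed convention)."""
    mean_entropy = (entropy(x) + entropy(y)) / 2
    return mutual_information(x, y) / mean_entropy if mean_entropy > 0 else 0.0


# Toy data: the first feature determines the outcome; the second is independent.
outcome = ["Fatality", "Fatality", "No Injury", "No Injury"]
ejected = ["yes", "yes", "no", "no"]
weekday = ["mon", "tue", "mon", "tue"]
print(normalized_mi(ejected, outcome))
print(normalized_mi(weekday, outcome))
```

In this toy case the first score is at the top of the scale and the second is zero, which mirrors the contrast between the high- and low-information charts.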
Each model was trained and tested using 5-fold cross-validation to assess its effectiveness at predicting the injury level of a person. With 3 injury levels, the highest prediction accuracy was 72%, using a linear SVM with a K value of 16. A K value of 12 is considered optimal in this case, as there is very little increase in any performance metric with higher K values, while model training time increases substantially. At this value, prediction accuracy, precision, recall, and F1-score are all 0.71. Each model was tested with a number of different parameters to determine the best set, as described below. Given the large number of factors involved in a crash and the level of randomness in how injuries occur, these levels of prediction accuracy are considered high.
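The 5-fold procedure can be sketched as an index split in which every sample lands in exactly one test fold. This is a minimal illustration rather than the project's code: `k_fold_indices` is a hypothetical helper, and its `n_folds` parameter is the number of folds, which is unrelated to the K value discussed above.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 10 samples and 5 folds, each round trains on 8 and tests on 2.
sizes = [(len(train), len(test)) for train, test in k_fold_indices(10)]
print(sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```

A model would be fit on each `train` list and scored on the matching `test` list, with the five scores averaged to give the reported metric.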
The confusion matrix shows the ability of our model to predict the correct injury level of a person involved in a crash.
The above chart shows the influence of K values on the model.
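The confusion matrix and the precision, recall, and F1 figures reported above can be computed from predictions as sketched below. The averaging scheme is not stated, so macro averaging (an unweighted mean over the three classes) is an assumption here, and the function names and toy predictions are hypothetical.

```python
LABELS = ["No Injury", "Injury", "Fatality"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def macro_scores(matrix):
    """Return macro-averaged (precision, recall, F1) from a confusion matrix."""
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        true_pos = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column total
        actual = sum(matrix[i])                          # row total
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / total if total else 0.0)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy predictions, not results from the project.
y_true = ["No Injury", "No Injury", "Injury", "Injury", "Fatality", "Fatality"]
y_pred = ["No Injury", "Injury", "Injury", "Injury", "Fatality", "Injury"]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)  # [[1, 1, 0], [0, 2, 0], [0, 1, 1]]
print(macro_scores(matrix))
```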