My Introduction to Machine Learning Models

Henry Alpert
5 min read · Jan 4, 2021
Photo by George Huffman on Unsplash

Say an accident occurred between three vehicles on a September morning. Roads were dry, and the posted speed limit was 35 mph. One of the drivers was a 43-year-old woman. Would this situation lead to a serious accident?

Being able to answer a question such as this one was the goal of a project I completed for Flatiron School. I’d learned about a dozen different machine learning models, and essentially, the project was a way for me to practice working with them. (GitHub link here.)

I used accident data maintained by the City of Chicago and merged it with another Chicago dataset covering the people involved in those accidents, both drivers and passengers. After exploring the data and eliminating outliers, columns with too many empty values, and columns I deemed not relevant for my purposes, I ultimately ended up with a number of predictors: the posted speed limit, the weather, the number of vehicles involved in the incident, and the accident time (hour, day of week, and month), among others. I also had the ages and sexes of the drivers.
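
Here's a rough sketch of that merge-and-clean step. The file names, the join key, and the 50% missing-value cutoff are illustrative assumptions, not the exact choices from the project:

```python
import pandas as pd

# Illustrative file names and join key -- the real Chicago datasets use their own.
crashes = pd.read_csv("traffic_crashes.csv")
people = pd.read_csv("crash_people.csv")

# Attach each person (driver or passenger) to the crash they were involved in.
df = crashes.merge(people, on="CRASH_RECORD_ID", how="inner")

# Drop columns that are mostly empty (here, more than half missing).
df = df.loc[:, df.isna().mean() < 0.5]
```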

Every accident was classified into one of six degrees of severity, and I made the target of my modeling "serious accidents," a composite of those accidents that had resulted in a fatality or an incapacitating injury.
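
Creating that binary target might look something like the sketch below; the column name and the two label strings are assumptions standing in for the dataset's actual severity field:

```python
# Collapse the six-level severity scale into a binary target.
# "MOST_SEVERE_INJURY" and the two labels are placeholders for the real field values.
serious_labels = ["FATAL", "INCAPACITATING INJURY"]
df["serious"] = df["MOST_SEVERE_INJURY"].isin(serious_labels).astype(int)
```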

The data showed that accidents involving two vehicles had a lower percentage of serious outcomes than those involving one vehicle or three or more.

It also showed that serious accidents had a higher percentage of male drivers and a lower percentage of female drivers than non-serious accidents did.
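
Comparisons like these can be pulled together with a couple of pandas one-liners. This assumes the df and serious columns from the sketches above; NUM_UNITS and SEX are stand-in names for the vehicle-count and driver-sex fields:

```python
# Share of serious accidents by number of vehicles involved.
print(df.groupby("NUM_UNITS")["serious"].mean() * 100)

# Driver sex breakdown within serious vs. non-serious accidents.
print(pd.crosstab(df["serious"], df["SEX"], normalize="index") * 100)
```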

Starting the Modeling Process

After I had the data I wished to work with, I transformed the categorical columns, such as weather conditions and the month of the crash, into their own indicator columns with pandas' get_dummies function so the models could quantify and compare them. The dataset was also imbalanced: only 1.85% of accidents were classified as "serious accidents."
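
A minimal version of that encoding step, assuming the df and serious columns from above and illustrative column names:

```python
# One-hot encode the categorical predictors (column names are illustrative).
categorical_cols = ["WEATHER_CONDITION", "CRASH_MONTH", "CRASH_DAY_OF_WEEK"]
X = pd.get_dummies(df.drop(columns=["serious"]), columns=categorical_cols, drop_first=True)
y = df["serious"]
```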

Modeling is difficult with data this imbalanced, because the models would be trained almost entirely on accidents classified as not serious. When I later introduced new data for predictions, they would have little basis for recognizing a serious accident.

So after splitting the data into training and testing sets, I used SMOTE (Synthetic Minority Over-sampling Technique) to re-balance the training set, which I then used for modeling.
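
The important detail is to split first and oversample only the training set, so the test set keeps its real-world class balance. A sketch with imblearn (the split proportion and random seeds are assumptions):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first; the test set keeps the original 1.85% rate of serious accidents.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Oversample the minority class only in the training set.
X_train_sm, y_train_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)
```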

Initial Model Assessment

When assessing the initial models, I chose to prioritize recall instead of accuracy in order to catch as many serious accidents as possible, even if some non-serious accidents ended up being misclassified as serious; the idea was to maximize public safety. (Recall is calculated by dividing True Positives by the sum of True Positives and False Negatives.)
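
A tiny worked example with scikit-learn's recall_score (the labels are made up purely to illustrate the calculation):

```python
from sklearn.metrics import recall_score

# Recall = TP / (TP + FN): of all truly serious accidents, how many did we catch?
y_true = [1, 1, 1, 0, 0]   # three accidents are actually serious
y_pred = [1, 0, 0, 0, 0]   # the model flags only one of them
print(recall_score(y_true, y_pred))  # 1 / (1 + 2) = 0.33...
```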

Also, when comparing training and testing metrics, I looked to avoid models that overfit the training data. Overfitting is when a model fits the training data so closely that it has difficulty classifying new data. An indicator of overfitting is a test score that differs sharply from the training score.
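
In practice that check is just a side-by-side comparison, as in this sketch (clf stands for any fitted classifier, and the splits come from the earlier sketches):

```python
from sklearn.metrics import recall_score

# A large gap between the two numbers suggests overfitting.
train_recall = recall_score(y_train_sm, clf.predict(X_train_sm))
test_recall = recall_score(y_test, clf.predict(X_test))
print(f"train recall: {train_recall:.2f}  test recall: {test_recall:.2f}")
```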

Here is a chart of my initial models.

Model Refinement

After the initial modeling process, I chose my top three models to refine, but with the Logistic Regression model I couldn't get much movement in the metrics, no matter which new parameters I chose.
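
The tuning itself looked roughly like the sketch below; the parameter grid is an illustrative one, not the exact values I tried:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search a small grid of regularization strengths, scoring on recall.
log_grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="recall",
    cv=5,
)
log_grid.fit(X_train_sm, y_train_sm)
print(log_grid.best_params_, log_grid.best_score_)
```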

With Naïve Bayes, "naive" classification assumes that each feature is independent of the others. You can then estimate the overall probability of reaching the target (or not) by multiplying the conditional probabilities for each of the independent features. I ran several models and used scikit-learn's GridSearchCV to facilitate experimentation with various parameters. As already mentioned, I had decided to de-emphasize false positives, but my best model still predicted too many non-serious accidents as serious to be useful. In other words, it had too many false positives. Here's its confusion matrix:
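
(The matrix itself can be generated along these lines; GaussianNB and the var_smoothing grid below are an illustrative reconstruction, not the exact setup.)

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

# GaussianNB has few knobs; var_smoothing is the usual one to search over.
nb_grid = GridSearchCV(
    GaussianNB(),
    param_grid={"var_smoothing": [1e-9, 1e-7, 1e-5, 1e-3]},
    scoring="recall",
    cv=5,
)
nb_grid.fit(X_train_sm, y_train_sm)

# Rows are actual classes, columns are predictions; the top-right cell holds the false positives.
print(confusion_matrix(y_test, nb_grid.predict(X_test)))
```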

For my Random Forest model, I again ran around eight or nine variations, often using GridSearchCV. I determined that one model had the best recall with the least overfitting. (Its parameters were a maximum depth of two, ten maximum features, and five trees in the forest, a.k.a. n_estimators.) Despite this model being my "best" one, it had a low recall score and an overfitting problem.
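
That best-performing configuration corresponds to something like this (the random seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# Shallow trees and a small forest: max_depth=2, max_features=10, n_estimators=5.
rf = RandomForestClassifier(
    max_depth=2, max_features=10, n_estimators=5, random_state=42
)
rf.fit(X_train_sm, y_train_sm)
```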

Still, I thought it was interesting to see the most important features in this model:
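
Feature importances like these come straight from the fitted forest; here's a sketch using the rf model and feature columns from the earlier snippets:

```python
import pandas as pd

# Rank predictors by the forest's impurity-based importance scores.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```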

If the Project Continued

Although the project and my model exploration eventually had to come to an end, if I had more time to spend, I would run all the models again with my dataset and again choose a handful of them to refine. This time, I would emphasize a different metric, most likely F1, which balances recall with precision (it is their harmonic mean). I would also try parameters to minimize overfitting, an issue just about all my models had.
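
Switching metrics would be a small change in code: scikit-learn exposes F1 directly, and a grid search can optimize for it instead of recall. The numbers below are made up just to show the calculation:

```python
from sklearn.metrics import f1_score

# F1 = 2 * (precision * recall) / (precision + recall)
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]   # precision = 2/3, recall = 2/3 -> F1 = 2/3
print(f1_score(y_true, y_pred))

# In a grid search, this is just scoring="f1" instead of scoring="recall".
```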

Even though I did not get the results I wanted from this project, it was still a worthwhile introduction to machine learning models.
