Investigating Chicago Traffic Crashes with RSocrata

library(tidyverse)
library(RSocrata)
library(knitr)

The incentive for this article was to try out the RSocrata package since it allows you to access a wealth of open data resources from governments, non-profits, and NGOs around the world. On top of that, it never hurts to practice using the tidyverse functions.

The dataset I will use shows information about each traffic crash on city streets within the City of Chicago limits and under the jurisdiction of Chicago Police Department.

We will download the last two years of data to explore. (Dont be alarmed if this next chunk of code takes a little longer to run. It is because we are scraping over 500,000 data points)

years_ago <- lubridate::today() - lubridate::years(2)
crash_url <- glue::glue("https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if?$where=CRASH_DATE > '{years_ago}'")
crash_raw <- as_tibble(read.socrata(crash_url))

Here is a look at the dataset:

crash <- crash_raw %>%
  arrange(desc(crash_date)) %>%
  transmute(
    injuries = if_else(injuries_total > 0, "injuries", "none"),
    crash_date,
    crash_hour,
    report_type = if_else(report_type == "", "UNKNOWN", report_type),
    num_units,
    posted_speed_limit,
    weather_condition,
    lighting_condition,
    roadway_surface_cond,
    first_crash_type,
    trafficway_type,
    prim_contributory_cause,
    latitude, longitude
  ) %>%
  na.omit()

head(crash) %>% kable()
injuries crash_date crash_hour report_type num_units posted_speed_limit weather_condition lighting_condition roadway_surface_cond first_crash_type trafficway_type prim_contributory_cause latitude longitude
none 2021-06-03 01:38:00 1 ON SCENE 3 30 CLEAR DARKNESS, LIGHTED ROAD UNKNOWN PARKED MOTOR VEHICLE ONE-WAY UNABLE TO DETERMINE 41.90705 -87.75947
none 2021-06-03 01:20:00 1 ON SCENE 3 30 CLEAR DARKNESS, LIGHTED ROAD DRY PARKED MOTOR VEHICLE NOT DIVIDED UNABLE TO DETERMINE 41.90159 -87.76578
none 2021-06-03 00:50:00 0 ON SCENE 2 25 CLEAR DARKNESS, LIGHTED ROAD DRY ANGLE T-INTERSECTION DISREGARDING STOP SIGN 41.88681 -87.67694
injuries 2021-06-03 00:25:00 0 ON SCENE 1 30 CLEAR DARKNESS, LIGHTED ROAD DRY FIXED OBJECT DIVIDED - W/MEDIAN (NOT RAISED) PHYSICAL CONDITION OF DRIVER 41.88621 -87.74077
none 2021-06-03 00:10:00 0 ON SCENE 2 30 CLEAR DARKNESS DRY TURNING NOT DIVIDED UNABLE TO DETERMINE 41.75073 -87.70161
injuries 2021-06-03 00:01:00 0 ON SCENE 3 30 CLEAR DARKNESS, LIGHTED ROAD DRY ANGLE FOUR WAY DISREGARDING TRAFFIC SIGNALS 41.87491 -87.68394

The dataset has the date and time of crash, speed limit, weather condition, lighting condition, the condition of the road, cause, location, and whether there were injuries or not. Let’s see how the number of crashes has changed over time.

crash %>%
  mutate(crash_date = lubridate::floor_date(crash_date, unit = "week")) %>%
  count(crash_date, injuries) %>%
  filter(
    crash_date != last(crash_date),
    crash_date != first(crash_date)) %>%
  ggplot(aes(crash_date, n, color = injuries)) +
  geom_line(size = 1.5, alpha = 0.7) +
  scale_y_continuous(limits = (c(0, NA))) +
  labs(
    x = NULL, y = "Number of traffic crashes per week",
    color = "Injuries?"
  )

The first thing that stands out is the impact of the global pandemic. There was a significant drop in the number of crashes per week at the beginning of 2020. After the fall, the number of crashes with injuries rebounded, but still remains below its pre-pandemic levels.

Since we see such a change in crashes with injuries and a smaller impact on non-injury crashes, it might be interesting to see how the injury rate has changed over time.

crash %>%
  mutate(crash_date = lubridate:: floor_date(crash_date, unit = 'week')) %>%
  count(crash_date, injuries) %>%
  filter(
    crash_date != last(crash_date),
    crash_date != first(crash_date)) %>%
  group_by(crash_date) %>%
  mutate(percent_injury = n/sum(n)) %>%
  ungroup() %>%
  filter(injuries == 'injuries') %>%
  ggplot(aes(crash_date, percent_injury)) +
  geom_line(size=1.5, alpha=0.7, color = 'midnightblue') +
  scale_y_continuous(limits = c(0,NA), labels = scales::percent_format()) +
  labs(x=NULL, y='Percent of crashes that involve injuries')

Suprisingly, the injury rate seemed to have increased during the pandemic. I was not expecting that. I wonder what caused that.

Have your parents ever told you to be careful when driving on the weekend because drivers are more reckless? Let’s see how true that is in this dataset. We will check how the injury rate changes over the course of the week.

crash %>%
  mutate(crash_date = lubridate::wday(crash_date, label = TRUE)) %>%
  count(crash_date, injuries) %>%
  group_by(injuries) %>%
  mutate(percent = n/sum(n)) %>%
  ungroup() %>%
  ggplot(aes(percent, crash_date, fill = injuries)) +
  geom_col(position = 'dodge', alpha=0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x='Percent of Crashes', y= NULL, fill = 'Injuries?') +
  coord_flip()

Friday and Saturday seem to have the highest injury rate, but the difference with the rest of the weekdays is not very large. Even though there might be some validity to my parents’ warning, the data does not reveal a significant relationship.

By the same token, my parents say that most crashes occur at night. Let’s check if this is reflected here.

crash %>%
  count(lighting_condition, injuries) %>%
  group_by(injuries) %>%
  mutate(percent = n/sum(n)) %>%
  ungroup() %>%
  ggplot(aes(percent, lighting_condition, fill = injuries)) +
  geom_col(position = 'dodge', alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x='Percent of Crashes', y= NULL, fill = 'Injuries?')

Predictably, the largest portion of crashes, both with and without injuries, take place during daylight. This makes sense since most car usage is during daylight.

Another thing we can look at is how injuries vary with the first crash type. Do certain type of crashes lead to more injuries?

crash %>%
  count(first_crash_type, injuries) %>%
  mutate(first_crash_type = fct_reorder(first_crash_type, n)) %>%
  group_by(injuries) %>%
  mutate(percent = n/sum(n)) %>%
  ungroup() %>%
  group_by(first_crash_type) %>%
  filter(sum(n)>1e4) %>%
  ungroup() %>%
  ggplot(aes(percent, first_crash_type, fill = injuries)) +
  geom_col(position = 'dodge', alpha = 0.8) +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x='Percent of Crashes', y=NULL, fill = 'Injuries?')

We can see that Rear End crashes are the most hurtful type, closely followed by turning and angle style crashes.

Now let’s look at where injuries are more common. Since we have both the longitude and latitude of each crash, we can create a map of the city and plot where injuries occur most.

crash %>%
  filter(latitude>0) %>%
  ggplot(aes(longitude, latitude, color = injuries)) +
  geom_point(size = 0.6, alpha = 0.5) +
  labs(color = NULL) +
  scale_color_manual(values = c("deeppink4", "gray80")) +
  coord_fixed() +
  theme(axis.title = element_blank())

The crashes with injuries seem to be dispered all throughout the city of Chicago.

As always, if you have a question or a suggestion related to the topic covered in this article, please feel free to contact me!

Ian Krupkin
Ian Krupkin
Statistics Major