14 Paris 2024 Olympics

14.1 Data Import

library(tidyverse)
df <- read_csv("data/medallists.csv")
colnames(df)

 [1] "medal_date"   "medal_type"   "medal_code"   "name"         "gender"      
 [6] "country_code" "country"      "country_long" "nationality"  "team"        
[11] "team_gender"  "discipline"   "event"        "event_type"   "url_event"   
[16] "birth_date"   "code_athlete" "code_team"

14.2 Data Clean

df_clean <- df |>
  select(name, gender, country, medal_type, birth_date, discipline, event, code_athlete, country_code)

glimpse(df_clean)

Rows: 2,315
Columns: 9
$ name         <chr> "EVENEPOEL Remco", "GANNA Filippo", "van AERT Wout", "BRO…
$ gender       <chr> "Male", "Male", "Male", "Female", "Female", "Female", "Ma…
$ country      <chr> "Belgium", "Italy", "Belgium", "Australia", "Great Britai…
$ medal_type   <chr> "Gold Medal", "Silver Medal", "Bronze Medal", "Gold Medal…
$ birth_date   <date> 2000-01-25, 1996-07-25, 1994-09-15, 1992-07-07, 1998-11-…
$ discipline   <chr> "Cycling Road", "Cycling Road", "Cycling Road", "Cycling …
$ event        <chr> "Men's Individual Time Trial", "Men's Individual Time Tri…
$ code_athlete <dbl> 1903136, 1923520, 1903147, 1940173, 1912525, 1955079, 192…
$ country_code <chr> "BEL", "ITA", "BEL", "AUS", "GBR", "USA", "KOR", "TUN", "…

14.3 Data Analysis

Q1: Who won the most medals in Paris 2024?
Q2: Who won the most gold medals in Paris 2024?
Q3: Which country has the most medallists in Paris 2024?

Q1: Who won the most medals in Paris 2024?

df_clean |>
  group_by(name) |>
  summarise(medal_count = n()) |>
  arrange(desc(medal_count)) |>
  head(10)
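Grouping by name alone would merge two different athletes who happen to share a name. A minimal sketch of one way to guard against that is to group by code_athlete as well. The toy rows below are invented for illustration and only stand in for df_clean.

```r
library(dplyr)
library(tibble)

# Toy stand-in for df_clean; these rows are invented for illustration only.
toy <- tibble(
  name         = c("DOE Jane", "DOE Jane", "DOE Jane", "ROE Sam"),
  code_athlete = c(1, 1, 2, 3),
  medal_type   = c("Gold Medal", "Silver Medal", "Bronze Medal", "Gold Medal")
)

# Grouping by code_athlete as well as name keeps same-named athletes apart.
medal_counts <- toy |>
  group_by(code_athlete, name) |>
  summarise(medal_count = n(), .groups = "drop") |>
  arrange(desc(medal_count))

medal_counts
```

Here the two athletes named "DOE Jane" stay on separate rows because their code_athlete values differ.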

Q2: Who won the most gold medals in Paris 2024?

df_clean |>
  filter(medal_type == "Gold Medal") |>
  group_by(name) |>
  summarise(gold_count = n()) |>
  arrange(desc(gold_count)) |>
  head(10)

Q3: Which country has the most medallists in Paris 2024?

df_clean |>
  group_by(country) |>
  summarise(medallist_count = n()) |>
  arrange(desc(medallist_count)) |>
  head(10)
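Note that n() counts rows, so an athlete who won several medals is counted once per medal. If the goal is the number of distinct medallists, n_distinct(code_athlete) is one option. The sketch below uses invented toy rows and assumes code_athlete uniquely identifies an athlete.

```r
library(dplyr)
library(tibble)

# Toy stand-in for df_clean; athlete 1 has two medal rows.
toy <- tibble(
  country      = c("A", "A", "A", "B"),
  code_athlete = c(1, 1, 2, 3)
)

by_country <- toy |>
  group_by(country) |>
  summarise(
    medal_rows      = n(),                      # one per medal record
    medallist_count = n_distinct(code_athlete)  # unique athletes only
  ) |>
  arrange(desc(medallist_count))

by_country
```

For country "A" the two columns differ, which shows how much repeat medallists inflate the row count.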

14.4 Visualization

Bar Chart

We will use a bar chart to show the top 20 countries with the most medallists in Paris 2024.

df_clean |>
  group_by(country) |>
  summarise(medallist_count = n()) |>
  arrange(desc(medallist_count)) |>
  head(20) |>
  ggplot(aes(x = reorder(country, medallist_count), y = medallist_count)) +  # (1)
  geom_col() +  # (2)
  coord_flip() +  # (3)
  labs(title = "Top 20 Countries with Most Medallists in Paris 2024",
       x = "Country",
       y = "Number of Medallists") +
  theme_minimal()

1: Use reorder() to reorder countries based on the number of medallists.
2: Use geom_col() to create a bar chart.
3: Use coord_flip() to flip the x and y axes because we want to show the countries on the y-axis for better readability.

Note

Why do we need reorder() when we have already used arrange()?

1. arrange() reorders the rows of the data frame, but ggplot2 does not use row order to position categories on a discrete axis. country is a character column, so ggplot2 places its values in alphabetical order, which is also the default order of factor levels.
2. reorder(country, medallist_count) inside aes() converts country to a factor whose levels are sorted by medallist_count. The bars then appear in order of medallist count along the country axis (the y-axis here, because of coord_flip()).
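The difference can be checked outside ggplot2 with base R alone. This is a small sketch with made-up counts, not values from the dataset.

```r
# Made-up values for illustration only.
country <- c("Chile", "Belgium", "Australia")
medallist_count <- c(5, 12, 9)

# Default factor levels follow the sorted (alphabetical) order of the values.
default_levels <- levels(factor(country))

# reorder() sorts the levels by ascending medallist_count instead.
reordered <- reorder(country, medallist_count)
reordered_levels <- levels(reordered)

default_levels
reordered_levels
```

The levels returned by reorder() run from the smallest count to the largest, which is the order ggplot2 uses along the axis.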

Scatter Plot

We will use a scatter plot to show the relationship between the age of the medallists and the number of medallists at each age.

df_clean |>
  mutate(age = 2024 - year(birth_date)) |>  # approximate age: ignores month and day
  filter(!is.na(age)) |>
  group_by(age) |>
  summarise(medallist_count = n()) |>
  ggplot(aes(x = age, y = medallist_count)) +  # (1)
  geom_point() +  # (2)
  geom_smooth(method = "lm") +  # (3)
  labs(title = "Relationship between Age and Number of Medallists",
       x = "Age",
       y = "Number of Medallists") +
  theme_minimal()

1: Use aes() to specify the x and y variables.
2: Use geom_point() to create a scatter plot.
3: Use geom_smooth(method = "lm") to add a linear regression line to the plot.

`geom_smooth()` using formula = 'y ~ x'
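The expression 2024 - year(birth_date) gives the age an athlete reaches during 2024, not necessarily the age at the Games. A sketch of a more precise calculation with lubridate follows. The reference date (the opening ceremony on 26 July 2024) is an assumption chosen for illustration, and the birth dates are the first three values shown in the glimpse() output earlier.

```r
library(lubridate)

# First three birth dates from the glimpse() output of df_clean.
birth_date <- ymd(c("2000-01-25", "1996-07-25", "1994-09-15"))

# Assumed reference date: the opening ceremony on 26 July 2024.
ref_date <- ymd("2024-07-26")

# Year difference only, as used in the plot code.
age_by_year <- 2024 - year(birth_date)

# Completed years between birth date and the reference date.
age_exact <- floor(time_length(interval(birth_date, ref_date), "years"))

data.frame(birth_date, age_by_year, age_exact)
```

The two columns differ only for athletes whose 2024 birthday falls after the reference date, so the overall shape of the scatter plot should change little.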

World Map

Use a world map to show the distribution of medallists in Paris 2024.

Here we will use two packages: rnaturalearth and rnaturalearthdata to get the world map data. The rnaturalearth package provides access to the Natural Earth dataset, which contains a variety of geospatial data, including country boundaries. The rnaturalearthdata package contains the data files needed to create maps using the rnaturalearth package.

library(rnaturalearth)
library(rnaturalearthdata)

Attaching package: 'rnaturalearthdata'

The following object is masked from 'package:rnaturalearth':

    countries110

world <- ne_countries(scale = "medium", returnclass = "sf")  # (1)

1: Use ne_countries() to get the world map data. The scale parameter specifies the level of detail for the map, and the returnclass parameter specifies the class of the returned object (in this case, a simple features object).

df_clean |>
  group_by(country_code) |>
  summarise(medallist_count = n()) |>
  left_join(world, by = c("country_code" = "iso_a3")) |>  # (1)
  ggplot() +  # (2)
  geom_sf(aes(fill = medallist_count, geometry = geometry)) +  # (3)
  scale_fill_viridis_c(option = "B", direction = -1, begin = 0.3, end = 0.8) +  # (4)
  labs(title = "Distribution of Medallists in Paris 2024") +
  theme_bw() +
  theme(
    panel.grid.major = element_blank(),  # Remove major grid lines
    panel.grid.minor = element_blank(),  # Remove minor grid lines
    axis.text = element_blank(),         # Remove axis text
    axis.title = element_blank(),        # Remove axis titles
    axis.ticks = element_blank(),        # Remove axis ticks
    panel.border = element_blank(),      # Remove panel border if desired
    plot.background = element_rect(fill = "white", colour = NA),  # Set plot background color
    legend.position = "bottom",          # Adjust legend position
    plot.title = element_text(hjust = 0.5)  # Center the plot title
  )

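1: Use left_join() to join the medallist counts with the world data by matching country_code to iso_a3. iso_a3 is the ISO 3166-1 alpha-3 country code, such as “USA” for the United States. Because the join starts from the medallist counts, countries without medallists are not drawn. If country_code holds IOC-style codes, any code that differs from its ISO counterpart (for example, “GER” versus “DEU” for Germany) will not match and that country will also be missing from the map.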
2: Use ggplot() to create a plot.
3: Use geom_sf() to create a map. The fill aesthetic is set to medallist_count to color each country by its number of medallists. The geometry aesthetic points to the geometry column that came from the world data, so the map is drawn from the country boundaries. It is set explicitly because the result of left_join() here is a regular data frame rather than an sf object.
4: Use scale_fill_viridis_c() to set the color scale for the map. The option parameter specifies the color palette, direction specifies the direction of the color gradient, and begin and end specify the color range.
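Before plotting, it can help to check which country codes fail to match. The sketch below uses anti_join() for that and shows a join that starts from the map side, so that every country is kept. Both tables are invented toy stand-ins (world_codes replaces the sf object world), so the counts are not real results.

```r
library(dplyr)
library(tibble)

# Toy stand-ins with invented counts; "GER" is the IOC code for Germany.
medal_counts <- tibble(
  country_code    = c("BEL", "ITA", "GER"),
  medallist_count = c(10, 40, 33)
)
world_codes <- tibble(
  iso_a3 = c("BEL", "ITA", "DEU", "ISL"),
  name   = c("Belgium", "Italy", "Germany", "Iceland")
)

# Codes in the medal data that have no partner in the map data.
unmatched <- medal_counts |>
  anti_join(world_codes, by = c("country_code" = "iso_a3"))

# Joining from the map side keeps all countries; unmatched ones get NA.
map_data <- world_codes |>
  left_join(medal_counts, by = c("iso_a3" = "country_code"))

unmatched
map_data
```

With the real data, the rows in unmatched would indicate which codes need recoding before the join, and countries with NA counts would be drawn in the scale's na.value color.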

Note

geom_sf is a function in the ggplot2 package used to visualize simple features (spatial data) in R. It is specifically designed to handle geospatial objects, such as polygons, points, and lines that represent geographic data. The “sf” stands for simple features, a standard way to encode spatial vector data.