library(tidyverse)
library(rnaturalearth) # install if not available
5 Pivoting and Joining
5.1 Overview
In this chapter, we will explore the life expectancy data across countries. Life expectancy is a key indicator of a country’s overall health and well-being. By analyzing this data, we can identify trends and disparities in life expectancy across different regions and countries.
About the data
This dataset was downloaded from the World Bank website and saved as a CSV file named life_expectancy.csv
, stored in the data
folder. The dataset was last updated on March 24, 2025, contains the life expectancy at birth for various countries over the years.
Learning Objectives
This chapter will demonstrate two data wrangling techniques:
- Transforming wide data to long data format using
pivot_longer()
. - Merging datasets using
left_join()
for map visualization.
5.2 Load data and packages
To merge the life expectancy data with the world map data, we will use the rnaturalearth
package for world map data.
# Load the life expectancy data
<- read_csv("data/life_expectancy.csv") lifex
New names:
Rows: 266 Columns: 69
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(4): Country Name, Country Code, Indicator Name, Indicator Code dbl (63): 1960,
1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, ... lgl (2): 2023,
...69
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...69`
glimpse(lifex)
Rows: 266
Columns: 69
$ `Country Name` <chr> "Aruba", "Africa Eastern and Southern", "Afghanistan"…
$ `Country Code` <chr> "ABW", "AFE", "AFG", "AFW", "AGO", "ALB", "AND", "ARB…
$ `Indicator Name` <chr> "Life expectancy at birth, total (years)", "Life expe…
$ `Indicator Code` <chr> "SP.DYN.LE00.IN", "SP.DYN.LE00.IN", "SP.DYN.LE00.IN",…
$ `1960` <dbl> 64.15200, 44.09883, 32.53500, 37.84614, 38.21100, 54.…
$ `1961` <dbl> 64.53700, 44.40108, 33.06800, 38.16548, 37.26700, 55.…
$ `1962` <dbl> 64.75200, 44.76804, 33.54700, 38.73579, 37.53900, 56.…
$ `1963` <dbl> 65.13200, 44.92569, 34.01600, 39.06327, 37.82400, 57.…
$ `1964` <dbl> 65.29400, 45.49893, 34.49400, 39.33362, 38.13100, 58.…
$ `1965` <dbl> 65.50200, 45.51282, 34.95300, 39.61488, 38.49500, 60.…
$ `1966` <dbl> 66.06300, 45.26340, 35.45300, 39.83348, 38.75700, 60.…
$ `1967` <dbl> 66.43900, 45.93304, 35.92400, 39.46643, 39.09200, 61.…
$ `1968` <dbl> 66.75700, 46.22975, 36.41800, 40.07945, 39.48400, 62.…
$ `1969` <dbl> 67.16800, 46.43694, 36.91000, 40.34288, 39.82900, 63.…
$ `1970` <dbl> 67.58300, 46.72371, 37.41800, 41.02645, 40.19000, 64.…
$ `1971` <dbl> 67.97500, 47.20967, 37.92300, 41.54788, 40.55400, 65.…
$ `1972` <dbl> 68.57700, 46.91316, 38.44400, 42.24128, 40.90500, 66.…
$ `1973` <dbl> 69.09200, 47.72271, 39.00300, 42.84673, 41.27000, 67.…
$ `1974` <dbl> 69.50300, 47.63436, 39.55000, 43.48896, 41.65200, 67.…
$ `1975` <dbl> 69.76200, 47.79288, 40.10000, 44.19265, 41.19100, 68.…
$ `1976` <dbl> 70.03500, 48.37347, 40.64500, 44.99494, 41.16300, 68.…
$ `1977` <dbl> 70.26400, 48.65676, 41.22800, 45.71210, 41.43700, 69.…
$ `1978` <dbl> 70.49400, 48.78190, 40.27100, 46.26202, 41.83000, 69.…
$ `1979` <dbl> 70.77800, 49.27690, 39.08600, 46.66625, 42.17500, 69.…
$ `1980` <dbl> 71.06600, 49.65491, 39.61800, 47.00825, 42.44900, 70.…
$ `1981` <dbl> 71.72200, 50.08262, 40.16400, 47.29049, 42.77200, 70.…
$ `1982` <dbl> 71.95900, 50.33266, 37.76600, 47.52312, 43.05100, 71.…
$ `1983` <dbl> 72.10500, 48.75816, 38.18700, 47.77922, 42.09200, 71.…
$ `1984` <dbl> 72.25100, 48.71600, 33.32900, 47.92625, 42.35300, 71.…
$ `1985` <dbl> 72.38800, 49.08027, 33.55000, 48.01570, 42.64800, 71.…
$ `1986` <dbl> 72.46200, 49.71145, 39.39600, 48.06035, 42.84300, 71.…
$ `1987` <dbl> 72.78900, 50.11539, 39.84400, 48.22949, 40.91700, 72.…
$ `1988` <dbl> 73.04700, 49.37453, 43.95800, 48.50468, 41.54500, 72.…
$ `1989` <dbl> 73.02300, 50.72100, 45.15800, 48.68021, 41.76500, 72.…
$ `1990` <dbl> 73.07600, 50.64266, 45.96700, 48.63821, 41.89300, 73.…
$ `1991` <dbl> 73.10000, 50.41606, 46.66300, 48.64873, 43.81300, 73.…
$ `1992` <dbl> 73.17900, 49.97447, 47.59600, 48.72149, 42.20900, 73.…
$ `1993` <dbl> 73.22500, 50.28794, 51.46600, 48.81533, 42.10100, 73.…
$ `1994` <dbl> 73.27200, 50.91639, 51.49500, 48.66470, 43.42200, 74.…
$ `1995` <dbl> 73.34900, 51.01604, 52.54400, 48.76475, 45.84900, 74.…
$ `1996` <dbl> 73.44800, 50.81934, 53.24300, 48.88525, 46.03300, 74.…
$ `1997` <dbl> 73.45200, 50.98197, 53.63400, 49.05807, 46.30600, 73.…
$ `1998` <dbl> 73.49100, 50.32711, 52.94300, 49.30990, 45.05700, 74.…
$ `1999` <dbl> 73.56100, 51.25887, 54.84600, 49.72643, 45.38600, 75.…
$ `2000` <dbl> 73.56900, 51.98699, 55.29800, 50.19459, 46.02400, 75.…
$ `2001` <dbl> 73.64700, 52.21224, 55.79800, 50.53735, 46.59000, 75.…
$ `2002` <dbl> 73.72600, 52.56278, 56.45400, 50.89809, 47.38600, 75.…
$ `2003` <dbl> 73.75200, 53.03630, 57.34400, 51.37161, 49.61700, 76.…
$ `2004` <dbl> 73.57600, 53.55188, 57.94400, 51.78477, 50.59200, 76.…
$ `2005` <dbl> 73.81100, 54.22295, 58.36100, 52.30806, 51.57000, 76.…
$ `2006` <dbl> 74.02600, 55.15772, 58.68400, 52.79254, 52.36900, 76.…
$ `2007` <dbl> 74.21000, 55.93784, 59.11100, 53.20900, 53.64200, 77.…
$ `2008` <dbl> 74.14700, 56.68074, 59.85200, 53.59415, 54.63300, 77.…
$ `2009` <dbl> 74.56000, 57.62151, 60.36400, 54.11171, 55.75200, 77.…
$ `2010` <dbl> 75.40400, 58.41032, 60.85100, 54.50047, 56.72600, 77.…
$ `2011` <dbl> 75.46500, 59.29332, 61.41900, 54.96017, 57.59600, 78.…
$ `2012` <dbl> 75.53100, 60.05153, 61.92300, 55.28394, 58.62300, 78.…
$ `2013` <dbl> 75.63600, 60.71003, 62.41700, 55.61514, 59.30700, 78.…
$ `2014` <dbl> 75.60100, 61.33881, 62.54500, 55.86366, 60.04000, 78.…
$ `2015` <dbl> 75.68300, 61.85686, 62.65900, 56.13527, 60.65500, 78.…
$ `2016` <dbl> 75.61700, 62.44464, 63.13600, 56.51809, 61.09200, 78.…
$ `2017` <dbl> 75.90300, 62.92481, 63.01600, 56.82663, 61.68000, 79.…
$ `2018` <dbl> 76.07200, 63.36704, 63.08100, 57.12971, 62.14400, 79.…
$ `2019` <dbl> 76.24800, 63.75475, 63.56500, 57.50029, 62.44800, 79.…
$ `2020` <dbl> 75.72300, 63.30979, 62.57500, 57.18067, 62.26100, 76.…
$ `2021` <dbl> 74.62600, 62.44909, 61.98200, 56.94647, 61.64300, 76.…
$ `2022` <dbl> 74.99200, 62.88846, 62.87900, 57.58911, 61.92900, 76.…
$ `2023` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ...69 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Based on the output of glimpse()
, we can see that the dataset contains 217 rows and 69 columns. However, the last 2 columns contain all NA values, so we may want to remove them from analysis.
You’ll notice that this format isn’t ideal for data visualization. Columns 5 to 67 represent years, with their values corresponding to life expectancy for each year. However, a cleaner format would be a long data structure, where each row represents a country, a year, and the corresponding life expectancy.
To achieve this, we need to convert the wide format into a long format using the pivot_longer()
function.
5.3 Long vs Wide Data
In R, we generally prefer working with the long data (tidy) format for data analysis, as it aligns with the principles of tidy data. A typical tidy dataset follows these rules:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
In contrast, wide data format has multiple columns for the same variable, which can make it harder to work with for data analysis and visualization. For example, in this life expectancy dataset, each year is represented as a separate column, however, ideally we want all years to be in a single column named “year”, and all values to be in a single column named “life_expectancy”.
5.4 Transform Wide to Long Data
Let’s check this animation made by Garrick Aden-Buie to understand the transformation from wide to long data format.
# include a gif
::include_graphics("images/tidyr-pivoting.gif") knitr
# Transform wide data to long data format
<- lifex |>
lifex_long rename(country = `Country Name`,
country_code = `Country Code`) |>
pivot_longer(cols = 5:67,
names_to = "year",
values_to = "life_expectancy") |>
mutate(year = as.numeric(year)) |>
select(country, country_code, year, life_expectancy)
glimpse(lifex_long)
Rows: 16,758
Columns: 4
$ country <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", …
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW"…
$ year <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, …
$ life_expectancy <dbl> 64.152, 64.537, 64.752, 65.132, 65.294, 65.502, 66.063…
summary(lifex_long)
country country_code year life_expectancy
Length:16758 Length:16758 Min. :1960 Min. :11.99
Class :character Class :character 1st Qu.:1975 1st Qu.:57.03
Mode :character Mode :character Median :1991 Median :66.96
Mean :1991 Mean :64.39
3rd Qu.:2007 3rd Qu.:72.65
Max. :2022 Max. :85.53
NA's :634
5.5 Merging with World Map Data: mutating join
OK, now we have cleaned up the life expectancy data and transformed it into a long format. Next, we will merge this data with the world map data to prepare for map visualization.
We will load the world map data from the package rnaturalearth
, which provides spatial data for countries.
Load World Map Data
<- ne_countries(scale = "medium", returnclass = "sf") |>
world select(iso_a3, geometry)
ne_countries()
function from the rnaturalearth
package returns a spatial feature object with country geometries. ne
stands for Natural Earth, which is a public domain map dataset. We select the iso_a3
column as the country code (because it’s in the same format as the country
column in the world bank data) and the geometry
column for plotting.
Merge Data by left_join()
Then, we will use the left_join()
function to merge the life expectancy data with the world map data based on the country_code
and iso_a3
columns.
Here is how the left_join()
function works (by Garrick Aden-Buie):
::include_graphics("images/left-join.gif") knitr
Here we will use left_join
to merge the world map data with the life expectancy data based on the country codes. The reason we let world
be the left table is that we want to keep all countries in the world map data, even if they don’t have life expectancy data. Otherwise, we would lose some countries in the map.
# Merge life expectancy data with world map data
<- world |>
df_merge left_join(lifex_long, by = c("iso_a3" = "country_code"))
glimpse(df_merge)
Rows: 13,386
Columns: 5
$ iso_a3 <chr> "ZWE", "ZWE", "ZWE", "ZWE", "ZWE", "ZWE", "ZWE", "ZWE"…
$ country <chr> "Zimbabwe", "Zimbabwe", "Zimbabwe", "Zimbabwe", "Zimba…
$ year <dbl> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, …
$ life_expectancy <dbl> 53.235, 53.619, 54.071, 54.549, 54.994, 55.459, 55.889…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((31.28789 -2..., MULTIPOLY…
summary(df_merge)
iso_a3 country year life_expectancy
Length:13386 Length:13386 Min. :1960 Min. :11.99
Class :character Class :character 1st Qu.:1975 1st Qu.:57.53
Mode :character Mode :character Median :1991 Median :67.31
Mean :1991 Mean :64.57
3rd Qu.:2007 3rd Qu.:72.83
Max. :2022 Max. :85.53
NA's :30 NA's :601
geometry
MULTIPOLYGON :13386
epsg:4326 : 0
+proj=long...: 0
Here we found there are some NAs for year and life_expectancy, which means some countries don’t have life expectancy data in some years.
Clean up NAs and spatial features
|>
df_merge filter(is.na(year) | is.na(life_expectancy))
library(sf)
Warning: package 'sf' was built under R version 4.3.3
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
<- df_merge |>
df_merge filter(!is.na(year) & !is.na(life_expectancy))
5.6 Save the Merged Data
# Save the merged data to .RData
save(df_merge, file = "out/expectancy_merged.RData")
See Chapter 10 for following data analysis and visualization.