6 Data Analysis I

In this chapter, we will begin analyzing data to answer questions from a journalistic perspective. I have categorized the questions into three types, each with a corresponding analytical approach:

Question Type	Story Angle	Example Functions
Single Variable Analysis	Understanding distributions and key statistics	`summary()`, `table()`, `unique()`
Time-Based Analysis	Identifying trends and patterns over time	`group_by()`, `geom_line()`
Group Comparisons	Comparing differences across categories	`group_by()`, `summarise()`

We will use the dataset saved in hksalary_cleaned.RData to demonstrate these analytical approaches.

6.1 Single Variable Analysis

library(tidyverse)

load("out/hksalary_cleaned.RData")
glimpse(df_clean)

Rows: 368
Columns: 4
$ year     <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2…
$ level    <chr> "Sub-degree", "Sub-degree", "Sub-degree", "Sub-degree", "Sub-…
$ category <chr> "Medicine, Dentistry and Health", "Sciences", "Engineering an…
$ salary   <dbl> 292, 125, 125, 139, 163, 122, 155, 346, 148, 154, 157, 155, 1…

Unique Values

# Unique degree levels
df_clean |> 
  distinct(level)

# Unique categories
unique(df_clean$category)

[1] "Medicine, Dentistry and Health" "Sciences"                      
[3] "Engineering and Technology"     "Business and Management"       
[5] "Social Sciences"                "Arts and Humanities"           
[7] "Education"

unique() vs distinct()

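Applied to a vector, unique() returns a vector of its unique values. distinct() works on a data frame and returns a data frame containing only the unique rows (or the unique combinations of the columns you name).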
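
To make the difference concrete, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical and are not taken from hksalary_cleaned.RData.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  level    = c("Sub-degree", "Sub-degree", "Undergraduate", "Undergraduate"),
  category = c("Sciences", "Education", "Sciences", "Sciences")
)

# unique() on a single column returns a plain character vector
levels_vec <- unique(toy$level)

# distinct() on a data frame returns a data frame (here a tibble)
levels_df <- toy |> distinct(level)

# With several columns, distinct() keeps each unique combination once
combos <- toy |> distinct(level, category)

levels_vec
levels_df
combos
```

A vector is handy when you only need the values themselves, while the data frame from distinct() can go straight into the next step of a pipe.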

# Category distribution
df_clean |> 
  count(category) |> 
  arrange(desc(n))

count()

Using count() on a single variable in this dataset may not provide meaningful insights, because it only reports how many rows each category has. In other contexts, such as the “Billboard Hot100” dataset, count() is more useful: it can count the number of Hot100 songs by each artist, or the number of weeks each song spent on the chart.
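
The chart example can be sketched as follows. The data frame `chart`, its column names, and its values are all invented for illustration; a real Billboard dataset may be structured differently.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart
chart <- tibble(
  artist = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B", "Artist C"),
  song   = c("Song 1", "Song 1", "Song 2", "Song 3", "Song 3", "Song 4"),
  week   = c(1, 2, 1, 1, 2, 1)
)

# Weeks on the chart for each song: count the rows per artist-song pair
weeks_per_song <- chart |>
  count(artist, song, name = "weeks") |>
  arrange(desc(weeks))

# Number of charting songs per artist: drop repeated weeks first, then count
songs_per_artist <- chart |>
  distinct(artist, song) |>
  count(artist, name = "songs") |>
  arrange(desc(songs))

weeks_per_song
songs_per_artist
```

Here the counts carry meaning because each row is an event (a week on the chart), so the number of rows per group answers a real question.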

Salary Baseline Analysis

# Overall salary distribution
df_clean |> 
  summarise(
    avg = mean(salary),
    median = median(salary),
    top_10 = quantile(salary, 0.9)
  )

6.2 Tracking Changes: Time-Based Analysis

Salary Evolution by Year

df_clean |> 
  group_by(year) |> 
  summarise(avg_salary = mean(salary))
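
A yearly summary like this is often easier to read as a line chart. The following is a minimal sketch using geom_line() on made-up numbers; the data frame `toy`, its values, and the axis labels are assumptions for illustration, not results from the salary dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical yearly data, invented for illustration only
toy <- tibble(
  year   = c(2009, 2009, 2010, 2010, 2011, 2011),
  salary = c(100, 120, 110, 130, 125, 145)
)

# Same pattern as above: one average per year
yearly <- toy |>
  group_by(year) |>
  summarise(avg_salary = mean(salary))

# One point per year, connected by a line
p <- ggplot(yearly, aes(x = year, y = avg_salary)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Average salary")

p
```

With the real data, replacing `toy` with df_clean should produce the corresponding chart.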

The animations below illustrate how group_by() and summarise() work together: group_by() splits the rows into groups, and summarise() collapses each group into a single row.

# group_by cat1
knitr::include_graphics("images/grp-summarize-01.mp4")

# group_by cat2
knitr::include_graphics("images/grp-summarize-02.mp4")

# group_by cat1 and cat2
knitr::include_graphics("images/grp-summarize-03.mp4")
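
The same three cases can be reproduced in code. This is a minimal sketch on a made-up data frame; `toy` and the `value` column are hypothetical, while `cat1` and `cat2` mirror the labels used for the animations.

```r
library(dplyr)

# Hypothetical toy data with two grouping columns
toy <- tibble(
  cat1  = c("a", "a", "b", "b"),
  cat2  = c("x", "y", "x", "y"),
  value = c(10, 20, 30, 40)
)

# Group by cat1: one output row per value of cat1
by_cat1 <- toy |>
  group_by(cat1) |>
  summarise(avg = mean(value))

# Group by cat2: one output row per value of cat2
by_cat2 <- toy |>
  group_by(cat2) |>
  summarise(avg = mean(value))

# Group by both: one output row per combination of cat1 and cat2.
# .groups = "drop" returns an ungrouped result.
by_both <- toy |>
  group_by(cat1, cat2) |>
  summarise(avg = mean(value), .groups = "drop")

by_cat1
by_cat2
by_both
```

The number of rows in each result equals the number of groups, which is the main thing to check when a grouped summary looks wrong.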

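# Year-over-year change in the average salary
df_clean |> 
  group_by(year) |> 
  summarise(avg_salary = mean(salary)) |> 
  mutate(
    change = avg_salary - lag(avg_salary),   # absolute change from the previous year
    pct_change = change / lag(avg_salary)    # change as a proportion of the previous year
  )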

lag() Function

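The lag() function shifts a vector by one position, so lag(avg_salary) holds the previous year’s average on each row. Subtracting it from avg_salary gives the year-over-year change, and dividing that change by lag(avg_salary) gives the proportional change. The first year has no previous value, so both new columns are NA on that row.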
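
A small worked example may help here. The yearly averages below are made up, and the helper column `prev` is added only to make the shifted values visible.

```r
library(dplyr)

# Hypothetical yearly averages, invented for illustration only
yearly <- tibble(
  year       = c(2009, 2010, 2011),
  avg_salary = c(100, 110, 99)
)

changes <- yearly |>
  arrange(year) |>   # lag() follows row order, so sort by year first
  mutate(
    prev       = lag(avg_salary),      # previous year's value; NA for the first row
    change     = avg_salary - prev,    # absolute change
    pct_change = change / prev         # proportional change (0.1 means 10%)
  )

changes
```

Because lag() depends on row order, it is worth sorting explicitly whenever the input might not already be ordered by year.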

6.3 Revealing Disparities: Group Comparisons

Degree Level Comparison

df_clean |> 
  group_by(level) |> 
  summarise(avg_salary = mean(salary))

Top Earning Fields

df_clean |> 
  group_by(category) |> 
  summarise(avg_salary = mean(salary)) |> 
  arrange(desc(avg_salary))   # highest-paying fields first

6.4 Key Functions Recap

Function	Package	Purpose	Example Use
`distinct()`	`dplyr`	Returns unique rows based on specified columns	`df_clean |> distinct(level)`
`unique()`	Base R	Returns unique values in a vector	`unique(df_clean$category)`
`count()`	`dplyr`	Counts frequency of unique values in a column	`df_clean |> count(category) |> arrange(desc(n))`
`summarise()`	`dplyr`	Computes summary statistics for variables	`df_clean |> summarise(avg = mean(salary), median = median(salary), top_10 = quantile(salary, 0.9))`
`group_by()`	`dplyr`	Groups data by a variable for summary operations	`df_clean |> group_by(year) |> summarise(avg_salary = mean(salary))`
`mutate()`	`dplyr`	Creates or modifies columns in a data frame	`df_clean |> group_by(year) |> summarise(avg_salary = mean(salary)) |> mutate(change = avg_salary - lag(avg_salary), pct_change = change / lag(avg_salary))`
`lag()`	`dplyr`	Returns the previous value of each element in a vector (shifted by one position by default)	`lag(avg_salary)`