6  Data Analysis I

In this chapter, we will begin analyzing data to answer questions from a journalistic perspective. I have categorized the questions into three types, each with a corresponding analytical approach:

Question Type             Story Angle                                     Example Functions
Single Variable Analysis  Understanding distributions and key statistics  summary(), table(), unique()
Time-Based Analysis       Identifying trends and patterns over time       group_by(), geom_line()
Group Comparisons         Comparing differences across categories         group_by(), summarise()

We will use the cleaned dataset stored in hksalary_cleaned.RData to demonstrate these analytical approaches.

6.1 Single Variable Analysis

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
load("out/hksalary_cleaned.RData")
glimpse(df_clean)
Rows: 368
Columns: 4
$ year     <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2…
$ level    <chr> "Sub-degree", "Sub-degree", "Sub-degree", "Sub-degree", "Sub-…
$ category <chr> "Medicine, Dentistry and Health", "Sciences", "Engineering an…
$ salary   <dbl> 292, 125, 125, 139, 163, 122, 155, 346, 148, 154, 157, 155, 1…

Unique Values

# Unique degree levels
df_clean |> 
  distinct(level)
# Unique categories
unique(df_clean$category)
[1] "Medicine, Dentistry and Health" "Sciences"                      
[3] "Engineering and Technology"     "Business and Management"       
[5] "Social Sciences"                "Arts and Humanities"           
[7] "Education"                     
unique() vs distinct()

The unique() function takes a vector and returns a vector of its unique values, while distinct() takes a data frame and returns a data frame containing only the unique rows (here, one row per degree level).
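The difference in return type can be checked on a small made-up data frame. This is a minimal sketch: the `toy` data frame and its values are illustrative assumptions and are not part of the salary dataset.

```r
library(dplyr)

# Toy data for illustration only; not taken from hksalary_cleaned.RData
toy <- data.frame(
  level    = c("Sub-degree", "Sub-degree", "Undergraduate"),
  category = c("Sciences", "Education", "Sciences")
)

# unique() on a column returns a plain character vector
lvl_vec <- unique(toy$level)
class(lvl_vec)

# distinct() returns a data frame with one row per unique value
lvl_tbl <- toy |> distinct(level)
class(lvl_tbl)

# pull() turns the distinct() result back into a vector when one is needed
identical(lvl_tbl |> pull(level), lvl_vec)
```

In a pipeline that continues with other dplyr verbs, distinct() is usually the more convenient choice because its result is still a data frame.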

# Category distribution
df_clean |> 
  count(category) |> 
  arrange(desc(n)) 
count()

Using count() on a single variable in this dataset may not provide meaningful insights, because it only reports how many rows each category has. In other contexts, such as the “Billboard Hot 100” dataset, count() is useful for counting the number of Hot 100 songs by each artist, or the number of weeks each song was on the chart.
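To make the chart example concrete, here is a hedged sketch on made-up data. The `chart` data frame, its column names, and the artist and song labels are all hypothetical; the real Billboard dataset may be structured differently.

```r
library(dplyr)

# Made-up chart entries for illustration; not real Billboard data.
# Each row is one song appearing on the chart in one week.
chart <- data.frame(
  week   = c(1, 1, 2, 2, 3, 3),
  artist = c("Artist A", "Artist B", "Artist A", "Artist C", "Artist A", "Artist B"),
  song   = c("Song X", "Song Y", "Song X", "Song Z", "Song W", "Song Y")
)

# Weeks on the chart for each song: count the rows per artist-song pair
weeks_on_chart <- chart |>
  count(artist, song, name = "weeks") |>
  arrange(desc(weeks))

# Number of different charting songs per artist:
# drop repeated weeks first, then count the remaining rows per artist
songs_per_artist <- chart |>
  distinct(artist, song) |>
  count(artist, name = "songs", sort = TRUE)

weeks_on_chart
songs_per_artist
```

The distinct() step matters in the second query: without it, count(artist) would return chart appearances rather than songs.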

Salary Baseline Analysis

# Overall salary distribution
df_clean |> 
  summarise(
    avg = mean(salary),
    median = median(salary),
    top_10 = quantile(salary, 0.9)
  )

6.2 Tracking Changes: Time-Based Analysis

Salary Evolution 2014-2023

# Average salary by year
df_clean |> 
  group_by(year) |> 
  summarise(avg_salary = mean(salary)) 
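A yearly summary like this can be drawn as a line chart with geom_line(). The sketch below uses placeholder numbers so that it runs on its own; the `toy` data frame and its values are assumptions for illustration, and in practice `df_clean` would take its place.

```r
library(dplyr)
library(ggplot2)

# Placeholder values for illustration; replace toy with df_clean in practice
toy <- data.frame(
  year   = c(2021, 2021, 2022, 2022, 2023, 2023),
  salary = c(100, 200, 120, 220, 150, 250)
)

# Same summary as in the chapter: one average per year
yearly <- toy |>
  group_by(year) |>
  summarise(avg_salary = mean(salary))

# One point per year, connected by a line
p <- ggplot(yearly, aes(x = year, y = avg_salary)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Average salary")
```

Printing `p` (or calling `print(p)` in a script) renders the chart.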
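# Year-over-year change in average salary
df_clean |> 
  group_by(year) |> 
  summarise(avg_salary = mean(salary)) |> 
  mutate(
    change = avg_salary - lag(avg_salary),  # difference from the previous year
    pct_change = change / lag(avg_salary)   # a proportion; multiply by 100 for percent
  )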
lag() Function

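The lag() function returns the previous row's value, so avg_salary - lag(avg_salary) is the difference between the current and previous year's average salary. The first year has no previous value, so its change is NA. Because lag() depends on row order, the rows must be sorted by year; group_by(year) followed by summarise() already returns them in that order. Comparing each year with the one before it helps identify trends and changes over time.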
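A worked example on four made-up yearly averages shows what each step produces. The `yearly` data frame and its numbers are illustrative assumptions, not results from the salary dataset, and `pct_change` is multiplied by 100 here so that it reads as a percentage.

```r
library(dplyr)

# Made-up yearly averages for illustration only
yearly <- data.frame(
  year       = c(2020, 2021, 2022, 2023),
  avg_salary = c(100, 110, 99, 121)
)

res <- yearly |>
  arrange(year) |>                     # lag() depends on row order
  mutate(
    prev       = lag(avg_salary),      # previous year's value; NA for the first year
    change     = avg_salary - prev,    # absolute year-over-year change
    pct_change = change / prev * 100   # change as a percentage of the previous year
  )

res
```

For 2022, `prev` is 110 and `change` is 99 - 110 = -11, which is -10 percent of the previous year's value.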

6.3 Revealing Disparities: Group Comparisons

Degree Level Comparison

df_clean |> 
  group_by(level) |> 
  summarise(avg_salary = mean(salary))

Top Earning Fields

df_clean |> 
  group_by(category) |> 
  summarise(avg_salary = mean(salary)) |> 
  arrange(desc(avg_salary))

6.4 Key Functions Recap

Function     Package  Purpose                                           Example Use
distinct()   dplyr    Returns unique rows based on specified columns    df_clean |> distinct(level)
unique()     Base R   Returns unique values in a vector                 unique(df_clean$category)
count()      dplyr    Counts frequency of unique values in a column     df_clean |> count(category) |> arrange(desc(n))
summarise()  dplyr    Computes summary statistics for variables         df_clean |> summarise(avg = mean(salary), median = median(salary), top_10 = quantile(salary, 0.9))
group_by()   dplyr    Groups data by a variable for summary operations  df_clean |> group_by(year) |> summarise(avg_salary = mean(salary))
mutate()     dplyr    Creates or modifies columns in a data frame       df_clean |> group_by(year) |> summarise(avg_salary = mean(salary)) |> mutate(change = avg_salary - lag(avg_salary), pct_change = change / lag(avg_salary))
lag()        dplyr    Returns the previous values of a vector           lag(avg_salary)