6  Data Analysis I

Part 3: Data Analysis

With your data now cleaned and ready, it’s time to ask the questions that matter for journalism. Data analysis isn’t about running tests—it’s about discovering why something matters for your audience.

What you’ll learn:

- Transform raw statistics into story angles
- Recognize patterns that signal newsworthy trends
- Compare groups to find inequality or change
- Use data exploration to generate questions instead of just reporting answers

In this part, you’ll work with real Hong Kong employment data and global health statistics to see how analytical techniques reveal different story angles from the same dataset.

In this chapter, we will begin analyzing data to answer questions from a journalistic perspective. I have categorized the questions into three types, each with a corresponding analytical approach:

Question Type             Story Angle                                     Example Functions
------------------------  ----------------------------------------------  ----------------------------
Single Variable Analysis  Understanding distributions and key statistics  summary(), table(), unique()
Time-Based Analysis       Identifying trends and patterns over time       group_by(), geom_line()
Group Comparisons         Comparing differences across categories         group_by(), summarise()

We will use hksalary_cleaned.RData to demonstrate these analytical approaches.

6.1 Single Variable Analysis

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
load("out/hksalary_cleaned.RData")
glimpse(df_clean)
Rows: 368
Columns: 4
$ year     <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2…
$ level    <chr> "Sub-degree", "Sub-degree", "Sub-degree", "Sub-degree", "Sub-…
$ category <chr> "Medicine, Dentistry and Health", "Sciences", "Engineering an…
$ salary   <dbl> 292, 125, 125, 139, 163, 122, 155, 346, 148, 154, 157, 155, 1…

Unique Values

# Unique degree levels
df_clean |> 
  distinct(level)
# Unique categories
unique(df_clean$category)
[1] "Medicine, Dentistry and Health" "Sciences"                      
[3] "Engineering and Technology"     "Business and Management"       
[5] "Social Sciences"                "Arts and Humanities"           
[7] "Education"                     
unique() vs distinct()

The unique() function takes a vector and returns a vector of its unique values, while distinct() takes a data frame and returns a data frame of its unique rows.
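The difference is easiest to see side by side. The sketch below uses a small made-up tibble; the name `toy` and its values are illustrative assumptions, not rows from df_clean.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from df_clean
toy <- tibble(
  level    = c("Sub-degree", "Sub-degree", "Undergraduate", "Undergraduate"),
  category = c("Sciences", "Education", "Sciences", "Sciences")
)

# unique() works on a vector and returns a vector
levels_vec <- unique(toy$level)
is.character(levels_vec)

# distinct() works on a data frame and returns a data frame
levels_df <- toy |> distinct(level)
is.data.frame(levels_df)

# distinct() can also deduplicate combinations of several columns
combos <- toy |> distinct(level, category)
nrow(combos)
```

Because distinct() keeps the result as a data frame, it slots into a longer pipeline, whereas unique() is handy for a quick look at one column.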

# Category distribution
df_clean |> 
  count(category) |> 
  arrange(desc(n)) 
count()

Using count() on a single variable in this dataset may not reveal much, because it only reports how many rows each category has. In other contexts, such as the “Billboard Hot100” dataset, count() is useful for counting the number of Hot 100 songs by each artist, or the number of weeks each song stayed on the chart.
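To make the chart example concrete, here is a minimal sketch on invented data. The tibble `chart`, its column names (`week`, `artist`, `song`), and every value in it are hypothetical placeholders; the real Billboard dataset may be structured differently.

```r
library(dplyr)

# Hypothetical weekly chart entries; artists and songs are invented placeholders
chart <- tibble(
  week   = c(1, 1, 2, 2, 3, 3),
  artist = c("Artist A", "Artist B", "Artist A", "Artist B", "Artist A", "Artist A"),
  song   = c("Song 1", "Song 2", "Song 1", "Song 2", "Song 1", "Song 3")
)

# Weeks on the chart for each song: each row is one weekly appearance
weeks_on_chart <- chart |>
  count(artist, song, name = "weeks", sort = TRUE)

# Number of different charting songs per artist:
# deduplicate artist-song pairs first, then count rows per artist
songs_per_artist <- chart |>
  distinct(artist, song) |>
  count(artist, name = "songs", sort = TRUE)
```

Here `sort = TRUE` does the same job as piping into arrange(desc(n)), and `name` gives the count column a more descriptive label than the default `n`.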

Salary Baseline Analysis

# Overall salary distribution
df_clean |> 
  summarise(
    avg = mean(salary),
    median = median(salary),
    top_10 = quantile(salary, 0.9)
  )

6.2 Tracking Changes: Time-Based Analysis

Salary Evolution by Year

df_clean |> 
  group_by(year) |> 
  summarise(avg_salary = mean(salary)) 

Let’s check these animations to understand how group_by() and summarise() work.

# group_by cat1
knitr::include_graphics("images/grp-summarize-01.mp4")
# group_by cat2
knitr::include_graphics("images/grp-summarize-02.mp4")
# group_by cat1 and cat2
knitr::include_graphics("images/grp-summarize-03.mp4")
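
The same idea can be sketched on a tiny table where the arithmetic is easy to check by hand: group_by() splits the rows into groups, and summarise() collapses each group to one row. The tibble `toy` and its values are made up for illustration and are not part of df_clean.

```r
library(dplyr)

# Made-up salaries for illustration only
toy <- tibble(
  year   = c(2020, 2020, 2021, 2021),
  level  = c("A", "B", "A", "B"),
  salary = c(100, 200, 120, 260)
)

# One grouping variable: one output row per year
by_year <- toy |>
  group_by(year) |>
  summarise(avg_salary = mean(salary))

# Two grouping variables: one output row per year-level combination
by_year_level <- toy |>
  group_by(year, level) |>
  summarise(avg_salary = mean(salary), .groups = "drop")
```

Setting `.groups = "drop"` returns an ungrouped result, so later steps are not silently applied within each year.
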
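# Year-over-year change in average salary
df_clean |>
  group_by(year) |>
  summarise(avg_salary = mean(salary)) |>
  mutate(
    # Difference from the previous year; NA for the first year
    change = avg_salary - lag(avg_salary),
    # Change as a proportion of the previous year (multiply by 100 for percent)
    pct_change = change / lag(avg_salary)
  )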
lag() Function

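The lag() function returns the value from the previous row, so avg_salary - lag(avg_salary) gives the difference between the current and previous year’s average salary. The first year has no previous row, so its change is NA. These year-over-year differences help identify trends and changes over time.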
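
A minimal sketch of how lag() shifts values, using made-up yearly averages (the tibble `yearly` and its numbers are illustrative assumptions, not results from df_clean):

```r
library(dplyr)

# Made-up yearly averages for illustration only
yearly <- tibble(
  year       = c(2020, 2021, 2022),
  avg_salary = c(100, 110, 121)
)

yearly_change <- yearly |>
  arrange(year) |>                     # lag() depends on row order, so sort first
  mutate(
    prev       = lag(avg_salary),      # previous row's value; NA in the first row
    change     = avg_salary - prev,    # absolute change from the previous year
    pct_change = change / prev * 100   # change as a percentage of the previous year
  )
```

Both later years show a change of roughly 10 percent in this toy data, even though the absolute change grows from 10 to 11. That is why it often helps to report both figures.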

6.3 Revealing Disparities: Group Comparisons

Degree Level Comparison

df_clean |> 
  group_by(level) |> 
  summarise(avg_salary = mean(salary))

Top Earning Fields

df_clean |>
  group_by(category) |>
  summarise(avg_salary = mean(salary)) |>
  arrange(desc(avg_salary))  # highest-paying fields first

6.4 Key Functions Recap

Function     Package  Purpose                                           Example Use
-----------  -------  ------------------------------------------------  -----------
distinct()   dplyr    Returns unique rows based on specified columns    df_clean |> distinct(level)
unique()     Base R   Returns unique values in a vector                 unique(df_clean$category)
count()      dplyr    Counts frequency of unique values in a column     df_clean |> count(category) |> arrange(desc(n))
summarise()  dplyr    Computes summary statistics for variables         df_clean |> summarise(avg = mean(salary), median = median(salary), top_10 = quantile(salary, 0.9))
group_by()   dplyr    Groups data by a variable for summary operations  df_clean |> group_by(year) |> summarise(avg_salary = mean(salary))
mutate()     dplyr    Creates or modifies columns in a data frame       df_clean |> group_by(year) |> summarise(avg_salary = mean(salary)) |> mutate(change = avg_salary - lag(avg_salary), pct_change = change / lag(avg_salary))
lag()        dplyr    Returns the previous value of each element        lag(avg_salary)