With your data now cleaned and ready, it’s time to ask the questions that matter for journalism. Data analysis isn’t about running tests—it’s about discovering why something matters for your audience.
What you’ll learn:
- Transform raw statistics into story angles
- Recognize patterns that signal newsworthy trends
- Compare groups to find inequality or change
- Use data exploration to generate questions instead of just reporting answers
In this part, you’ll work with real Hong Kong employment data and global health statistics to see how analytical techniques reveal different story angles from the same dataset.
In this chapter, we will begin analyzing data to answer questions from a journalistic perspective. I have categorized the questions into three types, each with a corresponding analytical approach:
| Question Type | Story Angle | Example Functions |
|---|---|---|
| Single Variable Analysis | Understanding distributions and key statistics | summary(), table(), unique() |
| Time-Based Analysis | Identifying trends and patterns over time | group_by(), geom_line() |
| Group Comparisons | Comparing differences across categories | group_by(), summarise() |
We will use the cleaned Hong Kong salary data saved in hksalary_cleaned.RData to demonstrate these analytical approaches.
6.1 Single Variable Analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
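As a quick orientation to the single-variable functions named in the table above, here is a minimal sketch on made-up data. The data frame `salary` and its columns `occupation` and `monthly_salary` are hypothetical stand-ins, since the actual columns of hksalary_cleaned.RData are not shown here.

```r
# Hypothetical stand-in data; the real hksalary_cleaned.RData columns may differ.
salary <- data.frame(
  occupation = c("Clerk", "Clerk", "Manager", "Technician", "Manager", "Clerk"),
  monthly_salary = c(15000, 16000, 42000, 21000, 45000, 15500),
  stringsAsFactors = FALSE
)

# summary(): minimum, quartiles, median, mean and maximum of a numeric variable
salary_summary <- summary(salary$monthly_salary)
print(salary_summary)

# table(): how often each category appears
occupation_counts <- table(salary$occupation)
print(occupation_counts)

# unique(): the distinct values, in order of first appearance
occupations <- unique(salary$occupation)
print(occupations)
```

For a journalist, the gap between the median and the maximum in `summary()` is often the first hint of a story about inequality, while `table()` and `unique()` show which categories exist and how well each is represented.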
Using count() on a single variable in this dataset may not provide meaningful insights, because it only reports how often each category appears. In other contexts, such as the “Billboard Hot 100” dataset, count() is more useful: it can count the number of Hot 100 songs by each artist, or the number of weeks each song spent on the chart.
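The two Billboard-style uses of count() can be sketched as follows. The tibble `hot100` and its columns `artist`, `song`, and `week` are invented for illustration and assume one row per song per chart week; a real chart dataset may be structured differently.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart.
hot100 <- tibble(
  artist = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B", "Artist C"),
  song   = c("Song 1", "Song 1", "Song 2", "Song 3", "Song 3", "Song 4"),
  week   = c(1, 2, 1, 1, 2, 1)
)

# Weeks on the chart per song: each row is one week, so counting rows is enough.
weeks_per_song <- hot100 %>%
  count(artist, song, name = "weeks_on_chart", sort = TRUE)

# Songs per artist: keep one row per song first, then count.
songs_per_artist <- hot100 %>%
  distinct(artist, song) %>%
  count(artist, name = "n_songs", sort = TRUE)

print(weeks_per_song)
print(songs_per_artist)
```

Note the `distinct()` step in the second query: without it, count() would tally chart weeks rather than songs, which answers a different question.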
The lag() function returns the value from the previous row. Subtracting the lagged value from the current year’s average salary gives the year-over-year change, which helps identify trends and turning points over time.
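A minimal sketch of this year-over-year calculation is shown here. The tibble `yearly` and its figures are made up for illustration and are not taken from the Hong Kong dataset; the column names `year` and `avg_salary` are assumptions.

```r
library(dplyr)

# Hypothetical yearly averages; not figures from the Hong Kong dataset.
yearly <- tibble(
  year = c(2019, 2020, 2021, 2022),
  avg_salary = c(18000, 18500, 18200, 19000)
)

salary_change <- yearly %>%
  arrange(year) %>%                              # lag() depends on row order
  mutate(
    prev_salary = lag(avg_salary),               # previous year's value (NA for the first year)
    change = avg_salary - prev_salary,           # absolute change
    pct_change = change / prev_salary * 100      # percentage change
  )

print(salary_change)
```

The first year has no previous value, so its change is NA rather than zero. Sorting with `arrange()` before calling lag() matters, because lag() simply looks one row up and knows nothing about dates.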