In this chapter, we will begin analyzing data to answer questions from a journalistic perspective. I have categorized the questions into three types, each with a corresponding analytical approach:
| Question Type | Story Angle | Example Functions |
| --- | --- | --- |
| Single Variable Analysis | Understanding distributions and key statistics | summary(), table(), unique() |
| Time-Based Analysis | Identifying trends and patterns over time | group_by(), geom_line() |
| Group Comparisons | Comparing differences across categories | group_by(), summarise() |
We will use the hksalary_cleaned.RData file to demonstrate these analytical approaches.
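As a rough sketch of how an .RData file is read into a session, the example below saves a toy data frame to a temporary file and restores it with load(). The object name `salary_toy` and its columns are hypothetical stand-ins; the objects actually stored in hksalary_cleaned.RData may have different names and structure.

```r
# Toy stand-in for the real file: the object name `salary_toy` and its
# columns are hypothetical, not the contents of hksalary_cleaned.RData.
salary_toy <- data.frame(
  year = c(2020, 2021, 2022),
  avg_salary = c(18000, 18500, 19200)
)

# Save to a temporary .RData file, then remove the object from the session.
tmp_path <- tempfile(fileext = ".RData")
save(salary_toy, file = tmp_path)
rm(salary_toy)

# load() restores objects under their saved names and invisibly returns
# those names as a character vector, so we can see what the file contained.
loaded_names <- load(tmp_path)
print(loaded_names)

# Quick structural check of the restored data frame.
str(salary_toy)
```

Capturing the return value of load() is a convenient way to find out which objects a file added to the workspace before analyzing them.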
6.1 Single Variable Analysis
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Applying count() to a single variable in this dataset may not yield meaningful insights, because it only tallies how often each value occurs. In other contexts, such as the “Billboard Hot100” dataset, count() is useful for counting the number of Hot100 songs by each artist or the number of weeks each song spent on the chart.
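To make the chart use case concrete, here is a minimal sketch with made-up data. The tibble `hot100_toy` and its columns (`artist`, `song`, `week`) are assumptions for illustration and do not reflect the layout of the actual Billboard dataset.

```r
library(dplyr)

# Hypothetical chart data: one row per song per week on the chart.
hot100_toy <- tibble(
  artist = c("Artist A", "Artist A", "Artist A", "Artist B", "Artist B", "Artist C"),
  song   = c("Song 1", "Song 1", "Song 2", "Song 3", "Song 3", "Song 4"),
  week   = c(1, 2, 1, 1, 2, 1)
)

# Weeks on the chart per song: each row represents one chart week,
# so counting rows per (artist, song) gives the number of weeks.
weeks_on_chart <- hot100_toy %>%
  count(artist, song, name = "weeks", sort = TRUE)

# Number of distinct charting songs per artist: drop repeated weeks
# first, then count the remaining rows per artist.
songs_per_artist <- hot100_toy %>%
  distinct(artist, song) %>%
  count(artist, name = "songs", sort = TRUE)

print(weeks_on_chart)
print(songs_per_artist)
```

The distinct() step matters in the second query: without it, count(artist) would return chart weeks per artist, not songs per artist.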
The lag() function returns the value from the previous row. With the data sorted by year, subtracting the lagged value from the current year’s average salary gives the year-over-year change, which helps identify trends and changes over time.
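A minimal sketch of this calculation follows. The tibble `avg_by_year` and its numbers are invented for illustration and are not values from the salary dataset; the column names `prev_salary`, `change`, and `pct_change` are likewise hypothetical.

```r
library(dplyr)

# Hypothetical yearly averages; not values from the hksalary dataset.
avg_by_year <- tibble(
  year = c(2019, 2020, 2021, 2022),
  avg_salary = c(17000, 17500, 17400, 18200)
)

salary_change <- avg_by_year %>%
  arrange(year) %>%                          # lag() depends on row order
  mutate(
    prev_salary = lag(avg_salary),           # NA for the first year
    change = avg_salary - prev_salary,       # absolute year-over-year change
    pct_change = 100 * change / prev_salary  # relative change in percent
  )

print(salary_change)
```

The first year has no predecessor, so its `change` is NA. When the data contain several groups (for example, several occupations), calling group_by() before mutate() keeps lag() from carrying a value across group boundaries.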