3 Data Import

3.1 Importing Data in R

Built-in Datasets from Packages

R ships with several built-in datasets, and many packages bundle their own. Here are some popular ones:

Package	Key Datasets	Load Command
`datasets`	mtcars, iris	None needed (attached by default)
`ggplot2`	diamonds, mpg	`library(tidyverse)`
`nycflights13`	flights, weather	`library(nycflights13)` (after `install.packages("nycflights13")`)
`gapminder`	gapminder	`library(gapminder)`

Usage Example:

mtcars
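As a small sketch of how the built-in datasets can be explored, the snippet below lists what the `datasets` package provides and peeks at `mtcars`. It uses only base R; the variable name `available` is just an illustrative choice.

```r
# List the datasets bundled with the `datasets` package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Built-in datasets can be used by name without loading anything
head(mtcars, n = 3)   # first three rows
dim(mtcars)           # number of rows and columns

# Datasets from other packages can be reached with `::` once the
# package is installed, e.g. ggplot2::diamonds (not run here)
```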

Downloading External Data

From TidyTuesday

#install.packages("tidytuesdayR")
library(tidytuesdayR)

# Load 2024 Olympics data
tuesdata <- tt_load('2024-08-06')  
olympics <- tuesdata$olympics

Direct from URL

library(tidyverse)

# Hong Kong graduates salary data
data_url <- "https://www.ugcs.gov.hk/datagovhk/Average_Annual_Salaries_FT_Employment(Eng).csv"
hksalary_download <- read_csv(data_url)

hksalary_download

Local File Import

In data analysis projects, importing local files is more common than importing data from the web. Here are some common file types and their uses:

CSV Files: Simple, human-readable, and widely supported. Ideal for tabular data.
Excel Files: Used for spreadsheets with multiple sheets or formatting. Imported with readxl or openxlsx.
SPSS, SAS, Stata Files: Common in social science and survey research. Imported with specialized R packages such as haven.
RDS Files: Binary format for storing R objects, preserving their structure and class information.
RData Files: Binary format for saving multiple R objects in a single file, often used for workspaces.
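To make the difference between a text format and an R-specific binary format concrete, here is a minimal sketch that round-trips a toy data frame through an RDS file and a CSV file. The data frame `toy` and its values are made up for illustration; only base R functions are used.

```r
# Toy data frame with a factor in a custom level order (made-up values)
toy <- data.frame(
  level = factor(c("Sub-degree", "Undergraduate"),
                 levels = c("Undergraduate", "Sub-degree")),
  salary = c(100, 200)
)

# Round trip through an RDS file: the factor and its level order survive
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
from_rds <- readRDS(rds_path)
class(from_rds$level)
levels(from_rds$level)

# Round trip through a CSV file: the column comes back as plain text
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
class(from_csv$level)
```

CSV remains the better choice for sharing data with people who do not use R, since any spreadsheet or text editor can open it.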

Note

For this course, we will focus on CSV files, as they are simple and widely used.

CSV Files with `readr` (`tidyverse`) package

First, we download the CSV file from the web and save it in the project's data folder as hksalary.csv. Then, we import it using the read_csv() function from the readr package.

read_csv() vs. read.csv()

read_csv() from readr is generally faster than read.csv() from base R, returns a tibble, and keeps column names exactly as they appear in the file. In this course, we recommend using read_csv() for CSV files.

# Relative path (recommended)
hksalary <- read_csv("data/hksalary.csv")

Rows: 368 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Academic Year, Level of Study, Broad Academic Programme Category
dbl (1): Average Annual Salary (HK$'000)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

hksalary
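The difference between read_csv() and read.csv() noted earlier can be seen on a tiny file. The sketch below writes a two-row CSV to a temporary location (the values are made up, not taken from the real dataset) and reads it both ways.

```r
library(readr)

# Write a tiny illustrative CSV to a temporary file (made-up values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Academic Year,Average Salary",
             "2009/10,100",
             "2010/11,110"), tmp)

base_df <- read.csv(tmp)                          # base R
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr

class(base_df)   # a plain data.frame
names(base_df)   # spaces replaced with dots: "Academic.Year" "Average.Salary"

class(tidy_df)   # a tibble (class includes "tbl_df")
names(tidy_df)   # names kept as written: "Academic Year" "Average Salary"
```

Because read_csv() keeps names with spaces, such columns need backticks when referenced in code, for example `Academic Year`.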

File Path Management

Path Type	Example	When to Use
Relative	`data/hksalary.csv`	Default in projects
Absolute	`C:/Users/.../hksalary.csv`	One-off analysis on a single machine

Relative Path

Use relative paths for portability: the project keeps working when its folder is moved or shared, because no machine-specific directory is hardcoded.
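A relative path is resolved against the current working directory, which in a project is normally the project folder. The following base R sketch shows how that can be checked; it assumes the `data/hksalary.csv` layout used in this chapter.

```r
# Where R is currently looking for files
getwd()

# Build the relative path in a platform-independent way
csv_path <- file.path("data", "hksalary.csv")
csv_path

# Check that the file is there before trying to read it
file.exists(csv_path)

# See which absolute path the relative path resolves to on this machine
normalizePath(csv_path, mustWork = FALSE)
```

If file.exists() returns FALSE, the usual causes are a working directory that is not the project folder or a file saved somewhere other than the data subfolder.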

3.2 Data Inspection

After importing data, it’s essential to inspect it to understand its structure and contents. Here are some common functions to help you get started:

First Look Tools

head(): Shows the first few rows of the dataset.

# First 6 rows
head(hksalary)

glimpse(): Provides a concise summary of the dataset’s structure.

# Tidyverse alternative to str()
glimpse(hksalary)

Rows: 368
Columns: 4
$ `Academic Year`                     <chr> "2009/10", "2009/10", "2009/10", "…
$ `Level of Study`                    <chr> "Sub-degree", "Sub-degree", "Sub-d…
$ `Broad Academic Programme Category` <chr> "Medicine, Dentistry and Health", …
$ `Average Annual Salary (HK$'000)`   <dbl> 292, 125, 125, 139, 163, 122, 155,…

summary(): Displays a statistical summary of the dataset.

summary(hksalary)

 Academic Year      Level of Study     Broad Academic Programme Category
 Length:368         Length:368         Length:368                       
 Class :character   Class :character   Class :character                 
 Mode  :character   Mode  :character   Mode  :character                 
                                                                        
                                                                        
                                                                        
 Average Annual Salary (HK$'000)
 Min.   :120.0                  
 1st Qu.:206.5                  
 Median :269.0                  
 Mean   :283.8                  
 3rd Qu.:350.5                  
 Max.   :714.0

Function	Output Focus	Tidyverse Equivalent
`head()`	Top rows	`slice_head()`
`str()`	Data types & structure	`glimpse()`
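The pairs in the table can be compared side by side. This sketch uses the built-in `mtcars` dataset so it runs without any downloaded file; it assumes the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr)

# Top rows: base R and tidyverse versions
first_base <- head(mtcars, n = 3)
first_tidy <- slice_head(mtcars, n = 3)

# Structure: base R and tidyverse versions
str(mtcars)
glimpse(mtcars)
```

Both members of each pair report the same information; the tidyverse versions mainly differ in layout and in fitting naturally into pipelines.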

3.3 Practice: Import `hksalary.csv` Data

Step-by-Step Practice

Setup Workspace
- Create an import-practice project
- Make a data subfolder inside the project

Store Data
- Download the Hong Kong Graduates Annual Salary Data
- Save it as hksalary.csv in the data subfolder
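The folder and download steps can also be done from R instead of by hand. Below is one possible sketch using base R only; `fetch_to_data()` is a hypothetical helper written for this example, not a function from any package, and the usage lines assume the URL shown in section 3.1 is still valid.

```r
# Hypothetical helper: create the data folder if needed,
# then download a file into it
fetch_to_data <- function(url, data_dir = "data", file_name = "hksalary.csv") {
  dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(data_dir, file_name)
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest  # return the path so it can be passed to read_csv()
}

# Usage (needs internet access), with the URL from section 3.1:
# data_url <- "https://www.ugcs.gov.hk/datagovhk/Average_Annual_Salaries_FT_Employment(Eng).csv"
# fetch_to_data(data_url)
```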

Import Data

library(tidyverse)
hksalary <- read_csv("data/hksalary.csv")

Rows: 368 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Academic Year, Level of Study, Broad Academic Programme Category
dbl (1): Average Annual Salary (HK$'000)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
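The message above suggests two ways to silence it. As a minimal sketch of the first, the example below declares the column types up front with `col_types`. It reads a small temporary file whose rows are made up for illustration; only the column names follow the real dataset, and one column is left out to keep the example short.

```r
library(readr)

# Tiny stand-in file (made-up rows, three of the real column names)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Academic Year,Level of Study,Average Annual Salary (HK$'000)",
  "2009/10,Sub-degree,100",
  "2010/11,Sub-degree,110"
), tmp)

# Declaring the types explicitly removes the guessing step and the message
toy <- read_csv(
  tmp,
  col_types = cols(
    `Academic Year` = col_character(),
    `Level of Study` = col_character(),
    `Average Annual Salary (HK$'000)` = col_double()
  )
)

spec(toy)  # shows the column specification that was applied
```

The second option, `show_col_types = FALSE`, keeps the automatic type guessing and only hides the message.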

Initial Inspection
```
glimpse(hksalary)
summary(hksalary)
```

3.4 Key Functions Recap

Task	Function	Example
Load package	`library()`	`library(tidyverse)`
Read CSV	`read_csv()`	`read_csv("data/file.csv")`
View structure	`glimpse()`/`str()`	`glimpse(df)`
Show first rows	`head()`	`head(df, n = 10)`
Statistical summary	`summary()`	`summary(df$salary)`
