3  Data Import

3.1 Importing Data in R

Built-in Datasets from Packages

R and its packages ship with a variety of built-in datasets that can be loaded directly. Here are some popular ones:

Package        Key Datasets       Load Command
datasets       mtcars, iris       Built-in (attached by default)
ggplot2        diamonds, mpg      library(tidyverse)
nycflights13   flights, weather   install.packages("nycflights13"), then library(nycflights13)
gapminder      gapminder          library(gapminder)

Usage Example:

mtcars
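
As a minimal sketch of how these datasets can be reached, the snippet below lists what the datasets package provides and pulls diamonds from ggplot2 without attaching the whole package. The requireNamespace() guard is an addition for illustration only, so that the code also runs where ggplot2 is not installed.

```r
# Names of the datasets that ship with the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Built-in datasets can be used by name straight away
dim(mtcars)

# A dataset from another package can be accessed with pkg::name,
# provided that package is installed (guarded here as an assumption)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  diamonds <- ggplot2::diamonds
  dim(diamonds)
}
```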

Downloading External Data

From TidyTuesday

# install.packages("tidytuesdayR")  # run once if the package is not installed
library(tidytuesdayR)

# Load 2024 Olympics data
tuesdata <- tt_load('2024-08-06')  
olympics <- tuesdata$olympics

Direct from URL

library(tidyverse)

# Hong Kong graduates salary data

data_url <- "https://www.ugcs.gov.hk/datagovhk/Average_Annual_Salaries_FT_Employment(Eng).csv"
hksalary_download <- read_csv(data_url)

hksalary_download

Local File Import

In data analysis projects, importing local files is more common than importing data from the web. Here are some common file types and their uses:

  • CSV Files: Simple, human-readable, and widely supported. Ideal for tabular data.
  • Excel Files: Used for spreadsheets with multiple sheets or formatting. Imported with readxl or openxlsx.
  • SPSS, SAS, Stata Files: Common in social science and survey research. Imported with specialized packages such as haven.
  • RDS Files: Binary format for storing R objects, preserving their structure and class information.
  • RData Files: Binary format for saving multiple R objects in a single file, often used for workspaces.

Note

For this course, we will focus on CSV files, as they are simple and widely used.
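
Although the course focuses on CSV files, the difference between the RDS and RData formats listed above is easier to see in code. The following is a minimal sketch using only base R; the object names (small_cars, small_iris) and the temporary file paths are made up for illustration.

```r
# Temporary files keep the example self-contained
rds_path   <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# RDS: one object per file; the name is chosen when reading it back
small_cars <- head(mtcars)
saveRDS(small_cars, rds_path)
cars_restored <- readRDS(rds_path)
identical(small_cars, cars_restored)

# RData: several objects per file; load() restores them under their saved names
small_iris <- head(iris)
save(small_cars, small_iris, file = rdata_path)
rm(small_cars, small_iris)
restored_names <- load(rdata_path)
restored_names

# Class information survives the round trip (Species is still a factor)
class(small_iris$Species)
```

Because readRDS() returns the object instead of restoring it under a fixed name, RDS files tend to be the safer choice when a single dataset is passed between scripts.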

CSV Files with the readr (tidyverse) Package

First, we download the CSV file from the web and save it locally as hksalary.csv. Then, we import it using the read_csv() function from the readr package.
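
The download step can also be scripted. The sketch below is one possible approach, not necessarily how the file was saved for this course: fetch_if_missing() is a hypothetical helper built on base R's download.file(), and it skips the download when the local copy already exists.

```r
# Hypothetical helper (not part of readr): download a file only if it is
# not already saved locally, creating the target folder when needed
fetch_if_missing <- function(url, dest) {
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

data_url <- "https://www.ugcs.gov.hk/datagovhk/Average_Annual_Salaries_FT_Employment(Eng).csv"

# Uncomment to download into the project's data folder
# fetch_if_missing(data_url, "data/hksalary.csv")
```

Keeping the download behind a file.exists() check means the script can be rerun without fetching the file from the web each time.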

read_csv() vs. read.csv()

read_csv() from readr is generally faster than read.csv() from base R, returns a tibble, and keeps column names exactly as they appear in the file. In this course, we recommend using read_csv() for CSV files.
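
One visible difference between the two functions is how they treat column names that contain spaces. The sketch below writes a one-row CSV to a temporary file (the header and values are taken from the first two columns of the salary data) and reads it with both functions; it assumes readr 2.0 or later for the show_col_types argument.

```r
library(readr)

# Write a tiny CSV whose header contains spaces
tmp <- tempfile(fileext = ".csv")
writeLines(c("Academic Year,Level of Study", "2009/10,Sub-degree"), tmp)

base_df <- read.csv(tmp)                          # base R
tidy_df <- read_csv(tmp, show_col_types = FALSE)  # readr

# read.csv() rewrites the names into syntactic ones by default
names(base_df)

# read_csv() keeps the names as they appear in the file
names(tidy_df)

# read.csv() returns a plain data frame; read_csv() returns a tibble
class(base_df)
class(tidy_df)
```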

# Relative path (recommended)
hksalary <- read_csv("data/hksalary.csv")
Rows: 368 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Academic Year, Level of Study, Broad Academic Programme Category
dbl (1): Average Annual Salary (HK$'000)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hksalary

File Path Management

Path Type   Example                     When to Use
Relative    data/hksalary.csv           Default in projects
Absolute    C:/Users/.../hksalary.csv   Temporary analysis

Relative Path

Use relative paths for portability and to avoid hardcoding directory paths.
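
A relative path is resolved against the current working directory, so it helps to check both before importing. The following is a small sketch using base R only; the message text is illustrative.

```r
# Build the relative path from its parts; file.path() joins them with "/"
data_path <- file.path("data", "hksalary.csv")
data_path

# Relative paths are resolved against the current working directory
getwd()

# Check that the file is where the script expects it before importing
if (!file.exists(data_path)) {
  message("Not found: ", data_path, " (relative to ", getwd(), ")")
}
```

Inside an RStudio project the working directory is normally the project folder, which is why data/hksalary.csv works on any machine that has the same project layout.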

3.2 Data Inspection

After importing data, it’s essential to inspect it to understand its structure and contents. Here are some common functions to help you get started:

First Look Tools

  • head(): Shows the first few rows of the dataset.
# First 6 rows
head(hksalary)
  • glimpse(): Provides a concise summary of the dataset’s structure.
# Tidyverse alternative to str()
glimpse(hksalary)  
Rows: 368
Columns: 4
$ `Academic Year`                     <chr> "2009/10", "2009/10", "2009/10", "…
$ `Level of Study`                    <chr> "Sub-degree", "Sub-degree", "Sub-d…
$ `Broad Academic Programme Category` <chr> "Medicine, Dentistry and Health", …
$ `Average Annual Salary (HK$'000)`   <dbl> 292, 125, 125, 139, 163, 122, 155,…
  • summary(): Displays a statistical summary of the dataset.
summary(hksalary)
 Academic Year      Level of Study     Broad Academic Programme Category
 Length:368         Length:368         Length:368                       
 Class :character   Class :character   Class :character                 
 Mode  :character   Mode  :character   Mode  :character                 
                                                                        
                                                                        
                                                                        
 Average Annual Salary (HK$'000)
 Min.   :120.0                  
 1st Qu.:206.5                  
 Median :269.0                  
 Mean   :283.8                  
 3rd Qu.:350.5                  
 Max.   :714.0                  

Function   Output Focus             Tidyverse Equivalent
head()     Top rows                 slice_head()
str()      Data types & structure   glimpse()
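
The base functions and their tidyverse counterparts can be tried side by side on a built-in dataset. This sketch uses mtcars in place of hksalary so that it runs without any local file, and it assumes the dplyr package is installed.

```r
library(dplyr)

# First rows: base R and tidyverse versions
first_base <- head(mtcars, n = 3)
first_tidy <- slice_head(mtcars, n = 3)
first_base
first_tidy

# Structure: both list every column with its type and first values
str(mtcars)
glimpse(mtcars)
```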

3.3 Practice: Import hksalary.csv Data

Step-by-Step Practice

  1. Setup Workspace

    • Create an import-practice project
    • Make a data subfolder inside the project
  2. Store Data

    • Save hksalary.csv in the data subfolder
  3. Import Data

    library(tidyverse)
    hksalary <- read_csv("data/hksalary.csv")
    Rows: 368 Columns: 4
    ── Column specification ────────────────────────────────────────────────────────
    Delimiter: ","
    chr (3): Academic Year, Level of Study, Broad Academic Programme Category
    dbl (1): Average Annual Salary (HK$'000)
    
    ℹ Use `spec()` to retrieve the full column specification for this data.
    ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  4. Initial Inspection

    glimpse(hksalary)
    summary(hksalary)
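
To confirm that the practice import worked, the shape reported by read_csv() can be checked in code. The sketch below defines check_import(), a hypothetical helper that is not part of any package; the expected values in the commented usage come from the import message shown above.

```r
# Hypothetical helper: stop with an error if the imported data does not
# have the expected number of rows or the expected column names
check_import <- function(df, n_rows, col_names) {
  stopifnot(
    is.data.frame(df),
    nrow(df) == n_rows,
    identical(names(df), col_names)
  )
  invisible(TRUE)
}

# Usage with the values reported above (requires hksalary to be loaded):
# check_import(
#   hksalary,
#   n_rows = 368,
#   col_names = c("Academic Year", "Level of Study",
#                 "Broad Academic Programme Category",
#                 "Average Annual Salary (HK$'000)")
# )
```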

3.4 Key Functions Recap

Task                  Function          Example
Load package          library()         library(tidyverse)
Read CSV              read_csv()        read_csv("data/file.csv")
View structure        glimpse()/str()   glimpse(df)
Show first rows       head()            head(df, n = 10)
Statistical summary   summary()         summary(df$salary)
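
In the recap table, df and salary are placeholders. Column names with spaces or symbols, such as those in hksalary, need backticks when used with $. The sketch below builds a small data frame from the first seven salary values shown in the glimpse() output in Section 3.2; the short name salary is chosen here for illustration.

```r
# First seven salary values as shown in the glimpse() output earlier
df <- data.frame(
  `Average Annual Salary (HK$'000)` = c(292, 125, 125, 139, 163, 122, 155),
  check.names = FALSE
)

# Backticks are needed because the name contains spaces and symbols
summary(df$`Average Annual Salary (HK$'000)`)

# Renaming the column to a short name allows the df$salary form
names(df)[names(df) == "Average Annual Salary (HK$'000)"] <- "salary"
summary(df$salary)
```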