library(tidyverse)
15 HK University Graduate Employment Situation
15.1 Learning Objectives
- Practice on Merging two data frames
- Learn how to use
left_join
- Learn how to use
- Practice more data analysis strategies on the “employment” data set
- Explore single variables
- Learn how to use
unique
,table
,hist
,summary
, andskim
- Practice more data visualization on the “employment” data set
- Flip the axis:
coord_flip
- Reordering the axis:
fct_reorder
- Plot with facets:
facet_wrap
- Interactive Plots:
plotly
- Flip the axis:
15.2 Set up
In this week, you will continue working with the Graduate_Employment_Stituation(Eng).csv
dataset. This dataset was downloaded from DATA.GOV.HK. It contains on employment situation of graduates of full-time UGC-funded programmes in Hong Kong. The full name of the dataset is “Employment Situation of Graduates of Full-time UGC-funded Programmes by University, Level of Study, Employment Situation and Occupation”.
15.3 Data Import and Cleaning
# Use your own path
<- read_csv("data/employment.csv") df
<- df |>
df_cleaned rename(
`year` = `Academic Year`,
`university` = University,
`level` = `Level of study`,
`employment` = `Employment Situation`,
`occupation` = Occupation,
`headcount` = `Number of Graduates (Headcount)`
|>
) mutate(year_start = str_sub(year, 1, 4)) |>
mutate(year_start = as.integer(year_start)) |>
select(year_start, university, level, employment, headcount)
glimpse(df_cleaned)
Rows: 5,669
Columns: 5
$ year_start <int> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
$ university <chr> "City University of Hong Kong", "City University of Hong Ko…
$ level <chr> "Research postgraduate", "Research postgraduate", "Research…
$ employment <chr> "FT employment", "FT employment", "FT employment", "FT empl…
$ headcount <dbl> 1, 2, 2, 1, 3, 1, 3, 1, 15, 2, 18, 8, 5, 1, 1, 97, 1, 6, 4,…
15.4 Merging two data frames
For example, if you want to know what is the “Unemployment” percentage for each university in the most recent surveyed year, you may find it is hard to get the result directly from the df_cleaned
dataset. You may need to do some data manipulation to get the result.
Let’s define the “Unemployment” percentage as the number of “Unemployed” graduates divided by the total number of graduates for each university in the most recent surveyed year.
\[ \text{Unemployment Percentage} = \frac{\text{Number of Unemployed Graduates}}{\text{Total Number of Graduates}} \]
::: {.callout-note title = “Question”} So take a minute to think about how you can get the result? Can you get the percentage from the df_cleaned
dataset directly using the group_by
and summarise
functions? and why or why not? :::
n_unemployed
<- df_cleaned |>
df_unemp filter(employment == "Unemployed") |>
group_by(university, year_start) |>
summarise(n = sum(headcount))
`summarise()` has grouped output by 'university'. You can override using the
`.groups` argument.
glimpse(df_unemp)
Rows: 111
Columns: 3
Groups: university [8]
$ university <chr> "City University of Hong Kong", "City University of Hong Ko…
$ year_start <int> 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,…
$ n <dbl> 28, 29, 30, 33, 43, 58, 80, 89, 74, 140, 65, 40, 39, 45, 17…
n_total
<- df_cleaned |>
df_total group_by(university, year_start) |>
summarise(total_n = sum(headcount))
`summarise()` has grouped output by 'university'. You can override using the
`.groups` argument.
glimpse(df_total)
Rows: 112
Columns: 3
Groups: university [8]
$ university <chr> "City University of Hong Kong", "City University of Hong Ko…
$ year_start <int> 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018,…
$ total_n <dbl> 2667, 2734, 2850, 2972, 3067, 3090, 3441, 3592, 3421, 3218,…
Calculate the percentage and create a new dataframe
Merge the two dataframes
Notice that df_unemp and df_total have the same columns “university” and “year_start”. You can use the left_join
function to merge the two dataframes by these two columns.
However, df_unemp has 111 rows and df_total has 112 rows. This is because there is one university in df_unemp that is not in df_total. You can use the anti_join
function to find out which university is not in df_total.
table(df_unemp$university, df_unemp$year_start)
table(df_total$university, df_total$year_start)
So here we found the data for “The Education University of Hong Kong” in 2009 is missing. So we can remove the 2009 from the dataset just for the purpose of this analysis.
<- df_unemp |>
df_unemp filter(year_start != 2009)
<- df_total |>
df_total filter(year_start != 2009)
<- df_unemp |>
df_percentage left_join(df_total, by = c("university", "year_start")) |>
mutate(percentage = n / total_n * 100)
df_percentage
15.5 More on data explore
So far, you have learnt how to explore a dataset by looking at the first few rows, the column details. Often, you may want to explore the dataset in more detail. Especially for columns/variables. You should notice that for different types of columns, you may need to use different strategies and functions to explore them. For example, for a numeric
column, you may want to know the mean
, median
, min
, max
, etc. For a character
column, you may want to know the unique values, the frequency of each unique value, etc.
unique()
For example, in the df
dataset, you may want to explore the Occupation
column. You can use the unique()
function to get all the unique values in the Occupation
column.
unique()
is a function that returns a vector of the unique values in a vector or data frame column. It is provided by the base
R, so you don’t need to load any additional packages to use it.
unique(df$University)
[1] "City University of Hong Kong"
[2] "Hong Kong Baptist University"
[3] "Lingnan University"
[4] "The Chinese University of Hong Kong"
[5] "The Education University of Hong Kong"
[6] "The Hong Kong Polytechnic University"
[7] "The Hong Kong University of Science and Technology"
[8] "The University of Hong Kong"
table()
You may also want to know the frequency of each unique value in the Occupation
column. You can use the table()
function to get the frequency of each unique value in the Occupation
column.
table()
is a function that returns the frequency of each unique value in a vector or data frame column. It is provided by the base
R, so you don’t need to load any additional packages to use it.
table(df$Occupation)
Accountants and Auditors
134
Agriculture and Fishery Workers
34
Aircraft and Ship Officers
74
Architects and Surveyors
153
Artists
159
Authors, Journalists and Related Writers
182
Business Professionals
259
Clerical and Related Workers
175
Economists, Statisticians and Mathematicians
171
Engineering Technicians
159
Engineers
193
Government Administrators and Diplomats
33
Government Officials (General Grades)
161
Insurance, Real Estate, and Business Services
151
Legal Workers
124
Managers
199
Medical and Health Workers
195
Not in Full-time Employment
1186
Occupations not elsewhere classified
49
Other Professionals and Technical Workers
205
Other Service Workers
136
Physical and Life Science Technicians
155
Physical and Life Scientists
159
Production and Related Workers
76
Protective Service Workers
138
Religious Workers
69
Sales/Service Workers
155
System Analysts and Computer Programmers
199
Teaching Profession
315
Workers not reporting any occupation
210
Workers reporting occupations unidentifiable or inadequately described
61
You can use table() to also get the frequency of unique value for a combination of two columns. For example, if you want to know the frequency of each unique value in the Occupation
column for each University
, you can use the following code: table(df$University, df$Occupation)
.
extension
Can you use group_by
and summarise
to get the same result?
|>
df group_by(Occupation) |>
summarise(n = n())
summary()
Notice that the unique
and table
functions are particularly useful for character
columns. For numeric
columns, the exploration can be very straightforward, such as getting the mean
, median
, min
, max
, etc. For example, if you want to explore the percentage
column in the df_percentage
dataset.
mean(df_percentage$percentage)
[1] 1.69358
median(df_percentage$percentage)
[1] 1.435754
min(df_percentage$percentage)
[1] 0.240096
max(df_percentage$percentage)
[1] 4.75382
Also, there is a very handy function called summary()
that can give you a summary of the numeric
column. It provides the min
, 1st Qu
, median
, mean
, 3rd Qu
, max
for the numeric
column.
summary(df_percentage$percentage)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2401 1.0485 1.4358 1.6936 2.2880 4.7538
1st Qu
and 3rd Qu
are the first and third quartiles, respectively. They are the values that divide the data into four equal parts. The median
is the middle value of the data. The min
and max
are the minimum and maximum values of the data, respectively.
hist()
For a numeric
column, you may also want to know the distribution of the data. You can use the hist()
function to plot a histogram of the data.
hist(df_percentage$percentage)
::: {.callout-note title = “Question”} Can you dont use hist()
on a character
column to get the distribution of the data, and why? :::
skim
package
Actually, there are some packages that can help you explore the data more efficiently. For example, the skimr
package provides a more comprehensive summary of the data.
First, lets load the skimr
package.
# install the package if you haven't installed it
# run install.packages("skim") in console, don't run this in the R markdown
library(skimr)
Then, you can simply run the skim()
function on the dataset to get a comprehensive summary of the data. Let’s try the skim()
function on the df
dataset.
skim(df)
Name | df |
Number of rows | 5669 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 5 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Academic Year | 0 | 1 | 7 | 7 | 0 | 14 | 0 |
University | 0 | 1 | 18 | 50 | 0 | 8 | 0 |
Level of study | 0 | 1 | 10 | 21 | 0 | 4 | 0 |
Employment Situation | 0 | 1 | 6 | 15 | 0 | 5 | 0 |
Occupation | 0 | 1 | 7 | 70 | 0 | 31 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Number of Graduates (Headcount) | 0 | 1 | 51.34 | 107.38 | 1 | 2 | 11 | 46 | 1014 | ▇▁▁▁▁ |
You will see the output including three pages: overall description, two one_skim_df
. - The overall description provides the summary of the dataset, including the number of rows, columns, missing values, etc. - The second page page shows details for character columns. The useful information includes the number of missing values, the number of unique values. You may find this page also shows information about min
and max
, however, this is not about the min and max of the data, but the min and max of the length of the string in the column. - The third page shows a description of all the numeric
columns. You will find in additional to n_missing, it also provides the mean
, sd
, p0
, p25
, p50
, p75
.
15.6 Data visualization
Bar Plot
- Create a bar plot to show the percentage of graduates who are “Unemployed” for each university in the most recent surveyed year (2022). Use the
df_percentage
dataframe.
|>
df_percentage filter(year_start == 2022) |>
ggplot(aes(x = university, y = percentage)) +
geom_col() +
labs(title = "Unemployment Percentage by University in 2022/23",
x = "University",
y = "Percentage")
Bar Plot + coord_flip()
|>
df_percentage filter(year_start == 2022) |>
ggplot(aes(x = university, y = percentage)) +
geom_col() +
1coord_flip() +
labs(title = "Unemployment Percentage by University in 2022/23",
x = " ",
y = "Percentage (%)")
- 1
-
coord_flip()
is a function that flips the x and y axes. It is useful when you have a lot of categories on the x-axis and the labels are long, so it is better to put them on the y-axis.
reordering the bars
|>
df_percentage filter(year_start == 2022) |>
1ggplot(aes(x = fct_reorder(university, percentage), y = percentage)) +
geom_col() +
coord_flip() +
labs(title = "Unemployment Percentage by University in 2022/23",
x = " ",
y = "Percentage (%)")
- 1
-
fct_reorder()
is a function that reorders the levels of a factor based on a summary statistic. In this case, we are reordering the levels of theuniversity
factor based on thepercentage
column.
fct_reorder()
is a function provided by the forcats
package, which is part of the tidyverse
. You don’t need to load the forcats
package explicitly because it is loaded automatically when you load the tidyverse
.
add labels
|>
df_percentage filter(year_start == 2022) |>
ggplot(aes(x = fct_reorder(university, percentage), y = percentage)) +
geom_col() +
1geom_text(aes(label = round(percentage, 1)), hjust = -0.1) +
coord_flip() +
labs(title = "Unemployment Percentage by University in 2022/23",
x = "University",
y = "Percentage")
- 1
-
geom_text()
is a function that adds text labels to the plot. In this case, we are adding thepercentage
values to the plot. Thehjust
argument controls the horizontal justification of the text. A value of -0.1 means the text is slightly to the right of the bar.
Line Plot
- Create a line plot to show the percentage of Unemployed graduates for HKU over the years.
|>
df_percentage filter(university == "The University of Hong Kong") |>
ggplot(aes(x = year_start, y = percentage)) +
geom_line() +
1scale_x_continuous(breaks = seq(2009, 2022, 1)) +
labs(title = "Unemployment Percentage of HKU Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
theme_bw()
- 1
-
scale_x_continuous()
is a function that controls the breaks (tick marks) on the x-axis. In this case, we are setting the breaks to be every year from 2009 to 2022.
Line Plot with Multiple Lines
|>
df_percentage ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(2009, 2022, 1)) +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = " ",
y = "Percentage (%)") +
theme_bw() +
1theme(legend.position = "bottom")
- 1
-
theme(legend.position = "bottom")
is a function that moves the legend to the bottom of the plot. The default position is at the right of the plot.
Within the theme()
function, you can use the legend.position
argument to control the position of the legend. The possible values are “none”, “left”, “right”, “top”, “bottom”, “topright”, “topleft”, “bottomright”, “bottomleft”.
Plot with Facets
|>
df_percentage ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
scale_x_continuous(breaks = seq(2009, 2022, 2)) +
1facet_wrap(~university) +
theme_bw() +
theme(legend.position = "none")
- 1
-
facet_wrap(~university)
is a function that creates a separate plot for each level of theuniversity
factor. In this case, we are creating a separate plot for each university. The~
symbol is used to specify the variable that you want to facet by.
Save the plots
|>
df_percentage ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
scale_x_continuous(breaks = seq(2009, 2022, 2)) +
facet_wrap(~university) +
theme_bw() +
theme(legend.position = "none")
- 1
-
ggsave()
is a function that saves the plot to a file. In this case, we are saving the plot as a PNG file with a width of 10 inches, a height of 6 inches, and a resolution of 300 dpi.
1# ggsave("line_plot.png", width = 10, height = 6, dpi = 300)
dpi
stands for dots per inch, which is the resolution of the image. The higher the dpi, the higher the resolution of the image. The default dpi is 300, which is a good resolution for most purposes.
Arrange multiple plots
Sometimes, you may want to put multiple plots together for comparison or presentation. You can use the gridExtra
package to arrange multiple plots on a single page. Install the gridExtra
package if you haven’t installed it yet.
library(gridExtra)
Here you can created two plots p1
and p2
and put them together using the grid.arrange()
function.
The code grid.arrange(p1, p2, ncol = 2)
means to put p1
and p2
together in two columns.
<- df_percentage |>
p1 ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
scale_x_continuous(breaks = seq(2009, 2022, 2)) +
facet_wrap(~university) +
theme_bw() +
theme(legend.position = "none")
<- df_percentage |>
p2 ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
scale_x_continuous(breaks = seq(2009, 2022, 2)) +
theme_bw() +
theme(legend.position = "none")
# put p1 and p2 together
grid.arrange(p1, p2, ncol = 2)
Interactive Plot with plotly
When you plot using ggplot2
, you can easily convert it to an interactive plot using the plotly
package. This is super useful when you want to share your plot online or in a presentation, especially when the plot involves multiple lines or facets.
Install the plotly
package if you haven’t installed it yet. (Run install.packages("plotly")
in the console)
library(plotly)
For example, here we first create a line plot using ggplot2
, then convert it to an interactive plot using ggplotly()
.
<- df_percentage |>
p ggplot(aes(x = year_start, y = percentage, color = university)) +
geom_line() +
labs(title = "Unemployment Percentage of Graduates Over the Years",
x = "Year",
y = "Percentage (%)") +
scale_x_continuous(breaks = seq(2009, 2022, 2)) +
theme_bw() +
theme(legend.position = "none")
p
Then, you can convert the plot to an interactive plot using the ggplotly()
function. This interactive plot provides additional features such as zooming, panning, auto-scaling, etc.
ggplotly(p)