Hands-on for tidyverse

Author

Kazuharu Yanagimoto

Published

January 11, 2023

In this exercise, we look at the relationship between the fertility rate and men's unpaid work.

Both indicators are available from the World Bank data: SP.DYN.TFRT.IN (fertility rate) and SG.TIM.UWRK.MA (male unpaid work).

You can load them with the WDI package.

Code
library(tidyverse)
library(WDI)

fertility <- WDI(indicator = "SP.DYN.TFRT.IN", start = 2010, end = 2019) |>
  rename(fertility = "SP.DYN.TFRT.IN")
unpaid_male <- WDI(indicator = "SG.TIM.UWRK.MA", start = 2010, end = 2019) |>
  rename(hour_unpaid = "SG.TIM.UWRK.MA")
Quarto Notebook Cache

If you would like to solve the exercises by rendering the notebook, set cache: true in the YAML header at the top, so that the downloaded data is reused between renders.

dplyr & tidyr

Q1 glimpse

Look through the data and check:

  • What kind of values the country column contains
  • How many NA values the data contain
Code
fertility |> glimpse()
Rows: 2,660
Columns: 5
$ country   <chr> "Africa Eastern and Southern", "Africa Eastern and Southern"…
$ iso2c     <chr> "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", …
$ iso3c     <chr> "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "AFE…
$ year      <int> 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010, …
$ fertility <dbl> 4.482899, 4.527707, 4.570410, 4.615671, 4.677619, 4.739863, …
Code
unpaid_male |> glimpse()
Rows: 2,660
Columns: 5
$ country     <chr> "Africa Eastern and Southern", "Africa Eastern and Souther…
$ iso2c       <chr> "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH", "ZH"…
$ iso3c       <chr> "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "AFE", "A…
$ year        <int> 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010…
$ hour_unpaid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
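glimpse() only shows the first few values, so it hints at the NA values without counting them. One possible way to count them column by column is sketched below. The tibble toy and its values are made up for illustration and stand in for unpaid_male, so the snippet runs without downloading anything.

```r
library(dplyr)

# Toy stand-in for `unpaid_male`; the values are made up for illustration.
toy <- tibble(
  country     = c("A", "A", "B", "B"),
  year        = c(2018L, 2019L, 2018L, 2019L),
  hour_unpaid = c(NA, 9.5, NA, NA)
)

# Count the NA values in every column at once.
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Replacing toy with fertility or unpaid_male should give the counts for the real data.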
Q2 filter

Let’s focus on OECD countries. Using the vector oecd defined below, keep only the OECD member countries.

Code
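# Install from GitHub first if the package is missing:
#remotes::install_github("caldwellst/whotilities")
# oecd is matched against the iso3c column in the next step.
oecd <- whoville::oecd_member_states()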
Code
fertility_oecd <- fertility |>
  filter(iso3c %in% oecd)

unpaid_male_oecd <- unpaid_male |>
  filter(iso3c %in% oecd)
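After a filter with %in%, it can be worth checking whether every code in the vector actually matched a row. The following is a minimal sketch of that check. The names codes and toy and all values in them are hypothetical, not taken from the World Bank data.

```r
library(dplyr)

# Hypothetical codes and data, for illustration only.
codes <- c("AAA", "BBB", "CCC")
toy <- tibble(
  iso3c = c("AAA", "AAA", "BBB", "ZZZ"),
  value = c(1, 2, 3, 4)
)

# Keep only the rows whose code appears in `codes`.
toy_kept <- toy |>
  filter(iso3c %in% codes)

# Codes that never matched a row (typos or missing countries show up here).
unmatched <- setdiff(codes, toy_kept$iso3c)

unmatched
```

With the real data, the analogous call would be setdiff(oecd, fertility_oecd$iso3c).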
Q3 group_by & summarize

Now we want one data point per country. Given what the data look like, compute:

  • The mean fertility rate by country
  • The latest data point in unpaid_male for each country (use top_n())

The second one is tough. Look at the answer if you get stuck.

Code
mean_fertility <- fertility_oecd |>
  group_by(country) |>
  summarize(fertility = mean(fertility, na.rm = TRUE))

mean_fertility
# A tibble: 37 × 2
   country   fertility
   <chr>         <dbl>
 1 Australia      1.82
 2 Austria        1.47
 3 Belgium        1.72
 4 Canada         1.58
 5 Chile          1.71
 6 Colombia       1.87
 7 Czechia        1.57
 8 Denmark        1.74
 9 Estonia        1.60
10 Finland        1.64
# … with 27 more rows
Code
latest_unpaid <- unpaid_male_oecd |>
  filter(!is.na(hour_unpaid)) |>
  group_by(country) |>
  top_n(n = 1, wt = year)

latest_unpaid
# A tibble: 26 × 5
# Groups:   country [26]
   country  iso2c iso3c  year hour_unpaid
   <chr>    <chr> <chr> <int>       <dbl>
 1 Belgium  BE    BEL    2013       10.1 
 2 Canada   CA    CAN    2016        9.58
 3 Chile    CL    CHL    2015        9.85
 4 Colombia CO    COL    2017        2.93
 5 Estonia  EE    EST    2010       10.8 
 6 Finland  FI    FIN    2010       10.5 
 7 France   FR    FRA    2010        9.49
 8 Germany  DE    DEU    2013       10.4 
 9 Greece   GR    GRC    2014        7.01
10 Hungary  HU    HUN    2010        7.98
# … with 16 more rows
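The dplyr documentation marks top_n() as superseded by slice_max(), so the same step can also be written as in the sketch below. The toy data are made up. with_ties = FALSE and the final ungroup() are choices made here, not part of the answer above.

```r
library(dplyr)

# Toy stand-in for `unpaid_male_oecd`; the values are made up for illustration.
toy <- tibble(
  country     = c("A", "A", "A", "B", "B"),
  year        = c(2010L, 2013L, 2016L, 2011L, 2015L),
  hour_unpaid = c(8.0, NA, 9.0, 7.5, NA)
)

latest_toy <- toy |>
  filter(!is.na(hour_unpaid)) |>               # drop years without a value
  group_by(country) |>
  slice_max(year, n = 1, with_ties = FALSE) |> # most recent remaining year per country
  ungroup()                                    # return an ungrouped tibble

latest_toy
```

Dropping the NA rows before slicing matters: without that filter, country B would return its 2015 row, which has no value.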
Q4 left_join()

Merge the two data frames you created in the previous question.

Code
data <- mean_fertility |>
  left_join(latest_unpaid, by = "country")

data
# A tibble: 37 × 6
   country   fertility iso2c iso3c  year hour_unpaid
   <chr>         <dbl> <chr> <chr> <int>       <dbl>
 1 Australia      1.82 <NA>  <NA>     NA       NA   
 2 Austria        1.47 <NA>  <NA>     NA       NA   
 3 Belgium        1.72 BE    BEL    2013       10.1 
 4 Canada         1.58 CA    CAN    2016        9.58
 5 Chile          1.71 CL    CHL    2015        9.85
 6 Colombia       1.87 CO    COL    2017        2.93
 7 Czechia        1.57 <NA>  <NA>     NA       NA   
 8 Denmark        1.74 <NA>  <NA>     NA       NA   
 9 Estonia        1.60 EE    EST    2010       10.8 
10 Finland        1.64 FI    FIN    2010       10.5 
# … with 27 more rows
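The NA rows in the result come from left_join() keeping every row of the left table, even when the right table has no match. A small sketch with made-up data shows this, and uses anti_join() to list the unmatched rows. The names left and right are hypothetical.

```r
library(dplyr)

# Made-up tables standing in for `mean_fertility` and `latest_unpaid`.
left  <- tibble(country = c("A", "B", "C"), fertility = c(1.8, 1.5, 1.7))
right <- tibble(country = c("A", "C"), hour_unpaid = c(9.5, 10.1))

# Every row of `left` is kept; country B gets NA for hour_unpaid.
joined <- left |>
  left_join(right, by = "country")

# Rows of `left` that have no partner in `right`.
unmatched <- left |>
  anti_join(right, by = "country")

joined
unmatched
```

If only complete pairs are needed, inner_join() would drop the unmatched rows instead of filling them with NA.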

ggplot2

Q5 geom_point()

Draw a scatter plot of men's unpaid work hours against the fertility rate. To improve it, try the following:

  • Fit a linear regression line
  • Change the axis limits of the coordinate system
  • Pick another theme
Code
data |>
  ggplot(aes(x = hour_unpaid, y = fertility)) +
  geom_point() +
  stat_smooth(formula = 'y ~ x',
              method = "lm", se = FALSE) +
  coord_cartesian(ylim = c(1.1, 2.3)) +
  theme_minimal()
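The line drawn by stat_smooth(method = "lm") with formula y ~ x is an ordinary least-squares fit, so its intercept and slope can be inspected with lm() directly. This is a minimal sketch on made-up points that lie exactly on a known line. The data frame toy is hypothetical.

```r
# Toy data lying exactly on the line y = 2 - 0.1 * x (made up for illustration).
toy <- data.frame(hour_unpaid = c(2, 4, 6, 8, 10))
toy$fertility <- 2 - 0.1 * toy$hour_unpaid

# The same kind of model that stat_smooth(method = "lm", formula = "y ~ x") fits.
fit <- lm(fertility ~ hour_unpaid, data = toy)

# Intercept and slope of the fitted line.
coef(fit)
```

On the real data, lm(fertility ~ hour_unpaid, data = data) should describe the line in the plot. By default lm() leaves out rows with NA.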

Q6 (Extra) ggrepel::geom_text_repel()

Put country labels on the data points. ggrepel::geom_text_repel() pushes overlapping labels away from each other, so it usually gives a more readable plot than the original geom_text().

Code
data |>
  ggplot(aes(x = hour_unpaid, y = fertility, label = country)) +
  geom_point() +
  ggrepel::geom_text_repel() +
  stat_smooth(formula = 'y ~ x',
              method = "lm", se = FALSE) +
  coord_cartesian(ylim = c(1.1, 2.3)) +
  theme_minimal()