Functions

Lecture 12

Dr. Benjamin Soltoff

Cornell University
INFO 2950 - Spring 2024

March 5, 2024

Announcements

Exam 01
Project proposals

Functions

Functions in R

What are some functions you’ve learned? What are their inputs, what are their outputs?

mean()
mutate()
ggplot()

`mean()`

x <- c(1, 2, 3, 4, 5)
mean(x)

[1] 3

library(tidyverse)

df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> summarize(a = mean(a), b = mean(b), c = mean(c), d = mean(d))

# A tibble: 1 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.194 -0.0443 0.308 0.109

df |> summarize(across(.cols = everything(), .fns = mean))

# A tibble: 1 × 4
      a       b     c     d
  <dbl>   <dbl> <dbl> <dbl>
1 0.194 -0.0443 0.308 0.109

Function components

name <- function(arguments) {
  body
}

Name
Arguments
Body
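
As a minimal illustration of the three components, the sketch below defines a hypothetical function `add_one()`. The function name and its argument `x` are made up for this example and do not come from the lecture.

```r
# name: add_one; argument: x; body: the code between the braces
add_one <- function(x) {
  # the value of the last evaluated expression is returned
  x + 1
}

add_one(2)

# base R can report the argument names and the body of a function
names(formals(add_one))
body(add_one)
```

Calling `add_one(2)` evaluates the body with `x` bound to 2.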

`rescale01()`

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)

# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0     1     1     1    
2 0.156 0.579 0.514 0.657
3 1     0     0.537 0    
4 0.298 0.194 0.374 0.711
5 0.325 0.275 0     0.398

# or with across()
df |>
  mutate(across(
    .cols = everything(),
    .fns = rescale01
  ))

# A tibble: 5 × 4
      a     b     c     d
  <dbl> <dbl> <dbl> <dbl>
1 0     1     1     1    
2 0.156 0.579 0.514 0.657
3 1     0     0.537 0    
4 0.298 0.194 0.374 0.711
5 0.325 0.275 0     0.398
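
To see what `rescale01()` does without random data, it may help to call it on a small vector whose answer is easy to verify by hand. The input vectors below are illustrative assumptions, not data from the lecture.

```r
rescale01 <- function(x) {
  # range() returns c(min, max); na.rm = TRUE ignores missing values
  rng <- range(x, na.rm = TRUE)
  # subtract the minimum, then divide by the width of the range
  (x - rng[1]) / (rng[2] - rng[1])
}

# minimum maps to 0, maximum maps to 1, the midpoint maps to 0.5
rescale01(c(0, 5, 10))

# a missing value stays missing, but does not break the other results
rescale01(c(1, NA, 3))
```

Because `na.rm = TRUE` is only applied when computing the range, an `NA` in the input is returned as `NA` rather than being dropped, so the output keeps the same length as the input.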

Custom function: `temp_convert()`

Goal: Convert temperatures from degrees Celsius to degrees Fahrenheit: multiply by $\frac{9}{5}$ and add 32.
Number and type of inputs: 1 (a numeric vector of length 1)
Number and type of outputs: 1 (a numeric vector of length 1)

temp_convert <- function(temp_c) {
  (temp_c * 9 / 5) + 32
}

Test out the function

temp_convert(0) # freezing point

[1] 32

temp_convert(220) # bread baking temperature

[1] 428

temp_convert(100) # boiling point

[1] 212
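
The manual checks above can also be written as assertions, so that a mistake in the function raises an error instead of relying on a visual comparison. This is a sketch using base R's `stopifnot()` and `all.equal()`; the check at -40 degrees (where the two scales coincide) is an extra test case not shown in the lecture.

```r
temp_convert <- function(temp_c) {
  (temp_c * 9 / 5) + 32
}

# each condition must be TRUE, otherwise stopifnot() raises an error
stopifnot(
  isTRUE(all.equal(temp_convert(0), 32)),    # freezing point
  isTRUE(all.equal(temp_convert(100), 212)), # boiling point
  isTRUE(all.equal(temp_convert(-40), -40))  # Celsius equals Fahrenheit here
)
```

If every check passes, `stopifnot()` returns silently.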

Why do we need functions?

Repeat yourself:

# freezing point
(0 * 9 / 5) + 32

[1] 32

# bread baking temperature
(220 * 9 / 5) + 32

[1] 428

# boiling point
(100 * 9 / 5) + 32

[1] 212

Do not repeat yourself (DRY):

# freezing point
temp_convert(0)

[1] 32

# bread baking temperature
temp_convert(220)

[1] 428

# boiling point
temp_convert(100)

[1] 212

Vectorized functions in R

Many functions in R are vectorized, meaning they automatically operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.

x <- c(0, 100, 220)

temp_convert(x)

[1]  32 212 428

tibble(temp_c = x) |> mutate(temp_f = temp_convert(temp_c))

# A tibble: 3 × 2
  temp_c temp_f
   <dbl>  <dbl>
1      0     32
2    100    212
3    220    428
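
For contrast, the sketch below writes out the explicit iteration that vectorization makes unnecessary. The variable names `temp_f_loop` and `temp_f_vec` are introduced here for illustration only.

```r
temp_convert <- function(temp_c) {
  (temp_c * 9 / 5) + 32
}

x <- c(0, 100, 220)

# explicit iteration: convert one element at a time
temp_f_loop <- numeric(length(x))
for (i in seq_along(x)) {
  temp_f_loop[i] <- temp_convert(x[i])
}

# vectorized: the arithmetic operators act on the whole vector at once
temp_f_vec <- temp_convert(x)

# both approaches give the same values
all.equal(temp_f_loop, temp_f_vec)
```

`temp_convert()` is vectorized without any extra work because its body uses only arithmetic operators, which are themselves vectorized in R.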

Data frame functions

Work like dplyr verbs
- First argument is a data frame
- Additional arguments about what to do with the data frame
- Output is a data frame or vector
Require use of indirection and embracing {{ }}

Indirection and tidy evaluation

library(palmerpenguins)

grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by(group_var) |> 
    summarize(mean = mean(mean_var, na.rm = TRUE))
}

penguins |> grouped_mean(species, bill_length_mm)

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.

Data masking

Accessing data variables as if they were variables in the environment

my_variable instead of df$my_variable
Base R requires identifying the location of variables in data frames

# base R
penguins[penguins$species == "Adelie" & penguins$body_mass_g >= 4000, ]

# tidyverse + data masking
penguins |> filter(species == "Adelie", body_mass_g >= 4000)
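
One caveat when comparing the two styles: they differ in more than syntax when the condition evaluates to `NA`. The sketch below uses a small made-up data frame (`toy`, not the penguins data) to show this, under the assumption that dplyr is installed.

```r
library(dplyr)

# hypothetical toy data with one missing body mass
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(4100, NA, 5000)
)

# base R: the condition is c(TRUE, NA, FALSE), and an NA index
# produces a row filled with NA
base_rows <- toy[toy$species == "Adelie" & toy$body_mass_g >= 4000, ]

# dplyr: filter() keeps only rows where the condition is TRUE
dplyr_rows <- toy |> filter(species == "Adelie", body_mass_g >= 4000)

nrow(base_rows)
nrow(dplyr_rows)
```

The base R result contains an extra all-`NA` row for the Adelie observation with missing body mass, while `filter()` drops that row.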

Embracing

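Want a function argument to name a data variable? Embrace it with {{ }} so dplyr uses the column the caller supplied instead of looking for a column named after the argument.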

grouped_mean <- function(df, group_var, mean_var) {
  df |> 
    group_by({{ group_var }}) |> 
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE))
}

penguins |> grouped_mean(species, bill_length_mm)

# A tibble: 3 × 2
  species    mean
  <fct>     <dbl>
1 Adelie     38.8
2 Chinstrap  48.8
3 Gentoo     47.5

When to embrace

Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.
Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.
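
The sketch below shows one embraced argument in each context. The helper names `keep_rows()` and `keep_cols()` and the `toy` tibble are hypothetical and exist only for this illustration.

```r
library(dplyr)

toy <- tibble(
  a = c(1, 2, 3),
  b = c("x", "y", "z"),
  flag = c(TRUE, FALSE, TRUE)
)

# data-masking: the embraced argument is an expression computed on columns
keep_rows <- function(df, condition) {
  df |> filter({{ condition }})
}

# tidy-selection: the embraced argument is a selection of columns
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

toy |> keep_rows(a >= 2)
toy |> keep_cols(c(a, b))
```

In both helpers the embrace operator passes the caller's expression through to the dplyr verb; what differs is how the verb interprets it, as a computation in `filter()` and as a column selection in `select()`.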

Common use cases

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

penguins |> summary6(body_mass_g)

# A tibble: 1 × 6
    min  mean median   max     n n_miss
  <int> <dbl>  <dbl> <int> <int>  <int>
1  2700 4202.   4050  6300   344      2

penguins |> group_by(species) |> summary6(body_mass_g)

# A tibble: 3 × 7
  species     min  mean median   max     n n_miss
  <fct>     <int> <dbl>  <dbl> <int> <int>  <int>
1 Adelie     2850 3701.   3700  4775   152      1
2 Chinstrap  2700 3733.   3700  4800    68      0
3 Gentoo     3950 5076.   5000  6300   124      1

Data-masking vs. tidy-selection

count_missing <- function(df, group_vars, x_var) {
  df |> 
    group_by({{ group_vars }}) |> 
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |> 
  count_missing(c(species, island), bill_length_mm)

Error in `group_by()`:
ℹ In argument: `c(species, island)`.
Caused by error:
! `c(species, island)` must be size 344 or 1, not 688.
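
The size in the error message follows from how data-masking evaluates `c(species, island)`: as ordinary R code, it concatenates the two columns into a single vector twice as long as the data (2 × 344 = 688 for penguins). The sketch below reproduces the effect on a made-up three-row data frame using base R's `with()`.

```r
# hypothetical toy data with two grouping columns
toy <- data.frame(
  g1 = c("a", "a", "b"),
  g2 = c("x", "y", "y")
)

# evaluated against the data, c() joins the two columns end to end
combined <- with(toy, c(g1, g2))

length(combined)
nrow(toy)
```

A vector of length 6 cannot serve as a grouping variable for a data frame with 3 rows, which is the same mismatch `group_by()` reports for the penguins data.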

Use `pick()` for tidy-selection

count_missing <- function(df, group_vars, x_var) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |> 
  count_missing(c(species, island), bill_length_mm)

# A tibble: 5 × 3
  species   island    n_miss
  <fct>     <fct>      <int>
1 Adelie    Biscoe         0
2 Adelie    Dream          0
3 Adelie    Torgersen      1
4 Chinstrap Dream          0
5 Gentoo    Biscoe         1

Application exercise

`ae-10`

Go to the course GitHub org and find your ae-10 (repo name will be suffixed with your GitHub name).
Clone the repo in RStudio Workbench, open the Quarto document in the repo, and follow along and complete the exercises.
Render, commit, and push your edits by the AE deadline: end of tomorrow.

Recap

Writing your own functions is a great way to make your code more readable and reusable.
Functions can be vectorized, meaning they operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.
Data frame functions work like dplyr verbs, and require use of indirection and embracing {{ }}.
