Lecture 12
Cornell University
INFO 2950 - Spring 2024
March 5, 2024
What are some functions you’ve learned? What are their inputs, what are their outputs?
mean(), mutate(), ggplot()
mean(), rescale01(), temp_convert()
Repeat yourself:
Many functions in R are vectorized, meaning they automatically operate on all elements of a vector without needing to explicitly iterate through and act on each element individually.
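As a small illustration of vectorization, the sketch below converts a few made-up temperatures (the vector `temps_f` is hypothetical, not course data). The vectorized arithmetic gives the same result as an explicit loop over each element:

```r
# Hypothetical temperatures in Fahrenheit, for illustration only
temps_f <- c(32, 50, 212)

# Vectorized: the arithmetic is applied to every element at once
temps_c <- (temps_f - 32) * 5 / 9

# Equivalent explicit loop, shown only for comparison
temps_c_loop <- numeric(length(temps_f))
for (i in seq_along(temps_f)) {
  temps_c_loop[i] <- (temps_f[i] - 32) * 5 / 9
}

# Both approaches produce the same vector
identical(temps_c, temps_c_loop)
```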
library(dplyr)
library(palmerpenguins)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    # group_var is read as a literal column name, so the call below fails
    group_by(group_var) |>
    summarize(mean = mean(mean_var, na.rm = TRUE))
}

penguins |> grouped_mean(species, bill_length_mm)

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `group_var` is not found.
Accessing data variables as if they were variables in the environment
my_variable instead of df$my_variable
Base R requires identifying the location of variables in data frames
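To make the contrast concrete, here is a minimal sketch, assuming the dplyr and palmerpenguins packages are installed. Base R needs the data frame spelled out with `$`, while data masking in dplyr lets the column name stand alone (the names `base_mean` and `masked_mean` are just for illustration):

```r
library(dplyr)
library(palmerpenguins)

# Base R: the data frame must be named each time a column is used
base_mean <- mean(penguins$body_mass_g, na.rm = TRUE)

# dplyr: data masking lets body_mass_g be written like an ordinary variable
masked_mean <- penguins |>
  summarize(mean = mean(body_mass_g, na.rm = TRUE)) |>
  pull(mean)

# The two approaches should agree
all.equal(base_mean, masked_mean)
```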
Want to access a variable without referring to the data frame as well?
Data-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.
Tidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.
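The earlier `grouped_mean()` fails because `group_by()` looks for a column literally named `group_var`. One way to repair it, sketched here with the embrace operator `{{ }}`, is to wrap each argument that names a column so dplyr looks inside the argument instead of using its name (the result name `species_means` is only for illustration):

```r
library(dplyr)
library(palmerpenguins)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    # Embracing tells dplyr to use the column the caller passed in
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE))
}

species_means <- penguins |> grouped_mean(species, bill_length_mm)
species_means
```

The result should have one row per species, with the group column followed by `mean`.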
summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

penguins |> summary6(body_mass_g)

# A tibble: 1 × 6
    min  mean median   max     n n_miss
  <int> <dbl>  <dbl> <int> <int>  <int>
1  2700 4202.   4050  6300   344      2
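Because `summary6()` wraps `summarize()`, it should also work on grouped data, and `.groups = "drop"` returns an ungrouped result. A usage sketch, assuming dplyr and palmerpenguins are installed (the name `by_species` is only for illustration):

```r
library(dplyr)
library(palmerpenguins)

summary6 <- function(data, var) {
  data |> summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}

# Group first, then summarize: one row of statistics per species
by_species <- penguins |>
  group_by(species) |>
  summary6(body_mass_g)

by_species
```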
count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by({{ group_vars }}) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |>
  count_missing(c(species, island), bill_length_mm)

Error in `group_by()`:
ℹ In argument: `c(species, island)`.
Caused by error:
! `c(species, island)` must be size 344 or 1, not 688.
pick() for tidy-selection

count_missing <- function(df, group_vars, x_var) {
  df |>
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

penguins |>
  count_missing(c(species, island), bill_length_mm)

# A tibble: 5 × 3
  species   island    n_miss
  <fct>     <fct>      <int>
1 Adelie    Biscoe         0
2 Adelie    Dream          0
3 Adelie    Torgersen      1
4 Chinstrap Dream          0
5 Gentoo    Biscoe         1
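The `pick()` version of `count_missing()` is not limited to `c(...)`. As a sketch, assuming dplyr 1.1.0 or later (where `pick()` is available) and palmerpenguins, a single grouping variable can be passed the same way (the name `by_species` is only for illustration):

```r
library(dplyr)
library(palmerpenguins)

count_missing <- function(df, group_vars, x_var) {
  df |>
    # pick() applies tidy-selection inside the data-masking group_by()
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

# A single grouping variable: one row per species
by_species <- penguins |>
  count_missing(species, bill_length_mm)

by_species
```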
ae-10
ae-10 (repo name will be suffixed with your GitHub name).