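I am using R and the Tidyverse, and I am writing Tidyverse-style pure functions.
I have a relational database in which, unfortunately, the IDs that link the tables are written in different formats, for example "Fries, french" versus "French fries". To fix this, I want a function that standardizes names, so that I can use it like this: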
tibble_a_with_ids_written_well <-
tibble_a_with_IDs_written_incorrectly |>
mutate(id = id |> name_standardizer())
Afterwards I would perform the join:
tibble_a_with_ids_written_well |>
left_join(tibble_b, by = "id")
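The name_standardizer function contains a small hardcoded tibble with two columns, which I intend to update by hand every time I come across a new way of writing something. With that in mind, I wrote the function as follows: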
name_standardizer <- function(incoming_name){
hardcoded_dictionary <-
tribble(~incorrect_name, ~correct_name,
"Fries, french", "French fries",
"Hamborgar", "Hamburger")
hardcoded_dictionary |>
filter(
incoming_name == incorrect_name |
incoming_name == correct_name) |>
pull(correct_name)
}
So far this works well, but when I use the function with another database I run into a problem. Here is an example of a third tibble:
tibble_c_with_ids_written_incorrectly <-
tribble(~id, ~health_rating,
"Burger", 3.8)
I can then update the function's dictionary:
hardcoded_dictionary <-
tribble(~incorrect_name, ~correct_name,
"Fries, french", "French fries",
"Hamborgar", "Hamburger",
"Burger", NA) |> # NAs and fill mean I write down the correct name less.
fill(correct_name) # This reduces human error.
Then, if I run:
tibble_c_with_ids_written_incorrectly |> mutate(id = id |> name_standardizer())
everything runs smoothly. However, with this new dictionary in name_standardizer, I can no longer standardize tibble_a cleanly. This is the message I get:
> tibble_a_with_ids_written_well <-
+ tibble_a_with_IDs_written_incorrectly |>
+ mutate(id = id |> name_standardizer())
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `id = name_standardizer(id)`.
Caused by warning:
! There were 2 warnings in `filter()`.
The first warning was:
ℹ In argument: `incoming_name == incorrect_name | incoming_name == correct_name`.
Caused by warning in `incoming_name == incorrect_name`:
! longer object length is not a multiple of shorter object length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning.
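My intuition is that I am not handling vectorization correctly. I am missing something here.
What can I change in my function so that I can keep a hardcoded dictionary and still use the function generically across many different tibbles?
Here is my code, so that you have a reproducible example: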
library(tidyverse)
tibble_a_with_IDs_written_incorrectly <-
tribble(~id, ~price,
"Fries, french", 2,
"Hamborgar", 7)
name_standardizer <- function(incoming_name){
hardcoded_dictionary <-
tribble(~incorrect_name, ~correct_name,
"Fries, french", "French fries",
"Hamborgar", "Hamburger",
"Burger", NA) |> # NAs and fill mean I write down the correct name less.
fill(correct_name) # This reduces human error.
hardcoded_dictionary |>
filter(
incoming_name == incorrect_name |
incoming_name == correct_name) |>
pull(correct_name)
}
tibble_a_with_ids_written_well <-
tibble_a_with_IDs_written_incorrectly |>
mutate(id = id |> name_standardizer())
tibble_b <-
tribble(~id, ~amount_in_stock,
"French fries", 4,
"Hamburger", 3)
tibble_a_with_ids_written_well |>
left_join(tibble_b, by = "id")
tibble_c_with_ids_written_incorrectly <-
tribble(~id, ~health_rating,
"Burger", 3.8)
tibble_c_with_ids_written_incorrectly |>
mutate(id = id |> name_standardizer())
I also tried passing the whole tibble to the function instead of using mutate on the column. That did not work either.
The == operator tests equality pairwise, element by element, so
1:3 == 5:7
is internally doing
c(1 == 5, 2 == 6, 3 == 7)
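This pairwise operation works well when the LHS and RHS either have the same length or one of them has length 1; that is, length-8 against length-1 works (and vice versa), but length-2 against length-3 does not. (R's sloppy recycling rules allow whole multiples, so
1:2 == 1:4
does not produce an error, though in my opinion relying on it being interpreted correctly is a really bad idea.)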
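The recycling cases described above can be checked directly in base R. This is only a small sketch; the variable names are made up for illustration, and the warning is captured with a handler so the script runs quietly.

```r
# Same length: compared element by element.
same_length <- 1:3 == 5:7

# Length-n against length-1: the single value is recycled for every element.
scalar_rhs <- 1:3 == 2

# Whole multiple: 1:2 is silently recycled to c(1, 2, 1, 2).
whole_multiple <- 1:2 == 1:4

# Not a whole multiple: R still returns a result, but signals a warning.
warned <- FALSE
partial <- withCallingHandlers(
  1:2 == 1:3,
  warning = function(w) {
    warned <<- TRUE                 # remember that a warning was raised
    invokeRestart("muffleWarning")  # keep the console quiet
  }
)

print(same_length)
print(scalar_rhs)
print(whole_multiple)
print(partial)
print(warned)
```

The last case is the one behind "longer object length is not a multiple of shorter object length": the comparison still produces a vector, it is just not the comparison you meant.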
So the length-n against length-1 case looks like:
1:3 == 2
c(1 == 2, 2 == 2, 3 == 2)
In your case, however, you are effectively testing this:
c("Fries, french", "Hamborgar") == c("Fries, french", "Hamborgar", "Burger")
where the first is length 2 and the second is length 3.
So for your case, I think one way you can do this is:
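What the original filter() really wants to ask is "does each incoming name appear anywhere in the dictionary?", which is a membership question rather than a pairwise one. As a side note, base R's %in% and match() answer that question regardless of the two lengths. The sketch below uses illustrative variable names and is not the approach taken in the rest of this answer.

```r
incoming_name    <- c("Fries, french", "Hamborgar")
dictionary_names <- c("Fries, french", "Hamborgar", "Burger")

# %in% returns one TRUE/FALSE per element of the left-hand side.
is_known <- incoming_name %in% dictionary_names

# match() returns the position of each incoming name in the dictionary,
# or NA when the name is not found.
positions        <- match(incoming_name, dictionary_names)
unknown_position <- match("Salad", dictionary_names)

print(is_known)
print(positions)
print(unknown_position)
```

Both results have the length of incoming_name, which is what mutate() needs from a function applied to a column.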
name_standardizer <- function(incoming_name){
hardcoded_dictionary <-
tribble(~incorrect_name, ~correct_name,
"Fries, french", "French fries",
"Hamborgar", "Hamburger",
"Burger", NA) |> # NAs and fill mean I write down the correct name less.
fill(correct_name) # This reduces human error.
tibble(incorrect_name = incoming_name) |>
left_join(hardcoded_dictionary, by = "incorrect_name") |>
mutate(correct_name = coalesce(correct_name, incorrect_name)) |>
pull(correct_name)
}
tibble_a_with_ids_written_well <-
tibble_a_with_IDs_written_incorrectly |>
mutate(id = id |> name_standardizer())
tibble_a_with_ids_written_well
# # A tibble: 2 × 2
# id price
# <chr> <dbl>
# 1 French fries 2
# 2 Hamburger 7
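To check that the join-based version is usable across different tibbles, here is a hedged sketch that repeats the function so it is self-contained and then runs it on tibble_c and on a made-up vector (mixed_ids is hypothetical, not from the question) that mixes a misspelling, an already-correct name, and a name missing from the dictionary. It assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Same function as above, repeated so this sketch runs on its own.
name_standardizer <- function(incoming_name) {
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar",     "Hamburger",
            "Burger",        NA) |>
    fill(correct_name)

  tibble(incorrect_name = incoming_name) |>
    left_join(hardcoded_dictionary, by = "incorrect_name") |>
    # Names without a dictionary entry are passed through unchanged.
    mutate(correct_name = coalesce(correct_name, incorrect_name)) |>
    pull(correct_name)
}

tibble_c_with_ids_written_incorrectly <-
  tribble(~id,      ~health_rating,
          "Burger", 3.8)

tibble_c_with_ids_written_well <-
  tibble_c_with_ids_written_incorrectly |>
  mutate(id = id |> name_standardizer())

# Hypothetical input: misspelling, correct name, misspelling, unknown name.
mixed_ids    <- c("Burger", "French fries", "Hamborgar", "Salad")
standardized <- name_standardizer(mixed_ids)

print(tibble_c_with_ids_written_well)
print(standardized)
```

The output always has the same length as the input because each incoming name matches at most one dictionary row. If the same incorrect_name were ever listed twice in the dictionary, the left_join would return extra rows, so it is worth keeping that column unique.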