How do I create a hard-coded name-correction function so that, given a character column of a tibble, it returns a corrected version?


Setup

I am using R and the Tidyverse, and I am writing pure functions in the Tidyverse style.

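I have a relational database in which, unfortunately, the IDs that link the tables are written in different formats, for example "Fries, french" versus "French fries". To fix this, I want a function that standardizes the names, so that I can use it like this: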

tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())

After that, I perform the join:

tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")

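The name_standardizer function contains a small hard-coded tibble with two columns, which I intend to update by hand every time I come across a new way of writing something. With that in mind, I wrote the function as follows: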

name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger")
  
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}

Problem

So far this works fine, but when I use the function with another database I get an error. Here is an example of a third tibble:

tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)

I can then update the function's dictionary:

hardcoded_dictionary <-
  tribble(~incorrect_name, ~correct_name,
          "Fries, french", "French fries",
          "Hamborgar", "Hamburger",
          "Burger", NA) |> # NAs and fill mean I write down the correct name less.
  fill(correct_name) # This reduces human error.

Then, if I run:

tibble_c_with_ids_written_incorrectly |> mutate(id = id |> name_standardizer())

everything goes smoothly. However, with this new dictionary inside name_standardizer, I can no longer standardize tibble_a cleanly. Here is the message I get:

> tibble_a_with_ids_written_well <-
+   tibble_a_with_IDs_written_incorrectly |>
+   mutate(id = id |> name_standardizer())
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `id = name_standardizer(id)`.
Caused by warning:
! There were 2 warnings in `filter()`.
The first warning was:
ℹ In argument: `incoming_name == incorrect_name | incoming_name == correct_name`.
Caused by warning in `incoming_name == incorrect_name`:
! longer object length is not a multiple of shorter object length
ℹ Run dplyr::last_dplyr_warnings() to see the 1 remaining warning. 

My hunch is that I am not handling vectorization correctly. I am missing something here.

What can I change in my function so that I can keep a hard-coded dictionary and still use the function in a generic way across many different tibbles?

Reproducible example

Here is my code, so that you have a reproducible example:


library(tidyverse)

tibble_a_with_IDs_written_incorrectly <-
  tribble(~id, ~price,
          "Fries, french", 2,
          "Hamborgar", 7)


name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> # NAs and fill mean I write down the correct name less.
    fill(correct_name) # This reduces human error.
  
  hardcoded_dictionary |>
    filter(
      incoming_name == incorrect_name |
        incoming_name == correct_name) |>
    pull(correct_name)
}

tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())

tibble_b <-
  tribble(~id, ~amount_in_stock,
          "French fries", 4,
          "Hamburger", 3)

tibble_a_with_ids_written_well |>
  left_join(tibble_b, by = "id")

tibble_c_with_ids_written_incorrectly <-
  tribble(~id, ~health_rating,
          "Burger", 3.8)

tibble_c_with_ids_written_incorrectly |>
  mutate(id = id |> name_standardizer())

I also tried passing the whole tibble to the function instead of using mutate on a column. That did not work either.

r tidyverse
1 Answer

The == operator tests equality element by element, so for

1:3 == 5:7

this is what it does internally:

c(1 == 5, 2 == 6, 3 == 7)

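This element-wise operation works well when the LHS and RHS have the same length, or when one of them has length 1; that is, length-8 against length-1 works, and vice versa, but length-2 against length-3 does not recycle cleanly and triggers the "longer object length is not a multiple of shorter object length" warning. (R's sloppy recycling rules do allow whole multiples, so 1:2 == 1:4 produces neither a warning nor an error, though in my opinion relying on that being interpreted the way you intend is a very bad idea.)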
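To make the recycling behaviour described above concrete, here is a minimal base R sketch. The variable name recycling_warning is only illustrative; tryCatch() is used to capture the warning text instead of letting it print.

```r
# Same length: compared position by position.
same_length <- 1:3 == 5:7

# One side has length 1: the single value is reused for every element.
length_one <- 1:3 == 2

# Length 4 is a whole multiple of length 2, so 1:2 is silently
# recycled to c(1, 2, 1, 2). No warning is raised.
whole_multiple <- 1:2 == 1:4

# Length 3 is not a multiple of length 2. R would still return a result,
# but it raises a warning, which is captured here as a string.
recycling_warning <- tryCatch(
  1:2 == 1:3,
  warning = function(w) conditionMessage(w)
)

same_length
length_one
whole_multiple
recycling_warning
```

The silent case (1:2 == 1:4) is the dangerous one: nothing tells you that the shorter vector was reused.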

So the length-n against length-1 case looks like:

1:3 == 2
c(1 == 2, 2 == 2, 3 == 2)

In your case, however, you are effectively testing this:

c("Fries, french", "Hamborgar") == c("Fries, french", "Hamborgar", "Burger")

where the first vector has length 2 and the second has length 3.
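As a rough illustration of why a positional comparison is the wrong tool here, the base R sketch below runs that same test and then reverses the incoming names. The variable names (incoming, dictionary, reversed) are made up for this example, and suppressWarnings() is there only to keep the output readable.

```r
incoming   <- c("Fries, french", "Hamborgar")
dictionary <- c("Fries, french", "Hamborgar", "Burger")

# Element-wise comparison: `incoming` is recycled to length 3 (with a warning).
# The matches line up only because both vectors happen to be in the same order.
in_order <- suppressWarnings(incoming == dictionary)

# With the same names in a different order, nothing lines up any more.
reversed     <- rev(incoming)
out_of_order <- suppressWarnings(reversed == dictionary)

# match() is a lookup, not a positional comparison: for each incoming name
# it returns the position of that name in the dictionary, whatever the order.
positions <- match(reversed, dictionary)

in_order
out_of_order
positions
```

So the original filter() call appeared to work only because the incoming rows and the dictionary rows were in the same order; a lookup or a join does not depend on that.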

So for your case, I think one way to do this is:

name_standardizer <- function(incoming_name){
  hardcoded_dictionary <-
    tribble(~incorrect_name, ~correct_name,
            "Fries, french", "French fries",
            "Hamborgar", "Hamburger",
            "Burger", NA) |> # NAs and fill mean I write down the correct name less.
    fill(correct_name) # This reduces human error.
  
  tibble(incorrect_name = incoming_name) |>
    left_join(hardcoded_dictionary, by = "incorrect_name") |> 
    mutate(correct_name = coalesce(correct_name, incorrect_name)) |> 
    pull(correct_name)
}
tibble_a_with_ids_written_well <-
  tibble_a_with_IDs_written_incorrectly |>
  mutate(id = id |> name_standardizer())
tibble_a_with_ids_written_well
# # A tibble: 2 × 2
#   id           price
#   <chr>        <dbl>
# 1 French fries     2
# 2 Hamburger        7
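If you would rather not build a tibble and join on every call, a lookup with match() on a named character vector is one possible alternative. This is only a sketch under my own assumptions: the function name name_standardizer_lookup is hypothetical, the dictionary is written out in full (so it gives up the NA plus fill() shortcut), and unknown names are passed through unchanged, which mirrors the coalesce() step above.

```r
# Hypothetical alternative to the join-based version, base R only.
name_standardizer_lookup <- function(incoming_name) {
  incoming_name <- as.character(incoming_name)

  # Names are the known spellings, values are the standardized names.
  dictionary <- c(
    "Fries, french" = "French fries",
    "Hamborgar"     = "Hamburger",
    "Burger"        = "Hamburger"
  )

  # Position of each incoming name among the known spellings (NA if unknown).
  idx <- match(incoming_name, names(dictionary))
  standardized <- unname(dictionary[idx])

  # Fall back to the original value for names that are not in the dictionary.
  unknown <- is.na(idx)
  standardized[unknown] <- incoming_name[unknown]
  standardized
}

standardized_names <- name_standardizer_lookup(
  c("Hamborgar", "Fries, french", "Burger", "Salad")
)
standardized_names
```

Because the function returns one value per input element, in input order, it can be used inside mutate() the same way as the join-based version, for example mutate(id = name_standardizer_lookup(id)).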