The most efficient way to write a mapping function from a large CSV of recoding data

Question

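Imagine I have a data frame loaded from a large CSV that someone gave me, containing a mapping/recoding to apply to data in other datasets. Here is a small reproducible example of what the CSV might contain: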

library(wakefield)
csv_mapping <- data.frame(
  from = as.character(name(30)),
  to = as.character(likert_7(30))  
)

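What is the fastest way to create a mapping function from this data frame so that the function does not depend on the CSV data source? I usually do this by running: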

dput(csv_mapping$from)
dput(csv_mapping$to)

in my console, then copying and pasting the resulting vectors into a function that uses plyr::mapvalues(), as follows:

mapping_fn <- function(x) {

  fromvec <- c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna", "Enam", 
               "Yain", "Tyonna", "Shekira", "Eleanna", "Azriela", "Saajida", 
               "Chantee", "Julieanne", "Genisha", "Delesha", "Macenzi", "Alyasia", 
               "Latonga", "Josuhe", "Arter", "Stone", "Ramaj", "Lilinoe", "Zacharie", 
               "Joshuamichael", "Desseray", "Colorado", "Jaidn", "Verline")

  tovec <- c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral", 
          "Somewhat Disagree", "Neutral", "Strongly Agree", "Somewhat Disagree", 
          "Disagree", "Strongly Disagree", "Disagree", "Somewhat Agree", 
          "Strongly Disagree", "Strongly Disagree", "Somewhat Agree", "Strongly Agree", 
          "Somewhat Agree", "Disagree", "Disagree", "Strongly Agree", "Strongly Disagree", 
          "Disagree", "Somewhat Agree", "Strongly Disagree", "Strongly Disagree", 
          "Neutral", "Somewhat Agree", "Agree", "Disagree")

  plyr::mapvalues(x, from = fromvec, to = tovec, warn_missing = FALSE)

}

Given that plyr is now retired, is there a smarter or faster way to do this without using mapvalues?

Tags: r, dplyr, plyr

1 Answer

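A natural approach is to use a join. This is especially useful if your data already lives in a data frame, but it can be adapted if you only want to map a vector of values.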

Say we have a mapping defined by the CSV as follows:

csv_mapping <- data.frame(from = c("Kameira", "Sanavi", "Avangelene", 
                                   "Maryonna", "Wyvonna"),
                          to = c("Agree", "Somewhat Disagree", "Agree",
                                 "Agree", "Neutral"))

csv_mapping
#>         from                to
#> 1    Kameira             Agree
#> 2     Sanavi Somewhat Disagree
#> 3 Avangelene             Agree
#> 4   Maryonna             Agree
#> 5    Wyvonna           Neutral

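Then say we have a data frame df whose column x holds the values we want to map to new values. Note that df can contain other columns as well; here we add a column of random values for demonstration.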

df <- data.frame(x = c("Sanavi", "Maryonna", "Maryonna", "Wyvonna",
                       "Kameira","Avangelene", "Sanavi", "Wyvonna"),
                 vals = rnorm(8))

df
#>            x        vals
#> 1     Sanavi -0.95005745
#> 2   Maryonna -0.20650715
#> 3   Maryonna -0.07755789
#> 4    Wyvonna  1.72379970
#> 5    Kameira -1.36642679
#> 6 Avangelene -1.48638577
#> 7     Sanavi  0.16987157
#> 8    Wyvonna -0.55194346

We can then use dplyr's left_join to bring the mapped values into the data frame.

dplyr::left_join(df, csv_mapping, by = c("x" = "from"))
#>            x        vals                to
#> 1     Sanavi -0.95005745 Somewhat Disagree
#> 2   Maryonna -0.20650715             Agree
#> 3   Maryonna -0.07755789             Agree
#> 4    Wyvonna  1.72379970           Neutral
#> 5    Kameira -1.36642679             Agree
#> 6 Avangelene -1.48638577             Agree
#> 7     Sanavi  0.16987157 Somewhat Disagree
#> 8    Wyvonna -0.55194346           Neutral

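At this point you have the corresponding to value for each x value under the given mapping. If you only need these to values, simply extract the to column from the data frame.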
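If what you want is a reusable function rather than a joined data frame, the same join can be wrapped in a small function factory. The following is only a sketch: the names make_mapper and recode_fn are hypothetical, and it assumes that each from value appears at most once in the mapping. Note that left_join returns NA for values with no match, whereas plyr::mapvalues leaves them unchanged, so the sketch uses dplyr::coalesce to fall back to the original value.

```r
library(dplyr)

# Hypothetical helper: builds a mapping function from a data frame
# that has `from` and `to` columns (e.g. one read from a CSV).
make_mapper <- function(mapping) {
  # Work on plain character columns so factors do not interfere with the join.
  mapping <- data.frame(
    from = as.character(mapping$from),
    to = as.character(mapping$to),
    stringsAsFactors = FALSE
  )
  # Assumption: `from` values are unique; duplicates would add rows in the join.
  stopifnot(anyDuplicated(mapping$from) == 0)

  function(x) {
    lookup <- data.frame(from = as.character(x), stringsAsFactors = FALSE)
    joined <- left_join(lookup, mapping, by = "from")
    # Unmatched values come back as NA in `to`; keep the original value instead.
    coalesce(joined$to, joined$from)
  }
}

csv_mapping <- data.frame(
  from = c("Kameira", "Sanavi", "Avangelene", "Maryonna", "Wyvonna"),
  to = c("Agree", "Somewhat Disagree", "Agree", "Agree", "Neutral")
)

recode_fn <- make_mapper(csv_mapping)
recode_fn(c("Sanavi", "Unknown", "Wyvonna"))
#> [1] "Somewhat Disagree" "Unknown"           "Neutral"
```

Because the mapping is captured when make_mapper is called, the returned function no longer depends on the CSV file, and nothing has to be pasted from dput output.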

Created on 2020-06-03 by the reprex package (v0.3.0)
