Replace values in a data frame based on a lookup table

Question    Votes: 32    Answers: 6

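I am having some trouble replacing values in a data frame. I would like to replace the values based on a separate table. Below is an example of what I am trying to do.

I have a table where each row is a customer and each column is an animal they purchased. Let's call this data frame table.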

> table
#       P1     P2     P3
# 1    cat lizard parrot
# 2 lizard parrot    cat
# 3 parrot    cat lizard

I also have a second table, which I will refer to as lookUp.

> lookUp
#      pet   class
# 1    cat  mammal
# 2 lizard reptile
# 3 parrot    bird

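What I want to do is create a new table called new, using a function that replaces every value in table with the matching value from the class column of lookUp. I tried this myself with lapply, but I got the following warnings.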

new <- as.data.frame(lapply(table, function(x) {
  gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)

Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used

Any ideas on how to make this work?

r dataframe lookup
6 Answers
33
votes

The approach you posted in your question is not bad. Here is a similar one:

new <- df  # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])

Another approach, which will be faster, is:

new <- df
new[] <- look$class[match(unlist(df), look$pet)]

Note that in both cases I use empty brackets ([]) so that new keeps its structure (a data.frame).

(In my answer I use df instead of table and look instead of lookup.)
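
To see both variants run end to end, here is a minimal sketch that rebuilds the question's data under the names df and look. The read.table calls, the stringsAsFactors = FALSE setting, and the names new1 and new2 are assumptions made for illustration, not part of the original post.

```r
# Rebuild the example data (assumed construction; df/look follow this answer's naming)
df <- read.table(text = "
P1 P2 P3
cat lizard parrot
lizard parrot cat
parrot cat lizard", header = TRUE, stringsAsFactors = FALSE)

look <- read.table(text = "
pet class
cat mammal
lizard reptile
parrot bird", header = TRUE, stringsAsFactors = FALSE)

# match() gives, for each value, its row position in look$pet
idx <- match(df$P1, look$pet)

# Variant 1: look up one column at a time
new1 <- df
new1[] <- lapply(df, function(x) look$class[match(x, look$pet)])

# Variant 2: a single match() call over all values at once
new2 <- df
new2[] <- look$class[match(unlist(df), look$pet)]

# Both variants should agree and both remain data frames
all.equal(new1, new2)
is.data.frame(new2)
```

Without the empty brackets, the right-hand side of the second variant would simply be assigned as a plain character vector and the data frame layout would be lost.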


20
votes

Another option is a combination of tidyr and dplyr:

library(dplyr)
library(tidyr)
table %>%
   gather(key = "pet") %>%
   left_join(lookup, by = "pet") %>%
   spread(key = pet, value = class)

12
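votes

Any time you have two separate data.frames and are trying to bring information from one into the other, the answer is to merge.

Everyone has their own favorite merge method in R. Mine is data.table.

Also, since you want to do this for many columns, it will be faster to melt and dcast: rather than looping over the columns, apply the merge once to a reshaped table, then reshape back.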

library(data.table)

#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE) 
setDT(lookUp)

#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')     
       # merging
       ][lookUp, new_value := i.class, on = c(value = 'pet') 
         #reform back to original shape
         ][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
#    rn      P1      P2      P3
# 1:  1  mammal reptile    bird
# 2:  2 reptile    bird  mammal
# 3:  3    bird  mammal reptile

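If you find the dcast / melt part a bit intimidating, here is an approach that simply loops over the columns; dcast / melt just sidesteps the loop for this problem.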

setDT(table) #don't need row names this time
setDT(lookUp)

sapply(names(table), #(or to whichever are the relevant columns)
       function(cc) table[lookUp, (cc) := #merge, replace
                            #need to pass a _named_ vector to 'on', so use setNames
                            i.class, on = setNames("pet", cc)])

6
votes

Create a named vector, then loop over each column and match. See:

# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1    
#      cat    lizard    parrot 
# "mammal" "reptile"    "bird" 

# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL

# res
#        P1      P2      P3
# 1  mammal reptile    bird
# 2 reptile    bird  mammal
# 3    bird  mammal reptile

data

df1 <- read.table(text = "
       P1     P2     P3
 1    cat lizard parrot
 2 lizard parrot    cat
 3 parrot    cat lizard", header = TRUE)

lookUp <- read.table(text = "
      pet   class
 1    cat  mammal
 2 lizard reptile
 3 parrot    bird", header = TRUE)
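
One caveat worth checking: if the columns are factors (the read.table default before R 4.0.0), indexing a named vector with a factor uses the factor's integer codes rather than its labels. The sketch below uses a made-up factor whose level order differs from the lookup order to show the difference. The names pets, by_code and by_label are hypothetical, and wrapping the index in as.character() is a defensive choice of mine, not part of the answer above.

```r
lookUp1 <- c(cat = "mammal", lizard = "reptile", parrot = "bird")

# Hypothetical factor column whose levels are NOT in lookup order
pets <- factor(c("parrot", "cat"), levels = c("parrot", "cat"))

# Indexing by the factor itself uses its integer codes (1, 2)
by_code <- unname(lookUp1[pets])
by_code
# "mammal" "reptile"

# Converting to character first indexes by name, as intended
by_label <- unname(lookUp1[as.character(pets)])
by_label
# "bird" "mammal"
```

With the example data above the level order happens to coincide with the lookup order, so the problem stays hidden; it would only surface on data where the two orders differ.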

0
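votes

The answer above showing how to do this in dplyr does not answer the question: the resulting table is filled with NAs. The following works, and I would appreciate any comments showing a better way: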

library(dplyr)
library(tidyr)

# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>% 
    # put in long format, naming column filled with P1, P2, P3 "petCount"
    gather(key="petCount", value="pet", -customer) %>% 
    # add a new column based on the pet's class in data frame "lookup"
    left_join(lookup, by="pet") %>%
    # since you wanted to replace the values in "table" with their
    # "class", remove the pet column
    select(-pet) %>% 
    # put data back into wide format
    spread(key="petCount", value="class")

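Note that it may be useful to keep the long table containing the customer, the pet, the pet's species (?) and its class. This example simply adds an intermediate save to a variable: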

table$customer = seq(nrow(table))
petClasses <- table %>% 
    gather(key="petCount", value="pet", -customer) %>% 
    left_join(lookup, by="pet")

custPetClasses <- petClasses %>%
    select(-pet) %>% 
    spread(key="petCount", value="class")
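
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A rough equivalent of the pipeline above might look like the sketch below. The data construction and the names pets and lookUp are assumptions for illustration, and it needs dplyr plus tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Assumed reconstruction of the question's data
pets <- data.frame(P1 = c("cat", "lizard", "parrot"),
                   P2 = c("lizard", "parrot", "cat"),
                   P3 = c("parrot", "cat", "lizard"),
                   stringsAsFactors = FALSE)
lookUp <- data.frame(pet = c("cat", "lizard", "parrot"),
                     class = c("mammal", "reptile", "bird"),
                     stringsAsFactors = FALSE)

classTable <- pets %>%
  # keep track of which row each value came from
  mutate(customer = row_number()) %>%
  # wide to long: one row per customer and purchase slot
  pivot_longer(-customer, names_to = "petCount", values_to = "pet") %>%
  # attach the class of each pet
  left_join(lookUp, by = "pet") %>%
  select(-pet) %>%
  # long back to wide, now holding classes instead of pets
  pivot_wider(names_from = petCount, values_from = class)

classTable
```

The customer column plays the same role as in the answer above: it gives the reshape a key so that each value returns to its original row.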

0
votes

I tried the other approaches and they took a really long time on my very large dataset. I used the following code instead:

    # make table "new" using ifelse. See data below to avoid re-typing it
    new <- ifelse(table1 =="cat", "mammal",
                        ifelse(table1 == "lizard", "reptile",
                               ifelse(table1 =="parrot", "bird", NA)))

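This method requires you to write more code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, whether you would rather spend more time writing code or waiting for your computer to run. If you want to make sure it worked (that is, there are no typos in your ifelse commands), you can use apply(new, 2, function(x) mean(is.na(x))).

Data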

    # create the data table
    table1 <- read.table(text = "
       P1     P2     P3
     1    cat lizard parrot
     2 lizard parrot    cat
     3 parrot    cat lizard", header = TRUE)
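
As a quick illustration of the check described above, the following sketch runs the nested ifelse on the example data and computes the share of NA values per column. The data.frame construction and the name na_share are assumptions for illustration. Note that the result is a character matrix rather than a data frame.

```r
# Assumed reconstruction of the example data
table1 <- data.frame(P1 = c("cat", "lizard", "parrot"),
                     P2 = c("lizard", "parrot", "cat"),
                     P3 = c("parrot", "cat", "lizard"),
                     stringsAsFactors = FALSE)

# Nested ifelse: anything that is not one of the three pets becomes NA
new <- ifelse(table1 == "cat", "mammal",
              ifelse(table1 == "lizard", "reptile",
                     ifelse(table1 == "parrot", "bird", NA)))

# The comparison on a data frame yields a matrix, and so does ifelse
is.matrix(new)

# Share of NA per column; a value above 0 points to a typo or an unmatched pet
na_share <- apply(new, 2, function(x) mean(is.na(x)))
na_share
```

If a data frame is needed afterwards, as.data.frame(new) converts the matrix back.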