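I am trying to subset a grouped dataset so that the code looks at each group, looks at five specific columns, and checks whether rows have matching values in those columns. If multiple rows have 3 or more matching values, it should keep the first row and remove the other rows that have 3 or more matches. NAs should not count toward that match threshold.
As an example, say I group the dataset by month. I want the code to look at the Jan group, identify rows that have 3 or more matches across columns A, B, C, D, and E, and, if two rows meet the match threshold, remove the second one. So in this case it would remove row 2 (because it has three matches with row 1) while keeping all other rows, because they have fewer than 3 matches when compared head-to-head with one another.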
data <- data.frame(
Month = c("Jan", "Jan", "Jan", "Jan", "Feb"),
A = c("apple", "apple", "banana", "cherry", "peach"),
B = c("car", "car", "bike", "bike", "plane"),
C = c(NA, NA, NA, "cold", NA),
D = c("London", "Paris", "Tokyo", "Tokyo", "New York"),
E = c("earth", "earth", "wind", "fire", NA)
)
This is my best attempt so far, but I'm pretty sure it is wrong.
data_cleaned <- data %>%
group_by(`Month`) %>%
filter(n() == 1 | (n() > 1 & row_number() == which.min(rowSums(!is.na(select(.,
`A`, `B`, `C`, `D`, `E` ))) >= 3)))
Here is a data.table approach:
library(data.table)
# set to data.table format
setDT(data)
# set row id's
data[, id := rowid(Month)]
# split to list by Month value
L <- split(data, by = "Month")
# columns to compare
cols <- LETTERS[1:5]
# rbindlist binds the output of the lapply below into a single data.table
rbindlist(
  lapply(L, function(x) {
    # create a data.table with all pairs of rows to compare to one another
    compare_rows <- CJ(row = 1:nrow(x), compare_row = 1:nrow(x))
    # only compare each row with the rows below it
    compare_rows <- compare_rows[row < compare_row, ]
    setkey(compare_rows, row, compare_row)
    # self join to count the matching columns per pair; a comparison
    # involving NA yields NA, so na.rm = TRUE keeps NAs out of the count
    compare_rows[compare_rows, same := {
      sum(x[i.row, .SD, .SDcols = cols] == x[i.compare_row, .SD, .SDcols = cols], na.rm = TRUE)
    }, by = .EACHI]
    # drop rows that have 3 or more matches with an earlier row
    x[!id %in% compare_rows[same >= 3, compare_row]]
  }))
# Month A B C D E id
# 1: Jan apple car <NA> London earth 1
# 2: Jan banana bike <NA> Tokyo wind 3
# 3: Jan cherry bike cold Tokyo fire 4
# 4: Feb peach plane <NA> New York <NA> 1
As you can see, row 2 of Month == "Jan" has been filtered out.
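For comparison, the same rule can be sketched in base R without any packages. This is only a minimal sketch under stated assumptions, not a definitive implementation: `drop_near_duplicates`, `threshold`, and `dat` are hypothetical names introduced here, the compared columns are assumed to be character, and each row is compared with every earlier row in its group (including rows that were themselves dropped).

```r
# Hypothetical helper (not from the original post): within one group, drop a
# row if it shares `threshold` or more non-NA values with any earlier row.
drop_near_duplicates <- function(df, cols, threshold = 3) {
  # character matrix of the compared columns; NA values stay NA
  m <- as.matrix(as.data.frame(df)[cols])
  n <- nrow(m)
  keep <- rep(TRUE, n)
  for (j in seq_len(n)[-1]) {
    for (i in seq_len(j - 1)) {
      # a comparison involving NA yields NA, so na.rm = TRUE keeps NAs
      # out of the match count
      if (sum(m[i, ] == m[j, ], na.rm = TRUE) >= threshold) {
        keep[j] <- FALSE
        break
      }
    }
  }
  df[keep, , drop = FALSE]
}

# Example data mirroring the question (assumed to be a plain data.frame)
dat <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Jan", "Feb"),
  A = c("apple", "apple", "banana", "cherry", "peach"),
  B = c("car", "car", "bike", "bike", "plane"),
  C = c(NA, NA, NA, "cold", NA),
  D = c("London", "Paris", "Tokyo", "Tokyo", "New York"),
  E = c("earth", "earth", "wind", "fire", NA),
  stringsAsFactors = FALSE
)

# Split by Month in order of first appearance, filter each group, recombine
groups <- split(dat, factor(dat$Month, levels = unique(dat$Month)))
result <- do.call(rbind, lapply(groups, drop_near_duplicates, cols = LETTERS[1:5]))
```

On this data, `result` should keep four rows: Jan rows 1, 3 and 4, plus the single Feb row. The nested loop is quadratic in the group size, so it is meant for small groups.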