How to iterate over two unique ID variables to find the maximum value without repeating either unique ID

Question

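I merged two datasets, each with its own unique ID variable: topic and index. Next, I want to find unique pairings of these ID variables based on the maximum of a third numeric variable, value. This is usually done with some combination of group_by, arrange, and top_n/slice_max. However, doing so returns duplicate hits for either topic or index. When a duplicate occurs, I would prefer to match the topic to the index with the next highest value, until all 54 of my topics are assigned to a unique index. For example, in the sample below, index 349 is repeated in the first two rows, for topic 33 and topic 2. I want to keep index 349 assigned to topic 33, but topic 2 would then be assigned to the index with the next highest value, which is index 347 (row 4 in the sample). How can I do this in code for the whole data frame?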

Sample data

df <- structure(list(topic = c(33L, 2L, 33L, 2L, 33L, 13L, 33L, 2L, 
2L, 2L, 42L, 13L, 33L), index = c(349, 349, 363, 347, 342, 369, 
321, 366, 321, 363, 344, 370, 366), value = c(0.210311631079167, 
0.204938177956459, 0.201678820628508, 0.160801031631647, 0.160747075179686, 
0.154814646522019, 0.154102617910918, 0.137730410377001, 0.126294470150952, 
0.123695668664189, 0.110965846294849, 0.0999091218902647, 0.099824248465453
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))

Desired output

# topic 2 takes index 347 (row 4 of the sample), because index 349 is already assigned to topic 33
output <- structure(list(topic = c(33L, 2L, 13L, 42L), index = c(349, 347, 369, 344),
value = c(0.210311631079167, 0.160801031631647, 0.154814646522019, 0.110965846294849)),
row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

This code clearly falls short (and my sample above does not have 54 topics):

# Attempt: sort by value and keep the top rows; this does not enforce unique topic/index pairs
df2 <- df %>% group_by(topic, index) %>% arrange(-value) %>% top_n(54)

Answer 1

We can use slice_max() with by = topic to keep the highest-value row for each topic:

library(dplyr) # version >= 1.1.0
df %>%
    slice_max(value, n = 1, by = topic)

Output

# A tibble: 4 × 3
  topic index value
  <int> <dbl> <dbl>
1    33   349 0.210
2     2   349 0.205
3    13   369 0.155
4    42   344 0.111
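
Note that the output above still pairs index 349 with both topic 33 and topic 2, whereas the question text asks for topic 2 to fall back to index 347. One possible way to enforce uniqueness on both IDs is a greedy pass over the rows sorted by value, sketched below. The helper name greedy_unique_pairs() is hypothetical (not part of dplyr), and the sketch assumes that a greedy assignment is acceptable; it is not guaranteed to maximise the total value across all pairs, and a topic could end up unassigned if all of its candidate indexes are taken first.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  topic = c(33L, 2L, 33L, 2L, 33L, 13L, 33L, 2L, 2L, 2L, 42L, 13L, 33L),
  index = c(349, 349, 363, 347, 342, 369, 321, 366, 321, 363, 344, 370, 366),
  value = c(0.210311631079167, 0.204938177956459, 0.201678820628508,
            0.160801031631647, 0.160747075179686, 0.154814646522019,
            0.154102617910918, 0.137730410377001, 0.126294470150952,
            0.123695668664189, 0.110965846294849, 0.0999091218902647,
            0.099824248465453)
)

# Hypothetical helper: walk the rows from highest to lowest value and keep a
# row only if neither its topic nor its index has been used already.
greedy_unique_pairs <- function(data) {
  sorted <- arrange(data, desc(value))
  keep <- logical(nrow(sorted))   # all FALSE to start with
  used_topics <- c()
  used_indexes <- c()
  for (i in seq_len(nrow(sorted))) {
    current_topic <- sorted$topic[i]
    current_index <- sorted$index[i]
    if (!(current_topic %in% used_topics) && !(current_index %in% used_indexes)) {
      keep[i] <- TRUE
      used_topics <- c(used_topics, current_topic)
      used_indexes <- c(used_indexes, current_index)
    }
  }
  sorted[keep, ]
}

result <- greedy_unique_pairs(df)
print(result)
```

On the sample data this keeps topic 33 with index 349, then assigns topic 2 to index 347, topic 13 to index 369, and topic 42 to index 344. If the best overall assignment matters more than the row-by-row rule described in the question, this becomes an assignment problem and a dedicated solver would be a better fit than the loop above.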
