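This is a simplified example for illustration. I want a data summary presented in a predetermined order. I want to order the col2 values within each col1 value, and also include rows for factor levels of a col1 group that do not occur in the data (for example by using group_by(..., .drop = FALSE)). Some values in col2 appear in more than one col1 group. There is no logic that could be used to derive the col2 order. Could you call this a two-level factor?
For example, my input data might be: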
df <- read.table(
header = TRUE,
sep=",",
text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)
The output I need is:
col1 col2 n
Beatles John 0
Beatles Paul 1
Beatles George 1
Beatles Ringo 2
UK Artists Gilbert 1
UK Artists George 0
Tunnels Tom 2
Tunnels Dick 1
Tunnels Harry 0
The following, of course, does not work:
library(dplyr)

col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")
# col2 is one factor shared by all groups, so every col1 x col2 level pair is kept
df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n())
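To make the problem concrete, here is a minimal sketch of that attempt (assuming dplyr 1.0 or later; `res` is a name introduced only for this illustration). Because col2 is a single factor holding the union of all names, `.drop = FALSE` keeps every combination of col1 and col2 levels rather than only the names that belong to each group.

```r
library(dplyr)

df <- read.table(header = TRUE, sep = ",", text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
")

col1_order <- c("Beatles", "UK Artists", "Tunnels")
# union of all col2 names; a factor cannot repeat "George" per group
col2_order <- c("Tom", "Dick", "Harry", "John", "Paul", "George", "Ringo", "Gilbert")

res <- df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n(), .groups = "drop")

# 3 col1 levels x 8 col2 levels, instead of the 9 rows that are wanted
nrow(res)
```

The counts themselves are right; the issue is only that unwanted pairs such as Beatles/Tom show up with n = 0.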
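The only way forward I can see is to split the data by col1 level and to define the col2 factor order for each col1 level in a named list of vectors. While writing this question, I found that the following works: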
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  'UK Artists' = c("Gilbert", "George")
)
x <- lapply(col1_order, function(col1grp) {
  df %>%
    filter(col1 == col1grp) %>%
    mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>%
    group_by(col1, col2, .drop = FALSE) %>%
    summarise(n = n())
})
do.call(rbind, x)
Although I have found a solution that I think works for me, I am still posting this in case someone can offer a better one.
I'm not sure whether this is any better than yours! Using data.table, if I first set up the required order of col1 and col2 in a list like this:
l1 <- list(Beatles=data.frame(col2=c("John", "Paul", "George", "Ringo")),
           `UK Artists`=data.frame(col2=c("Gilbert", "George")),
           `Tunnels`=data.frame(col2=c("Tom", "Dick", "Harry")))
Then I can use data.table's rbindlist to turn the list into a single table, and join it with df to get the desired output in the specified order:
library(data.table)
setDT(df) # convert df to a data.table by reference so := and on= work
dt1 <- rbindlist(l1, idcol = "col1")
df[, n := 1][dt1, on = c("col1", "col2")][, sum(n, na.rm = TRUE), .(col1, col2)]
col1 col2 V1
1: Beatles John 0
2: Beatles Paul 1
3: Beatles George 1
4: Beatles Ringo 2
5: UK Artists Gilbert 1
6: UK Artists George 0
7: Tunnels Tom 2
8: Tunnels Dick 1
9: Tunnels Harry 0
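A variation on the same idea, as a minimal sketch: count first, then join, so that df is left unmodified and the count column gets a name instead of V1. It assumes R 4.0 or later (so data.frame() keeps strings as character); `counts` and `res` are names introduced only for this illustration.

```r
library(data.table)

df <- read.table(header = TRUE, sep = ",", text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
")

# desired order of col1 (list names) and of col2 within each col1
l1 <- list(Beatles = data.frame(col2 = c("John", "Paul", "George", "Ringo")),
           `UK Artists` = data.frame(col2 = c("Gilbert", "George")),
           Tunnels = data.frame(col2 = c("Tom", "Dick", "Harry")))
dt1 <- rbindlist(l1, idcol = "col1")

# count on a copy, so the original df is not changed by reference
counts <- as.data.table(df)[, .(n = .N), by = .(col1, col2)]

# x[i] join: one row per row of dt1, in dt1's order; unmatched pairs get NA
res <- counts[dt1, on = c("col1", "col2")]
res[is.na(n), n := 0L]
res
```

The row order comes entirely from dt1, so changing the order of the list entries or of the names inside them changes the order of the summary.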
With a join:
library(tidyverse)
enframe(col2_fctlist, name = "col1", value = "col2") %>%
  unnest(col2) %>%
  left_join(df %>% count(col1, col2), by = c("col1", "col2")) %>%
  replace_na(list(n = 0))
col1 col2 n
1 Tunnels Tom 2
2 Tunnels Dick 1
3 Tunnels Harry 0
4 Beatles John 0
5 Beatles Paul 1
6 Beatles George 1
7 Beatles Ringo 2
8 UK Artists Gilbert 1
9 UK Artists George 0
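Note that the rows above follow the order of the list (Tunnels first), not the order asked for in the question. A minimal sketch of one way to get the requested order, assuming the `col1_order` vector from the question: index the list by `col1_order` before unnesting. `res` is a name introduced only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- read.table(header = TRUE, sep = ",", text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
")

col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  `UK Artists` = c("Gilbert", "George")
)

# reorder the list first; left_join() then keeps the row order of its left side
res <- enframe(col2_fctlist[col1_order], name = "col1", value = "col2") %>%
  unnest(col2) %>%
  left_join(count(df, col1, col2), by = c("col1", "col2")) %>%
  replace_na(list(n = 0L))
res
```

No factors are needed here: the order lives in the lookup table built from the list, and the join only fills in the counts.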
Or with imap_dfr:
imap_dfr(col2_fctlist,
         ~ df %>%
           filter(col1 == .y) %>%
           mutate(col2 = factor(col2, levels = .x)) %>%
           count(col2, .drop = FALSE),
         .id = "col1")
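In purrr 1.0.0 and later, imap_dfr() is marked as superseded in favour of imap() followed by list_rbind(). A minimal sketch of the equivalent call, assuming purrr 1.0.0 or later and the `col1_order` vector from the question; the argument names `lvls` and `grp` and the result name `res` are introduced only for this illustration.

```r
library(dplyr)
library(purrr)

df <- read.table(header = TRUE, sep = ",", text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
")

col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  `UK Artists` = c("Gilbert", "George")
)

res <- col2_fctlist[col1_order] %>%
  # lvls: the ordered col2 names of one group; grp: that group's col1 value
  imap(function(lvls, grp) {
    df %>%
      filter(col1 == grp) %>%
      mutate(col2 = factor(col2, levels = lvls)) %>%
      count(col2, .drop = FALSE)
  }) %>%
  # stack the per-group results, storing the list names in col1
  list_rbind(names_to = "col1")
res
```

Indexing the list by `col1_order` puts the groups in the requested order; within each group the order comes from the factor levels.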