Redefining factor levels and order within groups

Question

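This is a simplified example for illustration. I want a summary of my data presented in a predetermined order. I want to order the col2 values according to the col1 value, and also include rows for factor levels within a col1 group that are not present in the data (for example by using group_by(..., .drop = FALSE)). Some col2 values appear in more than one col1 group. There is no logic that could be used to derive the col2 order. Could you call this a two-level factor?

For example, my input data might be: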

df <- read.table(
  header = TRUE,
  sep=",",
  text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)

The output I need is:

 col1       col2        n
 Beatles    John        0
 Beatles    Paul        1
 Beatles    George      1
 Beatles    Ringo       2
 UK Artists Gilbert     1
 UK Artists George      0
 Tunnels    Tom         2
 Tunnels    Dick        1
 Tunnels    Harry       0

The following, of course, does not work:

library(dplyr)

col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")

df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n())
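To make the failure concrete, here is a minimal sketch, assuming dplyr is installed and rebuilding the example data inline. With a single global set of col2 levels, .drop = FALSE keeps every combination of col1 and col2 levels, so each group is padded with names that do not belong to it. count() is used here as a shorthand for group_by() plus summarise(); the name res is only illustrative.

```r
library(dplyr)

# Same rows as the example input, rebuilt inline so the sketch is self-contained
df <- data.frame(
  col1 = c("Tunnels", "Tunnels", "Tunnels", "Beatles", "Beatles",
           "Beatles", "Beatles", "UK Artists"),
  col2 = c("Dick", "Tom", "Tom", "George", "Paul", "Ringo", "Ringo", "Gilbert")
)

col1_order <- c("Beatles", "UK Artists", "Tunnels")
# A factor cannot hold duplicate levels, so "George" appears only once here
col2_order <- unique(c("Tom", "Dick", "Harry",
                       "John", "Paul", "George", "Ringo",
                       "Gilbert", "George"))

res <- df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  count(col1, col2, .drop = FALSE)

# One row per combination of levels, not one row per wanted (col1, col2) pair
nrow(res)
```

The result has one row for each of the 3 x 8 level combinations instead of the 9 rows wanted, and col2 follows one global order rather than a separate order per group.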

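The only way forward I can see is to split the data by col1 level and define the col2 factor order for each level of col1 with a named list of vectors. While writing this question, I found that this works: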

col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  'UK Artists' = c("Gilbert", "George")
)

x <- lapply(col1_order, function(col1grp)
  df %>% filter(col1==col1grp) %>% 
    mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>% 
    group_by(col1, col2, .drop = FALSE) %>%
    summarise(n = n())
)

do.call(rbind, x)

Although I have found a solution that I think works for me, I am still posting this in case someone can offer a better one.

Answer 1

I don't know whether this is any better than yours! Using data.table, if I first set up the desired order of col1 and col2 in a list like this:

l1 <- list(Beatles = data.frame(col2 = c("John", "Paul", "George", "Ringo")),
           `UK Artists` = data.frame(col2 = c("Gilbert", "George")),
           Tunnels = data.frame(col2 = c("Tom", "Dick", "Harry")))

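Then I can use data.table's rbindlist to turn the list into a single table, and join it with df to get the desired output in the specified order:

library(data.table)
setDT(df)  # df must be a data.table for := and the join syntax below

dt1 <- rbindlist(l1, idcol = "col1")

df[, n := 1][dt1, on = c("col1", "col2")][, sum(n, na.rm = TRUE), .(col1, col2)]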

         col1    col2 V1
1:    Beatles    John  0
2:    Beatles    Paul  1
3:    Beatles  George  1
4:    Beatles   Ringo  2
5: UK Artists Gilbert  1
6: UK Artists  George  0
7:    Tunnels     Tom  2
8:    Tunnels    Dick  1
9:    Tunnels   Harry  0

Answer 2

With a join:

library(tidyverse)
enframe(col2_fctlist, name = "col1", value = "col2") %>% unnest(col2) %>% 
  left_join(df %>% count(col1, col2), by = c("col1", "col2")) %>% 
  replace_na(list(n = 0))

        col1    col2 n
1    Tunnels     Tom 2
2    Tunnels    Dick 1
3    Tunnels   Harry 0
4    Beatles    John 0
5    Beatles    Paul 1
6    Beatles  George 1
7    Beatles   Ringo 2
8 UK Artists Gilbert 1
9 UK Artists  George 0

Or with imap_dfr:

imap_dfr(col2_fctlist, 
         ~ df %>% 
           filter(col1 == .y) %>% 
           mutate(col2 = factor(col2, levels = .x)) %>% 
           count(col2, .drop = FALSE), 
         .id = "col1")
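Both variants above return the groups in the order of the list (Tunnels first), not in the col1 order asked for in the question. A possible adjustment, sketched here under the assumption that col1_order and col2_fctlist are defined as in the question, is to reorder the list by name before building the lookup table. The data is rebuilt inline and the name res is only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Same rows as the example input, rebuilt inline so the sketch is self-contained
df <- data.frame(
  col1 = c("Tunnels", "Tunnels", "Tunnels", "Beatles", "Beatles",
           "Beatles", "Beatles", "UK Artists"),
  col2 = c("Dick", "Tom", "Tom", "George", "Paul", "Ringo", "Ringo", "Gilbert")
)

col1_order <- c("Beatles", "UK Artists", "Tunnels")
col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  "UK Artists" = c("Gilbert", "George")
)

# Indexing the named list by col1_order fixes the group order;
# the order inside each vector fixes the col2 order within the group
res <- enframe(col2_fctlist[col1_order], name = "col1", value = "col2") %>%
  unnest(col2) %>%
  left_join(count(df, col1, col2), by = c("col1", "col2")) %>%
  replace_na(list(n = 0L))

res
```

Because left_join() keeps the row order of its left-hand table, the lookup table alone determines the final ordering, and no factors are needed.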
