A more concise way to write group_by/summarise code in dplyr?


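I have a data frame, call it "df", containing all detections of about 250 individually named birds over the past 5 years, roughly 11,000 rows. df has the columns DATE, BIRD, YEAR, MONTH, DAY and OUTCOME. A group_by/summarise command creates a new table, "df2", with one row per individual bird and one new column per month, containing "1" if the bird was seen that month and "0" if it was not detected. The columns are named in "YYMM" format, so March 2020 appears in the new table as column "2003". At the moment the command that builds the table runs to more than 60 lines, because I write one line per new column (50 months means 50 lines in my command); see below. Sample data: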

df <- data.frame(DATE = as.Date(c("02/16/18","03/16/18","03/16/18","04/16/18","05/16/18","05/19/18"),
                                format = "%m/%d/%y"),
                 BIRD = c("emww","emww","oaam","bbcm","bbcm","bbcm"),
                 YEAR = c(2018,2018,2018,2018,2018,2018),
                 MONTH = c(02,03,03,04,05,05),
                 OUTCOME = c(1,0,1,1,0,0))

The code works, but it gets very long:

library(dplyr)

df2 <- df %>% 
  group_by(BIRD) %>% 
  summarise(
    "1802" = as.numeric(any(YEAR==2018 & MONTH == 2 & OUTCOME==1)),
    "1803" = as.numeric(any(YEAR==2018 & MONTH == 3 & OUTCOME==1)),
    "1804" = as.numeric(any(YEAR==2018 & MONTH == 4 & OUTCOME==1)),
    "1805" = as.numeric(any(YEAR==2018 & MONTH == 5 & OUTCOME==1)),
    "1806" = as.numeric(any(YEAR==2018 & MONTH == 6 & OUTCOME==1)),
    "1807" = as.numeric(any(YEAR==2018 & MONTH == 7 & OUTCOME==1)),
    "1808" = as.numeric(any(YEAR==2018 & MONTH == 8 & OUTCOME==1)))

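(For a five-year study there would be 60 lines like the ones above. In each line I only edit the column name, the year and the month; everything else is identical.)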

I would love to be able to do something like

startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

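and then have R write out the big block of group_by/summarise code for those months, instead of editing it by hand. Does anyone know how to do this (or another, more efficient approach)?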

r dplyr group-by
2 Answers
0
votes
library(tidyverse)
df |>
  mutate(time = paste0(YEAR - 2000, str_pad(MONTH, width = 2, pad = "0"))) |>
  select(-YEAR, -MONTH) |>
  summarize(value = 1*any(OUTCOME == 1), .by = c(BIRD, time)) |>
  pivot_wider(names_from = time, values_from = value, values_fill = 0)

Result

# A tibble: 3 × 5
  BIRD  `1802` `1803` `1804` `1805`
  <chr>  <dbl>  <dbl>  <dbl>  <dbl>
1 emww       1      0      0      0
2 oaam       0      1      0      0
3 bbcm       0      0      1      0
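
As a side note, the YYMM label can also be derived directly from the DATE column rather than computed from YEAR and MONTH. The following is a minimal base-R sketch, assuming DATE is a proper Date column as in the sample data; the name time_labels is purely illustrative and not part of the answer.

```r
# A few detection dates taken from the question's sample data
dates <- as.Date(c("02/16/18", "03/16/18", "04/16/18", "05/19/18"),
                 format = "%m/%d/%y")

# "%y" is the two-digit year, "%m" the zero-padded month number
time_labels <- format(dates, "%y%m")
print(time_labels)
```

In the pipeline above, the mutate step could then plausibly read time = format(DATE, "%y%m"), which would also keep the leading zero for years 2000 to 2009. Either way, only months that actually occur in the data become columns.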

0
votes

My solution uses the start and end date parameters to decide how many empty months to fill in on either side of the recorded data.

library(tidyverse)

df <- data.frame(
  DATE = as.Date(c("02/16/18", "03/16/18", "03/16/18", "04/16/18", "05/16/18", "05/19/18"),
    format = "%m/%d/%y"
  ),
  BIRD = c("emww", "emww", "oaam", "bbcm", "bbcm", "bbcm"),
  YEAR = c(2018, 2018, 2018, 2018, 2018, 2018),
  MONTH = c(02, 03, 03, 04, 05, 05),
  OUTCOME = c(1, 0, 1, 1, 0, 0)
)

startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

# solution starts here 
# get the 1802 .... type month labels we ultimately want to produce
(datevec <- enframe(seq(startdate, enddate, by = "month")) |> mutate(
  ym = 100 * (year(value) %% 100) + month(value)
) |> pull(ym) |> unique())

# use tidyr::complete to populate absent ym dates.
(df2 <- df %>% mutate(ym = factor(100 * (YEAR %% 100) + MONTH,
  levels = datevec
)) |>
  select(-YEAR, -MONTH, -DATE) |> 
  summarise(seen = 1 * any(OUTCOME == 1),.by = c(BIRD,ym)) |>
  tidyr::complete(BIRD, ym, fill = list(seen = 0)) |>
  pivot_wider(
    names_from = "ym",
    values_from = "seen"
  )
)
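
One thing that may be worth guarding against: stepping by month from a start date late in the month (the 29th to the 31st) can roll over into the following month, so a month might be skipped. With the 16th used here this should not matter. A hedged base-R sketch of a more defensive way to build the labels, snapping the start to the first of the month and formatting the labels as text (first_of_month, month_starts and month_labels are illustrative names):

```r
startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate   <- as.Date("12/16/23", format = "%m/%d/%y")

# Snap the start date to the first day of its month so that every
# monthly step lands in a distinct calendar month
first_of_month <- as.Date(format(startdate, "%Y-%m-01"))

# One date per month from the start month through the end month
month_starts <- seq(first_of_month, enddate, by = "month")

# Character labels in YYMM form, e.g. "1802" for February 2018
month_labels <- format(month_starts, "%y%m")

print(length(month_labels))
print(head(month_labels, 3))
```

Because these labels are character strings rather than numbers, the ym column would need to be built the same way, for example factor(format(DATE, "%y%m"), levels = month_labels), before the complete step.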