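I have a data frame, call it "df", that contains every detection of roughly 250 named individual birds over the past 5 years, about 11,000 rows. df has the columns DATE, BIRD, YEAR, MONTH, DAY, and OUTCOME. A group_by/summarise command creates a new table, "df2", with one row per individual bird and one new column per month, containing "1" if the bird was seen that month and "0" if it was not detected. The columns are named in "YYMM" format, so March 2020 appears in the new table as column "2003". At the moment the command that builds the table runs to more than 60 lines: I write one line for each new column (50 months means 50 lines in my command); see below. Sample data: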
df <- data.frame(DATE = as.Date(c("02/16/18","03/16/18","03/16/18","04/16/18","05/16/18","05/19/18"),
format = "%m/%d/%y"),
BIRD = c("emww","emww","oaam","bbcm","bbcm","bbcm"),
YEAR = c(2018,2018,2018,2018,2018,2018),
MONTH = c(02,03,03,04,05,05),
OUTCOME = c(1,0,1,1,0,0))
The code works, but it gets very long:
library(dplyr)
df2 <- df %>%
  group_by(BIRD) %>%
  summarise(
    "1802" = as.numeric(any(YEAR==2018 & MONTH == 2 & OUTCOME==1)),
    "1803" = as.numeric(any(YEAR==2018 & MONTH == 3 & OUTCOME==1)),
    "1804" = as.numeric(any(YEAR==2018 & MONTH == 4 & OUTCOME==1)),
    "1805" = as.numeric(any(YEAR==2018 & MONTH == 5 & OUTCOME==1)),
    "1806" = as.numeric(any(YEAR==2018 & MONTH == 6 & OUTCOME==1)),
    "1807" = as.numeric(any(YEAR==2018 & MONTH == 7 & OUTCOME==1)),
    "1808" = as.numeric(any(YEAR==2018 & MONTH == 8 & OUTCOME==1)))
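(Over a five-year study there would be 60 lines like those above; in each one I only edit the column header and the year and month, and the rest of the line is identical.)
I would love to be able to do something like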
startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")
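and then have R write out the big group_by/summarise block for those months, instead of me editing it by hand. Does anyone know how to do this (or another, more efficient approach)?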
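library(tidyverse)
df |>
  # build the "YYMM" label, e.g. YEAR 2018 and MONTH 2 give "1802"
  mutate(time = paste0(YEAR - 2000, str_pad(MONTH, width = 2, pad = "0"))) |>
  select(-YEAR, -MONTH) |>
  # 1 if the bird had any positive detection that month, otherwise 0
  summarize(value = 1*any(OUTCOME == 1), .by = c(BIRD, time)) |>
  # one column per month; bird/month pairs absent from the data become 0
  pivot_wider(names_from = time, values_from = value, values_fill = 0)
Result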
# A tibble: 3 × 5
BIRD `1802` `1803` `1804` `1805`
<chr> <dbl> <dbl> <dbl> <dbl>
1 emww 1 0 0 0
2 oaam 0 1 0 0
3 bbcm 0 0 1 0
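Note that the pivot above only creates columns for months that actually occur in the data. A minimal sketch of one way to cover the whole study window, assuming the `startdate` and `enddate` values from the question and tidyr 1.2.0 or later for the `names_expand` argument; `first_of_month` and `month_labels` are names introduced here purely for illustration:

```r
library(dplyr)
library(tidyr)

# Sample data from the question (only the columns this sketch needs).
df <- data.frame(
  DATE = as.Date(
    c("02/16/18", "03/16/18", "03/16/18", "04/16/18", "05/16/18", "05/19/18"),
    format = "%m/%d/%y"
  ),
  BIRD = c("emww", "emww", "oaam", "bbcm", "bbcm", "bbcm"),
  OUTCOME = c(1, 0, 1, 1, 0, 0)
)
startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

# Hypothetical helper: snap a date to the first of its month, so that
# stepping by month cannot overflow past a short month.
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

# Every "YYMM" label in the window, in chronological order.
month_labels <- format(
  seq(first_of_month(startdate), first_of_month(enddate), by = "month"),
  "%y%m"
)

df2 <- df |>
  # a factor carries the full set of wanted months as its levels
  mutate(time = factor(format(DATE, "%y%m"), levels = month_labels)) |>
  summarize(value = 1 * any(OUTCOME == 1), .by = c(BIRD, time)) |>
  # names_expand = TRUE turns unused factor levels into columns as well
  pivot_wider(
    names_from = time, values_from = value,
    values_fill = 0, names_expand = TRUE
  )
```

Rows dated outside the window would get an `NA` label in this sketch, so it is probably worth filtering on `DATE` first if that can happen.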
My solution takes start and end date parameters, which control how many empty months are filled in on either side of the recorded data.
library(tidyverse)
df <- data.frame(
DATE = as.Date(c("02/16/18", "03/16/18", "03/16/18", "04/16/18", "05/16/18", "05/19/18"),
format = "%m/%d/%y"
),
BIRD = c("emww", "emww", "oaam", "bbcm", "bbcm", "bbcm"),
YEAR = c(2018, 2018, 2018, 2018, 2018, 2018),
MONTH = c(02, 03, 03, 04, 05, 05),
OUTCOME = c(1, 0, 1, 1, 0, 0)
)
startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")
# solution starts here
# get the 1802 .... type month labels we ultimately want to produce
(datevec <- enframe(seq(startdate, enddate, by = "month")) |>
  mutate(ym = 100 * (year(value) %% 100) + month(value)) |>
  pull(ym) |>
  unique())
# use tidyr::complete to populate absent ym dates.
(df2 <- df |>
  mutate(ym = factor(100 * (YEAR %% 100) + MONTH, levels = datevec)) |>
  select(-YEAR, -MONTH, -DATE) |>
  summarise(seen = 1 * any(OUTCOME == 1), .by = c(BIRD, ym)) |>
  tidyr::complete(BIRD, ym, fill = list(seen = 0)) |>
  pivot_wider(names_from = "ym", values_from = "seen")
)
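For comparison, the same table can be sketched in base R with `table()`, which counts over every factor level and so yields a column for each month in the window even when nothing was seen. This assumes the sample data and the `startdate` and `enddate` values from above; `month_labels`, `seen`, `counts`, and `detected` are illustrative names, not part of the original answer:

```r
# Sample data from the question (only the columns this sketch needs).
df <- data.frame(
  DATE = as.Date(
    c("02/16/18", "03/16/18", "03/16/18", "04/16/18", "05/16/18", "05/19/18"),
    format = "%m/%d/%y"
  ),
  BIRD = c("emww", "emww", "oaam", "bbcm", "bbcm", "bbcm"),
  OUTCOME = c(1, 0, 1, 1, 0, 0)
)
startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

# Every "YYMM" label between the two dates, stepping from the first of
# each month so the day of month cannot cause a skipped month.
month_labels <- format(
  seq(as.Date(format(startdate, "%Y-%m-01")),
      as.Date(format(enddate, "%Y-%m-01")),
      by = "month"),
  "%y%m"
)

# Keep positive detections only, then cross-tabulate bird by month.
# Factor levels keep birds that were never seen and months with no data.
seen <- df[df$OUTCOME == 1, ]
counts <- table(
  BIRD = factor(seen$BIRD, levels = unique(df$BIRD)),
  ym = factor(format(seen$DATE, "%y%m"), levels = month_labels)
)

# Collapse counts to a 0/1 matrix: one row per bird, one column per month.
detected <- 1 * (unclass(counts) > 0)
```

The result here is a numeric matrix with birds as row names rather than a tibble; something like `data.frame(BIRD = rownames(detected), detected, check.names = FALSE)` should turn it into a data frame if one is needed.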