在R中使用group_by, nesting(), complete()和计算庞大数据集的时间间隔。

Question

我在数据上挣扎了很久，不知道如何解决我的问题。我的工作是营养数据，可以通过这个数据集来伪造。

library(tidyverse)
library(lubridate)

# Used for data generation
groupFunction <- function(cat){
  case_when(
  cat == "apple" ~ "food",
  cat == "bread" ~ "food",
  cat == "cheese" ~ "food",
  cat == "chocolate" ~ "candy",
  cat == "water" ~ "drink",
  cat == "tea" ~ "drink"
  )
}

# Generate the data

set.seed(0)

fakeData <- tibble(
  id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
  eaten_at = sample(seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/05'), by="15 min"), 40),
  category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
  group = groupFunction(category), 
  amount = sample(10:100, 40)
)

# For every id, for each day, every hour and each category: sum the eaten amount, 
# and keep 0 eaten amount so it is encounted in the mean calculation in step 2!
# PROBLEM: we loose time intervals where a given id didn't eat anything, this will
# biais the mean calculation in step 2!
step1 <- fakeData %>%
  mutate(eaten_at_hour = hour(eaten_at)) %>%
  group_by(id, eaten_at, eaten_at_hour, category, group) %>%
  summarise(eaten_amount = sum(amount)) %>%
  ungroup() %>%
  complete(nesting(id, eaten_at, eaten_at_hour), 
           nesting(category, group), 
           fill = list(eaten_amount = 0)) 

# For every id, mean over the days the eaten amount for every hour interval. 
# As before, keep 0 counts so it's encounted in the mean calculation in step 3!
step2 <- step1 %>%
  group_by(id, eaten_at_hour, category, group) %>%
  summarise(mean_per_id = mean(eaten_amount)) %>%
  ungroup() %>%
  complete(nesting(id, eaten_at_hour),
           nesting(category, group),
           fill = list(mean_per_id = 0))

# Mean over all id 
step3 <- step2 %>%
  group_by(eaten_at_hour, category, group) %>%
  summarise(mean_for_all = mean(mean_per_id)) %>%
  ungroup() 

# Plot the data
ggplot(step3, aes(x=eaten_at_hour, y=category, color = mean_for_all, shape = group)) +
  geom_point( size = 3) + 
  scale_color_gradient(low="blue", high="red", "Mean eaten\namount [g]")

我想建立的是一个X轴上有1小时时间间隔的图，Y轴上有不同的食物类别，以及24小时内每个X分钟内所有ID的平均食用量（即时间间隔必须灵活）。我希望得到这样的图。

我的想法是计算：

对于每个ID，
因为这个ID每天都在吃东西。
每隔X小时（即使id没有吃任何东西）。
并为每一类食物。

->将吃过的数量相加

然后，

对于每个ID，
每一类别。
并在参与的日子里，每隔1小时。

-> 平均吃的数量

然后。

-把所有的ID平均起来，这样我们就可以得到每一个类别和24小时内每隔1小时的平均吃的量

为此，我使用了group_by()、nesting()和complete()函数。但是我有3个问题，我希望能够设置所需的时间间隔，可以是15分钟，也可以是2小时。

我希望能够设置所需的时间间隔，可以是15分钟，也可以是2小时。我还没有找到任何解决方案。
我需要有所有id的所有时间间隔，即使他们没有吃任何东西（所以amout = 0），因为当我对天数或id之间的平均数时，如果我不包括零计数，平均数会被二元化。
我的实际数据集包括大约10万行，所以我认为我的方法在效率方面不是最合适的。此外，我想为这些数据设计一个闪亮的应用程序，例如，用户可以手动设置时间间隔，这意味着绘图必须一次又一次地计算（当代码效率不高时，对计算机来说有很多工作......）。

我知道我的问题完全是针对一个特定的问题，但由于我真的被堵住了，我非常感谢任何关于我的一个或两个问题的helpinputsideas。非常感谢!

Answer 1

我不确定我是否完全理解你的问题，但这里有一个回答的草稿。

首先，在区间上进行分组的一个棘手的方法是对小时进行下限（使用 lubridate::hour)除以步长，再将结果乘以步长。然后，我按id、小时、组进行分组求和，再按只有小时和组计算平均值。

eaten_n_hours = 2
df = fakeData %>% 
    mutate(hour = floor(hour(eaten_at)/eaten_n_hours)*eaten_n_hours) %>% 
    group_by(id, hour, group) %>% 
    summarise(amount = sum(amount, na.rm=TRUE)) %>% 
    group_by(hour, group) %>%
    summarise(amount_m = mean(amount, na.rm=T),
              amount_sd = sd(amount, na.rm=T)) %>%
    identity()

然后，你可以像这样绘制整个事情。

breaks_hour = seq(min(df$hour), max(df$hour)+1, eaten_n_hours)
ggplot(df, aes(x=hour, y=amount_m, group=group, color=group, fill=group))+
    geom_col(position="dodge") +
    # geom_errorbar(aes(ymin=amount_m-amount_sd, ymax=amount_m+amount_sd), position="dodge") +
    scale_x_binned(breaks=breaks_hour)

这不是最漂亮的图，但我不知道是由于我对问题的不理解，还是由于例子的原因 fakeData.

编辑

我对瓷砖不熟悉，但你可以尝试使用 geom_tiles 这种方式。另外，使用 scales::breaks_width 允许有一个灵活的时间间隔。

ggplot(df, aes(x=hour, y=group, fill=amount_m))+
    geom_tile()+
    scale_x_binned(breaks=scales::breaks_width(3)) # try other values

Answer 2

所以我想出了一个方法来实现这个功能（感谢@Dan Chaltiel），虽然不是很完美，但我会把它贴在这里，以便对其他人或对讨论有用。


library(tidyverse)
library(lubridate)

# Used for data generation
groupFunction <- function(cat){
  case_when(
    cat == "apple" ~ "food",
    cat == "bread" ~ "food",
    cat == "cheese" ~ "food",
    cat == "chocolate" ~ "candy",
    cat == "water" ~ "drink",
    cat == "tea" ~ "drink"
  )
}

# Generate the data
set.seed(0)

fakeData <- tibble(
  id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
  eaten_at = sample(seq(as.POSIXct('2020/01/01 22:00:00'), as.POSIXct('2020/01/05'), by="17 min"), 40),
  category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
  group = groupFunction(category), 
  amount = sample(10:100, 40)
)

# Set time interval in minutes here (0-60 min only): 
set_time <- 60

# Generate time sequence for one day (1440 seconds), with the desired interval input. Then set it as factor. 
timeLevels <- seq(from = as.POSIXct("2020-1-1 0:00"), by = paste(set_time, "min", sep = " "), length.out = 1440/set_time)
timeLevels <- paste(hour(timeLevels), minute(timeLevels), sep = ":")

# Calculate the means, keeping zero counts
toPlot <- fakeData %>%
  mutate(eaten_hour = floor_date(eaten_at, unit = paste(set_time, "min", sep = " ")), 
         eaten_hour = paste(hour(eaten_hour), minute(eaten_hour), sep = ":"),
         eaten_hour = factor(eaten_hour, levels = timeLevels),
         eaten_date = date(eaten_at)) %>%
  group_by(eaten_date, eaten_hour, category, group) %>%
  summarise(sum_amount = sum(amount)) %>%
  ungroup() %>%
  complete(eaten_date, eaten_hour, nesting(category, group), fill = list(sum_amount = 0)) %>%
  group_by(eaten_hour, category, group) %>%
  summarise(mean_amount = mean(sum_amount)) %>% 
  ungroup()

# Plot the data
gg <- ggplot(toPlot, aes(x=eaten_hour, y=category, fill=mean_amount))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + coord_equal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gg <- gg + labs(x = "Time of the day", y = NULL, title = "Mean eaten quantity over one day", fill = "Mean amount [g]")
gg

输出是这样的：

仍然欢迎任何关于如何改进我的代码的意见!

在R中使用group_by, nesting(), complete()和计算庞大数据集的时间间隔。

问题描述投票：0回答：1

1个回答

编辑

最新问题

在R中使用group_by, nesting(), complete()和计算庞大数据集的时间间隔。

问题描述 投票：0回答：1

1个回答

编辑

最新问题

问题描述投票：0回答：1