我在数据上挣扎了很久,不知道如何解决我的问题。我的工作是营养数据,可以通过这个数据集来伪造。
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01'), as.POSIXct('2020/01/05'), by="15 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# For every id, for each day, every hour and each category: sum the eaten amount,
# and keep 0 eaten amount so it is encounted in the mean calculation in step 2!
# PROBLEM: we loose time intervals where a given id didn't eat anything, this will
# biais the mean calculation in step 2!
step1 <- fakeData %>%
mutate(eaten_at_hour = hour(eaten_at)) %>%
group_by(id, eaten_at, eaten_at_hour, category, group) %>%
summarise(eaten_amount = sum(amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at, eaten_at_hour),
nesting(category, group),
fill = list(eaten_amount = 0))
# For every id, mean over the days the eaten amount for every hour interval.
# As before, keep 0 counts so it's encounted in the mean calculation in step 3!
step2 <- step1 %>%
group_by(id, eaten_at_hour, category, group) %>%
summarise(mean_per_id = mean(eaten_amount)) %>%
ungroup() %>%
complete(nesting(id, eaten_at_hour),
nesting(category, group),
fill = list(mean_per_id = 0))
# Mean over all id
step3 <- step2 %>%
group_by(eaten_at_hour, category, group) %>%
summarise(mean_for_all = mean(mean_per_id)) %>%
ungroup()
# Plot the data
ggplot(step3, aes(x=eaten_at_hour, y=category, color = mean_for_all, shape = group)) +
geom_point( size = 3) +
scale_color_gradient(low="blue", high="red", "Mean eaten\namount [g]")
我想建立的是一个X轴上有1小时时间间隔的图,Y轴上有不同的食物类别,以及24小时内每个X分钟内所有ID的平均食用量(即时间间隔必须灵活)。我希望得到这样的图。
我的想法是计算:
->将吃过的数量相加
然后,
-> 平均吃的数量
然后。
-把所有的ID平均起来,这样我们就可以得到每一个类别和24小时内每隔1小时的平均吃的量
为此,我使用了group_by()、nesting()和complete()函数。但是我有3个问题,我希望能够设置所需的时间间隔,可以是15分钟,也可以是2小时。
我希望能够设置所需的时间间隔,可以是15分钟,也可以是2小时。我还没有找到任何解决方案。
我需要有所有id的所有时间间隔,即使他们没有吃任何东西(所以amout = 0),因为当我对天数或id之间的平均数时,如果我不包括零计数,平均数会被二元化。
我的实际数据集包括大约10万行,所以我认为我的方法在效率方面不是最合适的。此外,我想为这些数据设计一个闪亮的应用程序,例如,用户可以手动设置时间间隔,这意味着绘图必须一次又一次地计算(当代码效率不高时,对计算机来说有很多工作......)。
我知道我的问题完全是针对一个特定的问题,但由于我真的被堵住了,我非常感谢任何关于我的一个或两个问题的helpinputsideas。非常感谢!
我不确定我是否完全理解你的问题,但这里有一个回答的草稿。
首先,在区间上进行分组的一个棘手的方法是对小时进行下限(使用 lubridate::hour
)除以步长,再将结果乘以步长。然后,我按id、小时、组进行分组求和,再按只有小时和组计算平均值。
eaten_n_hours = 2
df = fakeData %>%
mutate(hour = floor(hour(eaten_at)/eaten_n_hours)*eaten_n_hours) %>%
group_by(id, hour, group) %>%
summarise(amount = sum(amount, na.rm=TRUE)) %>%
group_by(hour, group) %>%
summarise(amount_m = mean(amount, na.rm=T),
amount_sd = sd(amount, na.rm=T)) %>%
identity()
然后,你可以像这样绘制整个事情。
breaks_hour = seq(min(df$hour), max(df$hour)+1, eaten_n_hours)
ggplot(df, aes(x=hour, y=amount_m, group=group, color=group, fill=group))+
geom_col(position="dodge") +
# geom_errorbar(aes(ymin=amount_m-amount_sd, ymax=amount_m+amount_sd), position="dodge") +
scale_x_binned(breaks=breaks_hour)
这不是最漂亮的图,但我不知道是由于我对问题的不理解,还是由于例子的原因 fakeData
.
我对瓷砖不熟悉,但你可以尝试使用 geom_tiles
这种方式。另外,使用 scales::breaks_width
允许有一个灵活的时间间隔。
ggplot(df, aes(x=hour, y=group, fill=amount_m))+
geom_tile()+
scale_x_binned(breaks=scales::breaks_width(3)) # try other values
所以我想出了一个方法来实现这个功能(感谢@Dan Chaltiel),虽然不是很完美,但我会把它贴在这里,以便对其他人或对讨论有用。
library(tidyverse)
library(lubridate)
# Used for data generation
groupFunction <- function(cat){
case_when(
cat == "apple" ~ "food",
cat == "bread" ~ "food",
cat == "cheese" ~ "food",
cat == "chocolate" ~ "candy",
cat == "water" ~ "drink",
cat == "tea" ~ "drink"
)
}
# Generate the data
set.seed(0)
fakeData <- tibble(
id = c(rep("A", 10), rep("B", 10), rep("C", 10), rep("D", 10)),
eaten_at = sample(seq(as.POSIXct('2020/01/01 22:00:00'), as.POSIXct('2020/01/05'), by="17 min"), 40),
category = sample(rep(c("apple", "bread", "cheese", "chocolate", "water", "tea"), 10), 40),
group = groupFunction(category),
amount = sample(10:100, 40)
)
# Set time interval in minutes here (0-60 min only):
set_time <- 60
# Generate time sequence for one day (1440 seconds), with the desired interval input. Then set it as factor.
timeLevels <- seq(from = as.POSIXct("2020-1-1 0:00"), by = paste(set_time, "min", sep = " "), length.out = 1440/set_time)
timeLevels <- paste(hour(timeLevels), minute(timeLevels), sep = ":")
# Calculate the means, keeping zero counts
toPlot <- fakeData %>%
mutate(eaten_hour = floor_date(eaten_at, unit = paste(set_time, "min", sep = " ")),
eaten_hour = paste(hour(eaten_hour), minute(eaten_hour), sep = ":"),
eaten_hour = factor(eaten_hour, levels = timeLevels),
eaten_date = date(eaten_at)) %>%
group_by(eaten_date, eaten_hour, category, group) %>%
summarise(sum_amount = sum(amount)) %>%
ungroup() %>%
complete(eaten_date, eaten_hour, nesting(category, group), fill = list(sum_amount = 0)) %>%
group_by(eaten_hour, category, group) %>%
summarise(mean_amount = mean(sum_amount)) %>%
ungroup()
# Plot the data
gg <- ggplot(toPlot, aes(x=eaten_hour, y=category, fill=mean_amount))
gg <- gg + geom_tile(color="white", size=0.1)
gg <- gg + coord_equal() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
gg <- gg + labs(x = "Time of the day", y = NULL, title = "Mean eaten quantity over one day", fill = "Mean amount [g]")
gg
仍然欢迎任何关于如何改进我的代码的意见!