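I have a data frame, df, containing TV viewing data, and I would like to run a QC check for overlapping viewing. For the same day, same household, and same individual, each minute should be credited to only one station or channel.
For example, I would like to flag rows 8 and 9, because it seems impossible for one individual in one household to watch two TV stations (62 and 67) at the same time (same start_hour_minute). Is there a way to flag these rows? Essentially, I need a view by minute, by individual, by day.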
df <- data.frame(stringsAsFactors=FALSE,
  date = c("2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
           "2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
           "2018-09-02"),
  householdID = c(18101276L, 18101276L, 18102843L, 18102843L, 18102843L,
                  18102843L, 18104148L, 18104148L, 18104148L),
  Station_id = c(74L, 74L, 62L, 74L, 74L, 74L, 62L, 62L, 67L),
  IndID = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"),
  Start = c(111300L, 143400L, 030000L, 034900L, 064400L, 070500L, 060400L,
            075100L, 075100L),
  End = c(111459L, 143759L, 033059L, 035359L, 064759L, 070559L, 060459L,
          81559L, 81559L),
  start_hour_minute = c(1113L, 1434L, 0300L, 0349L, 0644L, 0705L, 0604L, 0751L, 0751L),
  end_hour_minute = c(1114L, 1437L, 0330L, 0353L, 0647L, 0705L, 0604L, 0815L, 0815L))
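You can group by the variables that you think should identify a single row (for example, the household-date-minute combination), then count the rows (or the unique values in Station_id) in each group and add flag = 1 if the row should be flagged, and flag = 0 otherwise.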
library(dplyr)

df %>%
  # Add IndID here if a household can contain several individuals.
  group_by(date, householdID, start_hour_minute) %>%
  mutate(flag = if_else(n() == 1, 0, 1))
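The other option mentioned above, counting the unique values in Station_id rather than the rows, might look like the sketch below. The data frame `views` is made up for illustration and is not the asker's data. It includes a pair of rows with the same minute and the same station, which a row count would flag but a distinct-station count does not.

```r
library(dplyr)

# Hypothetical sample: one household, one individual, five viewing records.
views <- data.frame(
  date = "2018-09-02",
  householdID = 18104148L,
  IndID = "aa",
  Station_id = c(62L, 62L, 67L, 74L, 74L),
  start_hour_minute = c(604L, 751L, 751L, 900L, 900L)
)

flagged <- views %>%
  group_by(date, householdID, IndID, start_hour_minute) %>%
  # Flag only when more than one distinct station shares the same minute.
  mutate(flag = if_else(n_distinct(Station_id) > 1, 1, 0)) %>%
  ungroup()
```

Whether a repeated record for the same station should count as a problem is a QC decision, so the choice between n() and n_distinct() depends on what you want to catch.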
Or, if you want every variable other than Station_id to match, you can do this:
df %>%
  group_by_at(vars(-Station_id)) %>%
  mutate(flag = if_else(n() == 1, 0, 1))
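In dplyr 1.0.0 and later, group_by_at() is superseded by across(). A minimal sketch of the same grouping in that style follows, using a made-up data frame `views` rather than the data from the question.

```r
library(dplyr)

# Hypothetical sample: rows 2 and 3 differ only in Station_id.
views <- data.frame(
  date = "2018-09-02",
  householdID = 18104148L,
  IndID = "aa",
  Station_id = c(62L, 62L, 67L),
  Start = c(60400L, 75100L, 75100L),
  End = c(60459L, 81559L, 81559L)
)

flagged <- views %>%
  # Group by every column except Station_id.
  group_by(across(-Station_id)) %>%
  mutate(flag = if_else(n() == 1, 0, 1)) %>%
  ungroup()
```

Note that this only flags rows whose start and end times match exactly; partially overlapping sessions end up in different groups and are not flagged.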
The lubridate package has an interval class and an %within% function that checks whether a timestamp falls inside an interval. You can use them to build the flag.
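As a quick illustration of those two pieces on their own, here is a small sketch with made-up timestamps (the names `watch`, `inside`, and `outside` are only for this example):

```r
library(lubridate)

# An interval running from 07:51:00 to 08:15:59 on one day.
watch <- interval(ymd_hms("2018-09-02 07:51:00"),
                  ymd_hms("2018-09-02 08:15:59"))

# %within% returns TRUE when the timestamp lies inside the interval.
inside  <- ymd_hms("2018-09-02 08:00:00") %within% watch
outside <- ymd_hms("2018-09-02 09:00:00") %within% watch
```

Here `inside` is TRUE and `outside` is FALSE.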
Using the dummy data you provided above:
library(dplyr)
library(lubridate)

data_out <- df %>%
  # Split the HHMMSS integers into hour, minute, and second values.
  mutate(
    date = ymd(date),
    Start_Hour = Start %/% 10000,
    Start_Minute = (Start %% 10000) %/% 100,
    Start_Second = Start %% 100,
    End_Hour = End %/% 10000,
    End_Minute = (End %% 10000) %/% 100,
    End_Second = End %% 100,
    # Combine the date with those values to create start and end timestamps.
    Start_TS = as_datetime(date) + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second),
    End_TS = as_datetime(date) + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second)
  ) %>%
  # Group by day, household, and individual (not by station, since the
  # overlaps of interest are across stations).
  group_by(date, householdID, IndID) %>%
  # Flag a row when its start falls inside more than one interval in the group
  # (every start is trivially inside its own interval).
  # The intervals are built inline rather than stored as a column, because
  # dplyr doesn't always play nicely with interval objects.
  mutate(
    overlap_flag = vapply(
      seq_along(Start_TS),
      function(i) as.numeric(sum(Start_TS[i] %within% interval(Start_TS, End_TS)) > 1),
      numeric(1)
    )
  ) %>%
  ungroup()
Use data_out %>% filter(overlap_flag == 1) to view the flagged rows.
Note: the dplyr and lubridate packages don't always work well together, especially in older versions. You may need to update both packages.
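Checking only whether a start time falls inside another interval can miss the earlier row of a partially overlapping pair. As a possible alternative that avoids interval objects altogether, the sketch below compares the HHMMSS integers directly, which is valid as long as all records in a group are on the same day. The data frame `views` is made up for illustration, and two sessions are treated as overlapping when each starts before the other ends.

```r
library(dplyr)

# Hypothetical sample: in household 1, the second session sits inside the first.
views <- data.frame(
  householdID = c(1L, 1L, 1L, 2L, 2L),
  IndID = "aa",
  Station_id = c(62L, 67L, 74L, 62L, 74L),
  Start = c(75100L, 80000L, 90000L, 60400L, 70500L),
  End = c(81559L, 80459L, 90459L, 60459L, 70559L)
)

flagged <- views %>%
  group_by(householdID, IndID) %>%
  mutate(
    # For each row, count the sessions in the group that overlap it.
    # A row always overlaps itself, so a count above 1 means a real overlap.
    overlap_flag = vapply(
      seq_along(Start),
      function(i) as.integer(sum(Start <= End[i] & Start[i] <= End) > 1),
      integer(1)
    )
  ) %>%
  ungroup()
```

With this sample, both of the first two rows are flagged and the remaining three are not. The pairwise comparison is quadratic in the group size, which should be fine for per-person daily viewing logs but may need rethinking for very large groups.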