I have a large data table, part of which looks like this (the real table has more columns and thousands of rows):
stop_id path changed_event_status changed_time
<i64> <char> <char> <i64>
1: 4398037956893976209 S <NA> 2405071040
2: 1500925206899141237 RT <NA> 2405071041
3: 2333532852925690131 S <NA> 2405071105
4: 4636036529075799544 TÜ <NA> 2405071044
5: 4680830034956468939 S <NA> 2405071046
6: 7584560746915960683 S c 2405071049 <- 1a: replace 2405071049
7: 2333532852925690131 RT <NA> 2405071116
8: 4747322524233582527 S <NA> 2405071100 <- 1b: with 2405071100
9: 285273127640529713 S <NA> 2405071103
10: 6134967434625106066 S <NA> 2405071101
11: 3684003552999415659 RT <NA> 2405071103 <- 2b: with 2405071103
12: 7584560746915960683 RT c 2405071058 <- 2a: replace 2405071058
13: 4680830034956468939 TÜ <NA> 2405071103
14: 8123621717351038368 S <NA> 2405071113
15: 8702942397103782624 TÜ <NA> 2405071114
16: 6134967434625106066 TÜ <NA> 2405071114
17: 4138386908727054325 S <NA> 2405071115
18: 285273127640529713 RT <NA> 2405071123
19: 2445758245483744446 S <NA> 2405071119
20: 8153934371487726263 TÜ <NA> 2405071132
21: 4138386908727054325 RT <NA> 2405071126
22: 310332233182112225 S <NA> 2405071127
stop_id path changed_event_status changed_time
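For each row where changed_event_status == "c", I need to look at all rows with the same path and find the smallest value in the changed_time column that is equal to or greater than the value in the current row. For example, I need to replace 2405071049 in row 6 with 2405071100 from row 8, and 2405071058 in row 12 with 2405071103 from row 11.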
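To make the rule concrete, here is a minimal sketch for the single flagged row 6, using the unflagged changed_time values on path S from the table above. The names s_times, current, and next_time are hypothetical, and plain numeric vectors stand in for the integer64 column for simplicity.

```r
# Unflagged changed_time values on path "S"
# (rows 1, 3, 5, 8, 9, 10, 14, 17, 19, 22 of the table)
s_times <- c(2405071040, 2405071105, 2405071046, 2405071100, 2405071103,
             2405071101, 2405071113, 2405071115, 2405071119, 2405071127)

# changed_time of the flagged row 6
current <- 2405071049

# Smallest candidate that is equal to or greater than the current value
next_time <- min(s_times[s_times >= current])
next_time
```

The result is the value from row 8, which is the replacement wanted for row 6.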
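I can do this with a foreach() loop: subset by path, sort by changed_time, and then take the first value that is equal to or greater than the current row's. I would like to know whether there is a faster solution that avoids the loop.
I looked at some related questions, but I could not adapt them to my specific case.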
Sample data:
library(data.table)
bahn <- fread("
stop_id path changed_event_status changed_time
4398037956893976209 S NA 2405071040
1500925206899141237 RT NA 2405071041
2333532852925690131 S NA 2405071105
4636036529075799544 TÜ NA 2405071044
4680830034956468939 S NA 2405071046
7584560746915960683 S c 2405071049
2333532852925690131 RT NA 2405071116
4747322524233582527 S NA 2405071100
285273127640529713 S NA 2405071103
6134967434625106066 S NA 2405071101
3684003552999415659 RT NA 2405071103
7584560746915960683 RT c 2405071058
4680830034956468939 TÜ NA 2405071103
8123621717351038368 S NA 2405071113
8702942397103782624 TÜ NA 2405071114
6134967434625106066 TÜ NA 2405071114
4138386908727054325 S NA 2405071115
285273127640529713 RT NA 2405071123
2445758245483744446 S NA 2405071119
8153934371487726263 TÜ NA 2405071132
4138386908727054325 RT NA 2405071126
310332233182112225 S NA 2405071127
")
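Loop solution:
for (i in seq_len(nrow(bahn))) {
  # Only rows flagged "c" are updated; && short-circuits on NA status
  if (!is.na(bahn[i, changed_event_status]) && bahn[i, changed_event_status] == "c") {
    # Smallest changed_time among unflagged rows on the same path that is
    # equal to or greater than the current row's changed_time
    bahn[i, ]$changed_time <- sort(
      bahn[
        is.na(changed_event_status) &
          changed_time >= bahn[i, changed_time] &
          path == bahn[i, path]
      ]$changed_time
    )[1]
  }
}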
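If you don't mind reordering the table:
# Sort by path, then changed_time, and blank out the flagged times
setorder(bahn, path, changed_time, -changed_event_status)[
  changed_event_status == "c", changed_time := NA
]
# Fill each blank with the next observation carried backward (NOCB)
setnafill(bahn, "nocb", cols = "changed_time")
bahn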
#> Index: <changed_event_status>
#> stop_id path changed_event_status changed_time
#> <i64> <char> <char> <i64>
#> 1: 1500925206899141237 RT <NA> 2405071041
#> 2: 7584560746915960683 RT c 2405071103
#> 3: 3684003552999415659 RT <NA> 2405071103
#> 4: 2333532852925690131 RT <NA> 2405071116
#> 5: 285273127640529713 RT <NA> 2405071123
#> 6: 4138386908727054325 RT <NA> 2405071126
#> 7: 4398037956893976209 S <NA> 2405071040
#> 8: 4680830034956468939 S <NA> 2405071046
#> 9: 7584560746915960683 S c 2405071100
#> 10: 4747322524233582527 S <NA> 2405071100
#> 11: 6134967434625106066 S <NA> 2405071101
#> 12: 285273127640529713 S <NA> 2405071103
#> 13: 2333532852925690131 S <NA> 2405071105
#> 14: 8123621717351038368 S <NA> 2405071113
#> 15: 4138386908727054325 S <NA> 2405071115
#> 16: 2445758245483744446 S <NA> 2405071119
#> 17: 310332233182112225 S <NA> 2405071127
#> 18: 4636036529075799544 TÜ <NA> 2405071044
#> 19: 4680830034956468939 TÜ <NA> 2405071103
#> 20: 8702942397103782624 TÜ <NA> 2405071114
#> 21: 6134967434625106066 TÜ <NA> 2405071114
#> 22: 8153934371487726263 TÜ <NA> 2405071132
#> stop_id path changed_event_status changed_time
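If the original row order has to be kept, a rolling join might be an alternative. The following is only a sketch under stated assumptions: it uses a small toy table with hypothetical column names (status, time) and plain integer times rather than the question's integer64 data, and the helper names ref, flagged, and hits are made up for illustration. With roll = -Inf, data.table carries the next observation backward, so each flagged row is matched to the nearest unflagged row on the same path whose time is equal to or later than its own.

```r
library(data.table)

# Toy data (hypothetical values, not the question's table)
dt <- data.table(
  path   = c("S", "S", "S", "RT", "RT", "S"),
  status = c(NA, "c", NA, "c", NA, NA),
  time   = c(1040L, 1049L, 1100L, 1058L, 1103L, 1046L)
)

# Candidate rows are the unflagged ones; new_time keeps a copy of their
# time so it can be read back after the join
ref <- dt[is.na(status), .(path, time, new_time = time)]

# Row numbers of the flagged rows (NA status counts as not flagged)
flagged <- dt[status == "c", which = TRUE]

# Rolling join: exact match on path, then roll backward on time, so each
# flagged row picks the nearest candidate with time >= its own time.
# mult = "first" guarantees one result per flagged row.
hits <- ref[dt[flagged], on = .(path, time), roll = -Inf,
            mult = "first", x.new_time]

# Update in place; the row order of dt is untouched
dt[flagged, time := hits]
dt
```

In this toy table the two flagged rows end up with 1100 and 1103. Because path is part of the join key, a flagged row with no later candidate on its own path gets NA instead of a value from a different path.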