Calculating timestamps for data with a known frequency and missing data

Question

My data looks like the table below. Rows of type "S" carry a timestamp, and I need to assign timestamps to the "D" rows.

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     NA                        229
 3 M     NA                         NA
 4 D     NA                        230
 5 D     NA                        231
 6 D     NA                        232
 7 D     NA                        233
 8 D     NA                        234
 9 D     NA                        235
10 D     NA                        236
11 D     NA                        237
12 D     NA                        238
13 D     NA                        239
14 S     2024-01-24 16:11:11.000    NA
15 D     NA                        241
16 D     NA                        242
17 D     NA                        243
18 D     NA                        126
19 D     NA                        127
20 S     2024-01-24 16:13:29.000    NA
21 D     NA                        128

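"count" is a 1-byte counter that runs from 0 to 255 and then repeats. A missing count indicates a missing data row. Data lines are sent at 16 Hz, so each count increment represents 1/16 second. I am trying to assign the correct timestamp to each D row by taking the timestamp of the nearest S row and offsetting it by the count difference between the current D row and the D row immediately following that S row. Normally S lines arrive once per second, but I chose this subset to show some of the problems in the data, mainly the 2:18 (min:sec) gap following row 17.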
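To make the count arithmetic concrete, here is a minimal sketch in R. The helper name `count_offset()` is hypothetical, and it assumes the D row lies within 128 counts (8 seconds at 16 Hz) of the reference row, so that a single wrap of the 1-byte counter can be undone with modular arithmetic.

```r
# Hypothetical helper: signed offset in seconds between a D row's count and
# the count of the D row that immediately follows the nearest S row.
# Assumes the two rows are fewer than 128 counts (8 s at 16 Hz) apart.
count_offset <- function(count, ref_count, hz = 16, modulus = 256) {
  half <- modulus / 2
  # Map the raw difference into [-128, 127] so a 255 -> 0 wrap is handled.
  steps <- (count - ref_count + half) %% modulus - half
  steps / hz
}

count_offset(229, 241)  # 12 counts before the reference row
count_offset(2, 254)    # 4 counts after the reference row, across a wrap
```

Adding the result to the S timestamp gives the D row's timestamp. For row 2 of the sample (count 229, reference count 241, S time 16:11:11) the offset is -0.75 s, that is, 16:11:10.250.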

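I have found a method that works, but it is very slow: about 4 ms per row, on files that span multiple days at roughly 1 million rows per day. The real data lives in files whose lines come in several formats (ick), and the times and counts in this example were parsed out of those. This sounds like a made-up coding puzzle, but sadly, this system is real.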

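If you would like to see my slow solution or more complete data, they are in this file in the repository: https://github.com/blongworth/mlabtools/blob/main/R/time_alignment.R

The data above is simplified, so the method in the repo will not work on the reprex data without modification. There are some tests, but none yet that establish what the result for this reprex should be.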

Any ideas on how to do this efficiently? I may eventually have to move to data.table, but as long as I start from more efficient logic, I think I can get there.

Here is the dput output for the test df above:

structure(list(type = c("$", "D", "M", "D", "D", "D", "D", "D", 
"D", "D", "D", "D", "D", "S", "D", "D", "D", "D", "D", "S", "D"
), timestamp = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 1706130671, NA, NA, NA, NA, NA, 1706130809, NA
), tzone = "America/New_York", class = c("POSIXct", "POSIXt")), 
    count = c(NA, 229L, NA, 230L, 231L, 232L, 233L, 234L, 235L, 
    236L, 237L, 238L, 239L, NA, 241L, 242L, 243L, 126L, 127L, 
    NA, 128L)), row.names = c(NA, -21L), class = c("tbl_df", 
"tbl", "data.frame"))

Here is the sample data with the expected output:

   type  timestamp               count
   <chr> <dttm>                  <int>
 1 $     NA                         NA
 2 D     2024-01-24 16:11:10.250   229
 3 M     NA                         NA
 4 D     2024-01-24 16:11:10.312   230
 5 D     2024-01-24 16:11:10.375   231
 6 D     2024-01-24 16:11:10.437   232
 7 D     2024-01-24 16:11:10.500   233
 8 D     2024-01-24 16:11:10.562   234
 9 D     2024-01-24 16:11:10.625   235
10 D     2024-01-24 16:11:10.687   236
11 D     2024-01-24 16:11:10.750   237
12 D     2024-01-24 16:11:10.812   238
13 D     2024-01-24 16:11:10.875   239
14 S     2024-01-24 16:11:11.000    NA
15 D     2024-01-24 16:11:11.000   241
16 D     2024-01-24 16:11:11.062   242
17 D     2024-01-24 16:11:11.125   243
18 D     2024-01-24 16:13:28.875   126
19 D     2024-01-24 16:13:28.937   127
20 S     2024-01-24 16:13:29.000    NA
21 D     2024-01-24 16:13:29.000   128

Answer 1

Here's a shot at it, with some timestamp gymnastics.

library(dplyr)
library(tidyr) # fill
df |>
  mutate(count2 = count, nexttime = timestamp, prevtime = timestamp) |>
  tidyr::fill(count2, .direction = "updown") |>
  mutate(
    count2 = count2 + 256*cumsum(c(FALSE, diff(count2) < 0)),
    nextind = if_else(is.na(timestamp), count2[NA], count2),
    prevind = nextind
  ) |>
  tidyr::fill(prevtime, prevind, .direction = "down") |>
  tidyr::fill(nexttime, nextind, .direction = "up") |>
  mutate(
    newtimestamp = case_when(
      !is.na(timestamp) ~ timestamp,
      is.na(prevtime) | abs(count2 - nextind) < abs(count2 - prevind) ~
        nexttime + (count2 - nextind)/16,
      TRUE ~
        prevtime + (count2 - prevind)/16
    )
  ) |>
  select(names(df), newtimestamp)
# # A tibble: 21 × 4
#    type  timestamp               count newtimestamp           
#    <chr> <dttm>                  <int> <dttm>                 
#  1 $     NA                         NA 2024-01-24 16:11:10.250
#  2 D     NA                        229 2024-01-24 16:11:10.250
#  3 M     NA                         NA 2024-01-24 16:11:10.312
#  4 D     NA                        230 2024-01-24 16:11:10.312
#  5 D     NA                        231 2024-01-24 16:11:10.375
#  6 D     NA                        232 2024-01-24 16:11:10.437
#  7 D     NA                        233 2024-01-24 16:11:10.500
#  8 D     NA                        234 2024-01-24 16:11:10.562
#  9 D     NA                        235 2024-01-24 16:11:10.625
# 10 D     NA                        236 2024-01-24 16:11:10.687
# 11 D     NA                        237 2024-01-24 16:11:10.750
# 12 D     NA                        238 2024-01-24 16:11:10.812
# 13 D     NA                        239 2024-01-24 16:11:10.875
# 14 S     2024-01-24 16:11:11.000    NA 2024-01-24 16:11:11.000
# 15 D     NA                        241 2024-01-24 16:11:11.000
# 16 D     NA                        242 2024-01-24 16:11:11.062
# 17 D     NA                        243 2024-01-24 16:11:11.125
# 18 D     NA                        126 2024-01-24 16:13:28.875
# 19 D     NA                        127 2024-01-24 16:13:28.937
# 20 S     2024-01-24 16:13:29.000    NA 2024-01-24 16:13:29.000
# 21 D     NA                        128 2024-01-24 16:13:29.000
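Note that in the output above the `$` and `M` rows also receive a value in `newtimestamp`, whereas the expected output in the question leaves them `NA`. If that matters, one option is to blank those rows afterwards. The sketch below uses base R on a small stand-in data frame; `res` is a hypothetical name for the result of the pipeline, and the values are illustrative rather than taken from real output.

```r
# Stand-in for the pipeline result (`res` is a hypothetical name).
t0 <- as.POSIXct("2024-01-24 16:11:11", tz = "UTC")
res <- data.frame(
  type = c("$", "D", "M", "S", "D"),
  newtimestamp = t0 + c(-0.125, -0.125, -0.0625, 0, 0)
)

# Keep computed timestamps on D rows and original ones on S rows;
# every other row type is set back to NA.
res$newtimestamp[!res$type %in% c("D", "S")] <- NA

is.na(res$newtimestamp)
```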

Notes:

`count2` is just `count` with the `NA`s fully filled in.

`nexttime`/`prevtime` carry the `timestamp` backward and forward until there is another non-`NA` timestamp; the `case_when` chooses which of the two to use. `nextind`/`prevind` are subtracted from `count2` so that the offset in 1/16-second steps can be computed.

`case_when` is where most of the logic happens: it determines whether to use the original `timestamp`, or an offset of `(count2-nextind)/16` (or `prevind`) seconds from `nexttime` (`prevtime`).
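The unwrapping expression from the first `mutate()` can be looked at in isolation. The sketch below applies it to the filled counts for rows 15 to 21 of the example; it is a small base R illustration of that one expression, not a replacement for the pipeline.

```r
# Filled counts for rows 15 to 21 of the example (the S row takes the count
# of the D row that follows it).
count2 <- c(241, 242, 243, 126, 127, 128, 128)

# Every drop in the count is treated as one wrap of the 1-byte counter,
# so 256 is added from that point on.
unwrapped <- count2 + 256 * cumsum(c(FALSE, diff(count2) < 0))
unwrapped

# Offset in seconds of row 18 (unwrapped count 382) from the S row (384).
(unwrapped[4] - unwrapped[6]) / 16
```

A drop is counted as a single wrap even when many rows are missing. That appears to be sufficient for this example, because each row is only ever compared with its nearest S row.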
