I'm new here, so apologies if this has been answered before. I'm trying to identify sequences in a dataset.
Example:
id | lengthofstay |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
2 | 1 |
2 | 2 |
3 | 1 |
3 | 2 |
3 | 3 |
3 | 4 |
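I then want to create a variable that is 0 while a sequence is still in progress and 1 when the sequence ends.
For example:
id | lengthofstay | newvariable |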
---|---|---|
1 | 1 | 0 |
1 | 2 | 0 |
1 | 3 | 1 |
2 | 1 | 0 |
2 | 2 | 1 |
3 | 1 | 0 |
3 | 2 | 0 |
3 | 3 | 0 |
3 | 4 | 1 |
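Thanks in advance for your help. Please let me know if there is anything else useful I can provide.
I'm very new to R, so I'm sorry, but I don't know where to start. Thanks again for your help!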
You can use `ave` + `max`:
> transform(df, newvariable = +(ave(lengthofstay, id, FUN = max) == lengthofstay))
  id lengthofstay newvariable
1  1            1           0
2  1            2           0
3  1            3           1
4  2            1           0
5  2            2           1
6  3            1           0
7  3            2           0
8  3            3           0
9  3            4           1
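To see why this works, here is a minimal sketch that breaks the one-liner into steps. It rebuilds the example data (an assumption, since this answer does not show how `df` was created) and uses a hypothetical intermediate name, `group_max`:

```r
# Example data reconstructed from the question (assumed, not given in this answer)
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  lengthofstay = c(1:3, 1:2, 1:4)
)

# ave() applies FUN within each id and returns a vector as long as the input,
# so every row receives the maximum lengthofstay of its own group.
group_max <- ave(df$lengthofstay, df$id, FUN = max)

# The comparison is TRUE on the last row of each sequence;
# the unary + converts TRUE/FALSE to 1/0.
df$newvariable <- +(group_max == df$lengthofstay)
df
```

Note that `FUN` has to be passed by name, because `ave` treats every unnamed argument after the first as a grouping variable.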
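You can exploit the fact that the difference between consecutive elements of a series is always greater than zero, except where the series restarts: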
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  lengthofstay = c(1:3, 1:2, 1:4)
)
df$newvariable <- c(diff(df$lengthofstay) <= 0, 1)
df
##   id lengthofstay newvariable
## 1  1            1           0
## 2  1            2           0
## 3  1            3           1
## 4  2            1           0
## 5  2            2           1
## 6  3            1           0
## 7  3            2           0
## 8  3            3           0
## 9  3            4           1
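This solution does not use the `id` column and relies only on `lengthofstay`, so it assumes the rows are already in order and that every new sequence starts at a value no greater than the one the previous sequence ended on.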
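If the values of `lengthofstay` cannot be relied on to restart lower, a similar base R sketch can compare neighbouring `id` values instead. This is a variation on the same idea rather than part of the original answer, and it assumes that rows belonging to the same `id` are contiguous:

```r
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  lengthofstay = c(1:3, 1:2, 1:4)
)

# A row ends a sequence when the next row belongs to a different id;
# the final row always ends one, hence the trailing TRUE.
next_id_differs <- df$id[-1] != df$id[-nrow(df)]
df$newvariable <- as.integer(c(next_id_differs, TRUE))
df
```

Here `next_id_differs` is an illustrative name only. The trade-off is the opposite of the `diff` approach: this version ignores `lengthofstay` entirely and depends only on where `id` changes.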
A data.table solution:
library(data.table)
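dt = data.table(
  id = c(rep(1, 3), rep(2, 2), rep(3, 4)),
  lengthofstay = c(1:3, 1:2, 1:4)
)
# Flag the row with the largest lengthofstay within each id (1 = end of sequence).
# as.integer() converts TRUE/FALSE to 1/0, so no placeholder column is needed first.
dt[, newvariable := as.integer(lengthofstay == max(lengthofstay)), by = id]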
dt
Using `dplyr`:
# If you don't have `dplyr` installed run:
# install.packages("dplyr")
library(dplyr)
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3)
)
df %>%
  # All computations are performed within group by `id`:
  group_by(id) %>%
  # `mutate` creates columns sequentially. We create `lengthofstay` first and
  # then we can use `lengthofstay` in the creation of `newvariable`.
  mutate(
    # `row_number` returns the order of each row within group.
    lengthofstay = row_number(),
    # `lengthofstay == max(lengthofstay)` returns a logical with `FALSE` if the
    # row is not the last element of the group, and `TRUE` otherwise.
    # `as.numeric` then converts `FALSE` to 0 and `TRUE` to 1.
    newvariable = as.numeric(lengthofstay == max(lengthofstay))
  ) %>%
  # We don't need to have the data grouped anymore so we call `ungroup`.
  ungroup()
This may be a bit verbose, but (in my opinion) it is also easier to read.
Without comments:
library(dplyr)
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3)
)
df %>%
  group_by(id) %>%
  mutate(
    lengthofstay = row_number(),
    newvariable = as.numeric(lengthofstay == max(lengthofstay))
  ) %>%
  ungroup()
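Here is a more general `dplyr` solution. It is more general because it groups by `id` and also takes adjacent values into account. Specifically, it records a 1 if the observation is the last one for its `id`, or if the next value of `lengthofstay` is not equal to the current value + 1.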
library(dplyr)
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  lengthofstay = c(1:3, 1:2, 1:4)
)
df %>%
  group_by(id) %>%
  mutate(newvariable = ifelse(row_number() == n() | lengthofstay + 1 != lead(lengthofstay), 1, 0))
#> # A tibble: 9 × 3
#> # Groups:   id [3]
#>      id lengthofstay newvariable
#>   <dbl>        <int>       <dbl>
#> 1     1            1           0
#> 2     1            2           0
#> 3     1            3           1
#> 4     2            1           0
#> 5     2            2           1
#> 6     3            1           0
#> 7     3            2           0
#> 8     3            3           0
#> 9     3            4           1
Created on 2023-04-11 with reprex v2.0.2
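To illustrate where the adjacent-value check matters, here is a small sketch on made-up data in which a single `id` has a stay counter that restarts partway through. The data frame `df_gap` and the column names `by_max` and `by_lead` are hypothetical and only serve to put the two rules side by side:

```r
library(dplyr)

# Hypothetical data: one id whose counter restarts after the third row
df_gap <- data.frame(
  id = c(1, 1, 1, 1, 1),
  lengthofstay = c(1:3, 1:2)
)

res <- df_gap %>%
  group_by(id) %>%
  mutate(
    # Flags only the rows equal to the group maximum
    by_max = as.numeric(lengthofstay == max(lengthofstay)),
    # Flags every row where a run stops, including the restart.
    # On the last row lead() is NA, but TRUE | NA is still TRUE.
    by_lead = ifelse(row_number() == n() | lengthofstay + 1 != lead(lengthofstay), 1, 0)
  ) %>%
  ungroup()

res
```

With this input, `by_max` marks only the third row, while `by_lead` marks both the third and the fifth, so the two rules agree only when each `id` contains exactly one uninterrupted run.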