I have a data frame in which each row holds the interaction data for one person.
actions = read.table('C:/Users/Desktop/actions.csv', header = F, sep = ',', na.strings = '', stringsAsFactors = F)
Each person can perform one or more of the following interactions:
eat, sleep, walk, jump, hop, wake, run
The number of actions recorded per person varies, for example:
P1: eat, sleep, sleep, sleep
P2: wake, walk, eat, walk, walk, jump, jump, run, run
P3: wake, eat, walk, jump, run, sleep
To make the lengths equal, each row is padded with NA at the end:
P1: eat, sleep, sleep, sleep, NA, NA, NA, NA, NA
P2: wake, walk, eat, walk, walk, jump, jump, run, run
P3: wake, eat, walk, jump, run, sleep, NA, NA, NA
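Now I need to update each person's entries (row-wise) so that no two consecutive entries are identical. Preserving the order is essential. The output I need is: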
P1: eat, sleep, NA, NA, NA, NA, NA, NA, NA
P2: wake, walk, eat, walk, jump, run, NA, NA, NA
P3: wake, eat, walk, jump, run, sleep, NA, NA, NA
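Column names default to V1, V2, V3, ..., Vn, where
n = maximum length of interactions string
In the example above, P2 has the longest sequence, so n = 9. The columns in the example therefore run from V1 to V9.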
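For readers who want to reproduce the padded layout without the CSV file, here is a minimal base R sketch. The names `seqs`, `padded` and `actions_demo` are made up for illustration and are not part of the original question; the sketch assumes the raw sequences are available as a list of character vectors.

```r
# Hypothetical ragged input: one character vector per person
seqs <- list(
  P1 = c("eat", "sleep", "sleep", "sleep"),
  P2 = c("wake", "walk", "eat", "walk", "walk", "jump", "jump", "run", "run"),
  P3 = c("wake", "eat", "walk", "jump", "run", "sleep")
)

# n is the length of the longest sequence
n <- max(lengths(seqs))

# `length<-` extends a vector to the requested length, filling with NA
padded <- lapply(seqs, `length<-`, n)

# Stack the padded vectors as rows; columns are then named V1..Vn
actions_demo <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(actions_demo) <- paste0("V", seq_len(n))

print(actions_demo)
```

The row names P1 to P3 come from the list names; a file read with `read.table()` would carry default row names instead.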
The output of dput(actions) is:
structure(list(V1 = c("S", "C", "R"), V2 = c("C", "C", "R"),
V3 = c("R", "C", "R"), V4 = c("S", NA, "R"), V5 = c("C",
NA, "R"), V6 = c("R", NA, NA), V7 = c("S", NA, NA), V8 = c("C",
NA, NA), V9 = c("R", NA, NA)), class = "data.frame", row.names = c(NA,-3L))
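The question "Removing Only Adjacent Duplicates in Data Frame in R" is somewhat similar to mine, but there are several differences. Even after adapting the code from that question, I could not solve my problem.
Any suggestions would be greatly appreciated!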
library(tidyverse)
read.csv(text=gsub(" +", "", "P1, eat, sleep, sleep, sleep, NA, NA, NA, NA, NA
P2, wake, walk, eat, walk, walk, jump, jump, run, run
P3, wake, eat, walk, jump, run, sleep, NA, NA, NA"),
header = FALSE, stringsAsFactors = FALSE) %>%
setNames(c("person", sprintf("i%s", 1:9))) %>% as_tibble() -> xdf  # tbl_df() is deprecated
de_dup <- function(x) {
# remove consecutive dups and keep order
interactions <- rle(unlist(x, use.names = FALSE)[-1])$values
# fill in NAs
interactions <- c(interactions, rep(NA_character_, length(x[-1])-length(interactions)))
# return a data frame
as.data.frame(as.list(setNames(c(x[1], interactions), names(x))), stringsAsFactors=FALSE)
}
rowwise(xdf) %>%
do(de_dup(.)) %>%
ungroup()
## # A tibble: 3 x 10
## person i1 i2 i3 i4 i5 i6 i7 i8 i9
## * <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 P1 eat sleep NA NA NA NA NA NA NA
## 2 P2 wake walk eat walk jump run NA NA NA
## 3 P3 wake eat walk jump run sleep NA NA NA
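The core of `de_dup()` is the call to `rle()`. A minimal base R sketch of that step on a single vector may make it easier to follow; `dedup_pad` is a hypothetical helper name, and the sketch assumes NA appears only as trailing padding (an interior NA would be dropped, which could make its two neighbours adjacent).

```r
# What rle() reports for a small vector:
r <- rle(c("a", "a", "b", "c", "c", "c", "d"))
r$lengths  # 2 1 3 1
r$values   # "a" "b" "c" "d"

# Hypothetical helper: collapse runs of equal values, then pad back with NA
dedup_pad <- function(x) {
  n <- length(x)
  # Drop the NA padding, then keep one value per run (order is preserved)
  vals <- rle(x[!is.na(x)])$values
  # Restore the original length with NA at the end
  c(vals, rep(NA_character_, n - length(vals)))
}

dedup_pad(c("eat", "sleep", "sleep", "sleep", NA, NA, NA, NA, NA))
dedup_pad(c("wake", "walk", "eat", "walk", "walk", "jump", "jump", "run", "run"))
```

Dropping the NA values before calling `rle()` is a design choice of this sketch: `rle()` treats every NA as a run of its own, so removing them first keeps the length arithmetic easy to reason about.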
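The requested explanation:
Since the duplicates run across columns, the most straightforward approach (not necessarily the fastest or the least memory/CPU intensive) is to rebuild the data frame row by row.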
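- rowwise() is a tidyverse function that groups a data frame by row.
- We need to perform an operation on each row (via do()) and pass the row to a function we have created, which keeps the code more readable and easier to update (compared with confusing inline anonymous {} blocks full of semicolons and newlines). The . stands for the whole row.
- The x parameter of de_dup() will be a named list (read the documentation for do()).
- We pass the unlist()ed vector to rle(), but without the first element. Dropping it is not strictly necessary (the person ID is unique), but it keeps the logic explicit, since you know you are dealing with the person column. Look at the output of rle(c("a", "a", "b", "c", "c", "c", "d")) to see what it does. rle stands for run-length encoding, and it is purpose-built for needs like yours.
- The return value of rle() has a values element that contains the de-duplicated elements. Note that rle() treats every NA as a separate run, so the trailing NA padding is still in values; only the repeated non-NA entries are collapsed, which makes the vector shorter than the original row.
- We then need to fill the row back up with NAs. There are many ways to do this; I like this one.
- We need to return a data frame (read the documentation for do()), so we build a named character vector and turn it into a data frame.
- When do() is finished we still have a row-wise grouped data frame, so we need to ungroup it.

Here is a simple approach using base R. I wrote a function that replaces consecutive duplicates with NA and then reorders the row so that the remaining values come first: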
# function to check consecutive duplicates
ccd <- function(x) {
# first value can never be duplicate so initiating to 0
test <- c(0, sapply(1:(length(x)-1), function(i) anyDuplicated(x[i:(i+1)])))
x[test > 0] <- NA_character_
x[order(test)]
}
# Original df from dput
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 S C R S C R S C R
2 C C C <NA> <NA> <NA> <NA> <NA> <NA>
3 R R R R R <NA> <NA> <NA> <NA>
for(r in 1:nrow(df)) {
df[r, ] <- ccd(as.character(df[r, ]))
}
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 S C R S C R S C R
2 C <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 R <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
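The explicit row loop can also be expressed with `apply()`. The following is only a sketch of an alternative, not the answer's method: it compares each element with its left neighbour instead of calling `anyDuplicated()`, and the names `df_demo`, `drop_repeats` and `res` are hypothetical. It assumes all columns are character and that NA occurs only as trailing padding.

```r
# Same data as the dput() output in the question
df_demo <- structure(list(
  V1 = c("S", "C", "R"), V2 = c("C", "C", "R"), V3 = c("R", "C", "R"),
  V4 = c("S", NA, "R"),  V5 = c("C", NA, "R"),  V6 = c("R", NA, NA),
  V7 = c("S", NA, NA),   V8 = c("C", NA, NA),   V9 = c("R", NA, NA)),
  class = "data.frame", row.names = c(NA, -3L))

# Hypothetical row function: keep an element only if it differs from its
# left neighbour, then pad with NA back to the original length
drop_repeats <- function(x) {
  n <- length(x)
  x <- unname(x[!is.na(x)])
  keep <- c(TRUE, x[-1] != x[-length(x)])
  x <- x[keep]
  c(x, rep(NA_character_, n - length(x)))
}

# apply() returns one column per input row, so transpose the result
res <- as.data.frame(t(apply(as.matrix(df_demo), 1, drop_repeats)),
                     stringsAsFactors = FALSE)
names(res) <- names(df_demo)
print(res)
```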
Demonstration on the example from the post:
df <- read.csv(
text=gsub(" +", "", "P1, eat, sleep, sleep, sleep, NA, NA, NA, NA, NA
P2, wake, walk, eat, walk, walk, jump, jump, run, run
P3, wake, eat, walk, jump, run, sleep, NA, NA, NA"),
header = FALSE, stringsAsFactors = FALSE)[, -1]
> df
V2 V3 V4 V5 V6 V7 V8 V9 V10
1 eat sleep sleep sleep <NA> <NA> <NA> <NA> <NA>
2 wake walk eat walk walk jump jump run run
3 wake eat walk jump run sleep <NA> <NA> <NA>
for(r in 1:nrow(df)) {
df[r, ] <- ccd(as.character(df[r, ]))
}
> df
V2 V3 V4 V5 V6 V7 V8 V9 V10
1 eat sleep <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 wake walk eat walk jump run <NA> <NA> <NA>
3 wake eat walk jump run sleep <NA> <NA> <NA>
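To confirm that a result really has no adjacent repeats left, a small check like the one below can help. `has_adjacent_dups` is a hypothetical helper written for this illustration, and the two tiny data frames are made-up test inputs rather than data from the post.

```r
# Hypothetical check: TRUE if any row still contains two equal neighbours.
# NA values are ignored, so trailing NA padding does not count as a repeat.
has_adjacent_dups <- function(d) {
  m <- as.matrix(d)
  n <- ncol(m)
  if (n < 2) return(FALSE)
  # Compare every column with the column to its left
  any(m[, -1, drop = FALSE] == m[, -n, drop = FALSE], na.rm = TRUE)
}

before <- data.frame(V1 = c("eat", "wake"), V2 = c("sleep", "walk"),
                     V3 = c("sleep", "eat"), stringsAsFactors = FALSE)
after  <- data.frame(V1 = c("eat", "wake"), V2 = c("sleep", "walk"),
                     V3 = c(NA, "eat"), stringsAsFactors = FALSE)

has_adjacent_dups(before)  # TRUE: "sleep", "sleep" in row 1
has_adjacent_dups(after)   # FALSE
```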
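A different possibility, combining dplyr, tidyr (for gather()), reshape2 (for dcast()) and base R. First, it identifies the consecutive duplicates and replaces them with NA. Then it shifts the non-NA values to the left.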
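library(dplyr)     # %>%, group_by(), mutate(), lag()
library(tidyr)     # gather()
library(reshape2)  # dcast()
as.data.frame(t(apply(df %>%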
gather(var, val, -V1) %>%
group_by(V1) %>%
mutate(val2 = ifelse(val == lag(val), NA, val),
val2 = ifelse(var == "V2", paste(val), val2)) %>%
dcast(V1~var, value.var = "val2"), 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 P1 eat sleep <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 P2 wake walk eat walk jump run <NA> <NA> <NA>
3 P3 wake eat walk jump run sleep <NA> <NA> <NA>
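The two steps described for this approach (mark repeats as NA, then shift the remaining values left) can be seen in isolation on a single vector. This is a base R sketch under the assumption that the row is a plain character vector; `mark_and_shift` is a hypothetical name, not something defined in the answer.

```r
# Hypothetical helper showing the two steps on one row
mark_and_shift <- function(x) {
  # Step 1: an element is a repeat when it equals its left neighbour
  prev <- c(NA, head(x, -1))
  is_dup <- !is.na(x) & !is.na(prev) & x == prev
  x[is_dup] <- NA
  # Step 2: move the non-NA values to the front, NA values to the back
  c(x[!is.na(x)], x[is.na(x)])
}

mark_and_shift(c("wake", "walk", "eat", "walk", "walk", "jump", "jump", "run", "run"))
```

Because step 2 moves every NA to the end, the non-NA values keep their relative order, which is the ordering requirement stated in the question.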
Data (using the code from @Shree):
df <- read.csv(text = gsub(" +", "", "P1, eat, sleep, sleep, sleep, NA, NA, NA, NA, NA
P2, wake, walk, eat, walk, walk, jump, jump, run, run
P3, wake, eat, walk, jump, run, sleep, NA, NA, NA"),
header = FALSE, stringsAsFactors = FALSE)