I have a dataset like this:
id date customer_id
1 02/03/2018 undefined
1 04/23/2018 12
1 05/22/2018 12
1 06/25/2018 undefined
2 01/14/2017 undefined
2 02/23/2018 undefined
2 03/04/2018 23
2 04/04/2018 23
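I would like to group this data by id and sort it by date. Here is the part I can't figure out: for each sorted group, I want to check that the customer_id values are "undefined" followed by numbers. In the example above, id == 2 is the one I want to keep, because it has "undefined" first and only the number after that. The idea is that while customer_id is undefined the person is not a customer yet; once they become a customer, the value changes to the customer ID number. So in this case id == 1 is a bad record, and I want to discard it and keep only id == 2.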
data %>% group_by(id) %>%
arrange(date) %>% "code to keep only records that have all
the undefined in customer_id together and after only numbers,
in this case, I want to only keep id == 2 records"
Thanks.
You can try:
library(dplyr)
df %>%
group_by(id) %>%
filter(all(diff(row_number()[customer_id == 'undefined']) == 1) & customer_id[n()] != 'undefined')
Output:
# A tibble: 4 x 3
# Groups: id [1]
id date customer_id
<int> <fct> <fct>
1 2 01/14/2017 undefined
2 2 02/23/2018 undefined
3 2 03/04/2018 23
4 2 04/04/2018 23
This code assumes your data frame is already sorted by date. Otherwise:
df %>%
arrange(as.Date(date, "%m/%d/%Y")) %>%
group_by(id) %>%
filter(all(diff(row_number()[customer_id == 'undefined']) == 1) &
customer_id[n()] != 'undefined')
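Basically, for each group we check whether the differences between the row numbers of the undefined cases are always 1 (i.e. they are consecutive), and whether the last value is not undefined. Groups that satisfy both conditions are kept (in your case, id 2).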
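To make the two conditions concrete, here is a minimal base R sketch that applies the same logic to plain vectors. The helper name `is_valid_group` and the vectors `id1`, `id2` are hypothetical, introduced only for illustration; they mirror the customer_id values of the two groups in the question.

```r
# Hypothetical helper (not part of the answer above): TRUE when the positions
# of "undefined" are consecutive and the last value is a real customer id.
is_valid_group <- function(customer_id) {
  undefined_pos <- which(customer_id == "undefined")
  consecutive <- all(diff(undefined_pos) == 1)  # all(logical(0)) is TRUE
  ends_defined <- customer_id[length(customer_id)] != "undefined"
  consecutive && ends_defined
}

# customer_id values of the two groups, already sorted by date
id1 <- c("undefined", "12", "12", "undefined")
id2 <- c("undefined", "undefined", "23", "23")

pos1 <- which(id1 == "undefined")  # positions 1 and 4
gap1 <- diff(pos1)                 # 3, so the positions are not consecutive

res1 <- is_valid_group(id1)        # FALSE
res2 <- is_valid_group(id2)        # TRUE
```

Note that this check, as written, would also accept a group in which the undefined run sits in the middle, such as c("12", "undefined", "23"). If that pattern can occur in your data, an extra condition on the first value of the group may be needed.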
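You can do this by checking whether the run-length encoding of the customer_id column has length 1 or 2, which means the id does not switch between defined and undefined more than once: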
data <- read.table(text="id date customer_id
1 02/03/2018 undefined
1 04/23/2018 12
1 05/22/2018 12
1 06/25/2018 undefined
2 01/14/2017 undefined
2 02/23/2018 undefined
2 03/04/2018 23
2 04/04/2018 23", header = T, stringsAsFactors=F)
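data$date <- as.Date(data$date, "%m/%d/%Y")
# "undefined" cannot be parsed as an integer, so it becomes NA (with a coercion warning)
data$customer_id <- as.integer(data$customer_id)
library(dplyr)  # provides %>%
data %>%
dplyr::group_by(id) %>%
dplyr::arrange(date, .by_group=T) %>%
# keep groups with at most two runs of NA / non-NA values whose last value is a real id
dplyr::filter(length(rle(is.na(customer_id))$values) < 3 && !is.na(tail(customer_id, 1)))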
# A tibble: 4 x 3
# Groups: id [1]
id date customer_id
<int> <date> <int>
1 2 2017-01-14 NA
2 2 2018-02-23 NA
3 2 2018-03-04 23
4 2 2018-04-04 23
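Note that you also have to make sure the last item of each group is valid; otherwise a group that goes from a valid ID to undefined would pass the test.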
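As a rough illustration of what the run-length encoding looks like for each group, the following base R sketch applies `rle()` to the NA pattern of plain vectors. The helper `is_valid_rle` and the vectors `id1`, `id2`, `reversed` are hypothetical names used only for this example.

```r
# customer_id values per group after as.integer(), sorted by date
id1 <- c(NA, 12L, 12L, NA)
id2 <- c(NA, NA, 23L, 23L)

runs1 <- rle(is.na(id1))  # values: TRUE FALSE TRUE  (three runs)
runs2 <- rle(is.na(id2))  # values: TRUE FALSE       (two runs)

# Hypothetical helper: at most two runs, and the last value is not NA
is_valid_rle <- function(customer_id) {
  runs <- rle(is.na(customer_id))
  length(runs$values) < 3 && !is.na(tail(customer_id, 1))
}

res1 <- is_valid_rle(id1)  # FALSE: three runs
res2 <- is_valid_rle(id2)  # TRUE

# A group that goes from a valid id to NA also has only two runs,
# so it is the check on the last value that rejects it.
reversed <- c(12L, 12L, NA)
res3 <- is_valid_rle(reversed)  # FALSE
```

The `reversed` vector shows why the run count alone is not enough: it has two runs, just like `id2`, and only the test on the final element tells the two apart.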