Check whether two values of a column follow each other within each group

Question  Votes: 0  Answers: 2

I have a dataset like the following:

 id  date        customer_id
 1   02/03/2018   undefined
 1   04/23/2018   12
 1   05/22/2018   12
 1   06/25/2018   undefined
 2   01/14/2017   undefined
 2   02/23/2018   undefined
 2   03/04/2018   23
 2   04/04/2018   23

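I want to group this data by id and sort it by date. Here is the part I can't figure out: for each sorted group, I want to check that the "undefined" values of customer_id are followed by a number. In the example above, id == 2 is the one I want to keep, because it has the "undefined" values first and only the number after them. The idea is that while customer_id is undefined the person is not a customer yet, and once they become a customer the value changes to the customer ID number. So in this case id == 1 is a bad record: I want to discard it and keep only id == 2.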

data %>% group_by(id) %>%
         arrange(date) %>%
         # <code that keeps only the ids whose 'undefined' customer_id rows
         #  all come first, followed only by numbers; in this example,
         #  only the id == 2 records should be kept>

Thanks.

r group-by dplyr
2 Answers
1
vote

You can try:

library(dplyr)

df %>%
  group_by(id) %>%
  filter(all(diff(row_number()[customer_id == 'undefined']) == 1) & customer_id[n()] != 'undefined')

Output:

# A tibble: 4 x 3
# Groups:   id [1]
     id date       customer_id
  <int> <fct>      <fct>      
1     2 01/14/2017 undefined  
2     2 02/23/2018 undefined  
3     2 03/04/2018 23         
4     2 04/04/2018 23     

This code assumes that your data frame is already sorted by date. Otherwise:

df %>%
  # sort chronologically within each id; the dates are m/d/Y strings, so parse them first
  arrange(id, as.Date(date, "%m/%d/%Y")) %>%
  group_by(id) %>%
  filter(all(diff(row_number()[customer_id == 'undefined']) == 1) &
           customer_id[n()] != 'undefined')

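Basically, for each group we check whether the differences between the row numbers of the undefined cases are always 1 (i.e. they are consecutive), and whether the last value is not undefined.

The groups that satisfy both conditions are kept (id 2 in your case).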
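To see the check in isolation, here is a minimal sketch on plain vectors, using base R only. The helper name `undefined_block_ok` and the sample vectors are made up for illustration, and `seq_along()` stands in for `row_number()` within a single group.

```r
# Hypothetical helper: the same condition as the filter() above, applied to
# one group's customer_id values (assumed to be sorted by date already).
undefined_block_ok <- function(customer_id) {
  pos <- seq_along(customer_id)[customer_id == "undefined"]  # positions of 'undefined'
  consecutive <- all(diff(pos) == 1)                         # TRUE if they form one block
  ends_defined <- customer_id[length(customer_id)] != "undefined"
  consecutive && ends_defined
}

id1 <- c("undefined", "12", "12", "undefined")
id2 <- c("undefined", "undefined", "23", "23")

undefined_block_ok(id1)  # FALSE: positions 1 and 4 are not consecutive
undefined_block_ok(id2)  # TRUE: positions 1 and 2, and the last value is a number
```

Note that this condition also accepts a group such as `c("12", "undefined", "23")`, where the single 'undefined' block sits in the middle. If such groups can occur in your data, an additional check on the first value would be needed.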


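1
vote

You can do this by checking that the run-length encoding of the customer_id column has a length of 2 or 1, which means the id does not switch between defined and undefined more than once: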

data <- read.table(text="id  date        customer_id
                   1   02/03/2018   undefined
                   1   04/23/2018   12
                   1   05/22/2018   12
                   1   06/25/2018   undefined
                   2   01/14/2017   undefined
                   2   02/23/2018   undefined
                   2   03/04/2018   23
                   2   04/04/2018   23", header = T, stringsAsFactors=F)

data$date <- as.Date(data$date, "%m/%d/%Y")
# "undefined" is not a number, so as.integer() turns it into NA (with a coercion warning)
data$customer_id <- as.integer(data$customer_id)

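library(dplyr)  # for %>%

data %>%
    dplyr::group_by(id) %>%
    dplyr::arrange(date, .by_group = TRUE) %>%
    # keep a group only if it has at most two runs and ends on a valid id
    dplyr::filter(length(rle(is.na(customer_id))$values) < 3 && !is.na(tail(customer_id, 1)))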

# A tibble: 4 x 3
# Groups:   id [1]
     id date       customer_id
  <int> <date>           <int>
1     2 2017-01-14          NA
2     2 2018-02-23          NA
3     2 2018-03-04          23
4     2 2018-04-04          23

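Note that you also have to make sure the last item of each group is valid; otherwise, groups that go from a valid ID to undefined would pass the test.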
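A small base-R sketch of why both conditions are needed. The vectors and the helper names `runs` and `keep` are invented for illustration, with NA playing the role of 'undefined' as in the code above.

```r
# Run-length encoding of the NA pattern: one entry in $values per run of equal values.
good    <- c(NA, NA, 23L, 23L)   # undefined first, then a customer id
flipped <- c(23L, 23L, NA, NA)   # customer id first, then undefined
mixed   <- c(NA, 12L, 12L, NA)   # undefined, id, then undefined again

# Hypothetical helper: number of runs in the NA / non-NA pattern.
runs <- function(x) length(rle(is.na(x))$values)

runs(good)     # 2
runs(flipped)  # 2, so the run count alone cannot tell it apart from `good`
runs(mixed)    # 3

# Hypothetical helper: run count plus the check on the last element.
keep <- function(x) runs(x) < 3 && !is.na(tail(x, 1))

keep(good)     # TRUE
keep(flipped)  # FALSE
keep(mixed)    # FALSE
```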
