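How do I test whether a vector contains duplicate elements?

Question · Votes: 7 · Answers: 5

How do you test whether a vector contains duplicate elements in R?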

r vector
5 Answers
17
votes

I think I found the answer. Use the duplicated() function:

a=c(3,5,7,2,7,9)
b=1:10
any(duplicated(a)) # TRUE
any(duplicated(b)) # FALSE
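
As a small sketch building on the answer above, the logical vector returned by duplicated() can also be used to see which values repeat. The variable names dup_flags and dup_values are only illustrative.

```r
a <- c(3, 5, 7, 2, 7, 9)

# duplicated() marks each element that already appeared earlier in the vector
dup_flags <- duplicated(a)
print(dup_flags)

# Values that occur more than once (each reported a single time)
dup_values <- unique(a[dup_flags])
print(dup_values)
```

Only the second and later occurrences are flagged, so the first 7 in a is FALSE and the second one is TRUE.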

4
votes

Also try rle(x) to find the run lengths of identical values in x.
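
A minimal sketch of the rle() idea follows. Note that rle() only reports runs of adjacent equal values, so this example assumes you sort the vector first when you want duplicates anywhere in it. The variable names are illustrative.

```r
x <- c(1, 2, 1, 3)

# On the unsorted vector, the two 1s are not adjacent, so every run has length 1
has_adjacent_dup <- any(rle(x)$lengths > 1)

# Sorting places equal values next to each other, so runs longer than 1 are duplicates
sorted_runs <- rle(sort(x))
has_any_dup <- any(sorted_runs$lengths > 1)
repeated_values <- sorted_runs$values[sorted_runs$lengths > 1]

print(has_adjacent_dup)
print(has_any_dup)
print(repeated_values)
```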


2
votes

If you are looking for consecutive duplicates, you can use diff:

a <- 1:10
b <- c(1:5, 5, 7, 8, 9, 10)
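diff(a)  # all differences are 1, so there are no consecutive duplicates
diff(b)  # contains a 0 where 5 is immediately repeated

Or, for duplicates anywhere in the vector:

length(a) == length(unique(a))  # TRUE: no duplicates
length(b) == length(unique(b))  # FALSE: b contains a duplicate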

0
votes

Check this:

> all(diff(c(1,2,3)))
[1] TRUE
Warning message:
In all(diff(c(1, 2, 3))) : coercing argument of type 'double' to logical
> all(diff(c(1,2,2,3)))
[1] FALSE
Warning message:
In all(diff(c(1, 2, 2, 3))) : coercing argument of type 'double' to logical

You can add a cast to get rid of the warning.
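
One possible way to avoid the warning, sketched here as an assumption rather than the answerer's own code, is to compare the differences against zero so that all() receives a logical vector. The example also sorts first, since diff() only compares neighbouring elements, and it applies to numeric vectors only.

```r
x <- c(1, 2, 4, 2, 3)

# diff(sort(x)) is 0 wherever two equal values sit next to each other;
# comparing with 0 yields a logical vector, so all() does not need to coerce
no_duplicates <- all(diff(sort(x)) != 0)
print(no_duplicates)

# The same check on a vector without repeats
no_duplicates_clean <- all(diff(sort(c(3, 1, 2))) != 0)
print(no_duplicates_clean)
```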


0
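votes

As Hadley mentioned in the comments:

anyDuplicated is somewhat faster for very long vectors, because it can stop as soon as it finds the first duplicate.

Example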

a=c(3,5,7,2,7,9)
b=1:10
anyDuplicated(a) != 0L # TRUE
anyDuplicated(b) != 0L # FALSE
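
The comparison with 0L works because anyDuplicated() returns the index of the first duplicated element, or 0 when there is none. A short sketch with illustrative variable names:

```r
a <- c(3, 5, 7, 2, 7, 9)
b <- 1:10

# Index of the first element that repeats an earlier value, or 0 if none
first_dup_a <- anyDuplicated(a)
first_dup_b <- anyDuplicated(b)

print(first_dup_a)     # position of the second 7
print(a[first_dup_a])  # the repeated value itself
print(first_dup_b)
```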

Benchmark with one million observations:

set.seed(2011)
x <- sample(1e7, size = 1e6, replace = TRUE)
bench::mark(
  ZNN = any(duplicated(x)),
  RL = length(x) != length(unique(x)),
  BUA = !all(diff(sort(x))),
  AD = anyDuplicated(x) != 0L
)

# A tibble: 4 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory            time     gc                
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>            <list>   <list>            
1 ZNN         64.62ms  70.04ms      11.5    11.8MB     0        8     0      693ms <lgl [1]> <df[,3] [2 x 3]>  <bch:tm> <tibble [8 x 3]>  
2 RL          66.95ms  70.67ms      12.5    15.4MB     0        7     0      561ms <lgl [1]> <df[,3] [3 x 3]>  <bch:tm> <tibble [7 x 3]>  
3 BUA         84.66ms  87.79ms      10.6      42MB     3.54     3     1      283ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [4 x 3]>  
4 AD           2.45ms   2.87ms     314.        8MB     5.98   105     2      335ms <lgl [1]> <df[,3] [1 x 3]>  <bch:tm> <tibble [107 x 3]>

Benchmark with 100 observations:

set.seed(2011)
x <- sample(1e7, size = 100, replace = TRUE)

bench::mark(
  ZNN = any(duplicated(x)),
  RL = length(x) != length(unique(x)),
  BUA = !all(diff(sort(x))),
  AD = anyDuplicated(x) != 0L
)

# A tibble: 4 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory            time     gc                   
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>            <list>   <list>               
1 ZNN          7.14us   8.93us    60429.    1.48KB     6.04  9999     1    165.5ms <lgl [1]> <df[,3] [2 x 3]>  <bch:tm> <tibble [10,000 x 3]>
2 RL           8.03us   9.37us    83754.    1.92KB     0    10000     0    119.4ms <lgl [1]> <df[,3] [3 x 3]>  <bch:tm> <tibble [10,000 x 3]>
3 BUA         54.89us  61.58us     8317.    4.83KB     6.74  3701     3      445ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [3,704 x 3]> 
4 AD            5.8us   6.69us   123838.    1.05KB     0    10000     0     80.8ms <lgl [1]> <df[,3] [1 x 3]>  <bch:tm> <tibble [10,000 x 3]>