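I have a data.frame with character data in one of its columns. I want to filter the data.frame on multiple values from that same column. Is there a simple way to do this that I'm missing?
Example: the data.frame is named dat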
days name
88 Lynn
11 Tom
2 Chris
5 Lisa
22 Kyla
1 Tom
222 Lynn
2 Lynn
I want to filter for Tom and Lynn, for example. When I do:
target <- c("Tom", "Lynn")
filt <- filter(dat, name == target)
I get this error:
longer object length is not a multiple of shorter object length
You need %in% instead of ==:
library(dplyr)
target <- c("Tom", "Lynn")
filter(dat, name %in% target) # equivalently, dat %>% filter(name %in% target)
which produces
days name
1 88 Lynn
2 11 Tom
3 1 Tom
4 222 Lynn
5 2 Lynn
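Since "filter out" can also mean dropping those rows, here is a minimal sketch of both directions. It assumes the sample data from the question, rebuilt inline so the snippet runs on its own; the names kept and dropped are only illustrative.

```r
library(dplyr)

# Sample data from the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
)
target <- c("Tom", "Lynn")

# Keep only the rows whose name is in target.
kept <- filter(dat, name %in% target)

# Drop those rows instead: negate the membership test.
# `!` binds more loosely than `%in%`, so this reads as !(name %in% target).
dropped <- filter(dat, !name %in% target)

dropped$name
# [1] "Chris" "Lisa"  "Kyla"
```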
To understand why, consider what happens here:
dat$name == target
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Basically, we are recycling the length-two vector target four times to match the length of dat$name. In other words, we are doing:
Lynn == Tom
Tom == Lynn
Chris == Tom
Lisa == Lynn
... continue repeating Tom and Lynn until end of data frame
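With the sample we don't get an error, because it has 8 rows, an exact multiple of the length of target, so recycling works. I suspect your actual data frame has a number of rows that does not allow clean recycling. If the sample had an odd number of rows, I would have gotten the same error as you. But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:
return TRUE for every value at an odd position that equals "Tom", or every value at an even position that equals "Lynn".
It so happens that the last value in the sample data frame is at an even position and equals "Lynn", hence the single TRUE above.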
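The recycling can be made visible by doing it by hand. This is a small sketch using the sample data from the question; the variable recycled is introduced here purely for illustration.

```r
# Sample data from the question (8 rows).
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# What `==` does implicitly: repeat target until it is as long as dat$name.
recycled <- rep(target, length.out = nrow(dat))
recycled
# [1] "Tom"  "Lynn" "Tom"  "Lynn" "Tom"  "Lynn" "Tom"  "Lynn"

# Comparing against the explicitly recycled vector gives the same answer,
# so only position 8 ("Lynn" against "Lynn") is TRUE.
identical(dat$name == target, dat$name == recycled)
# [1] TRUE
```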
By contrast, dat$name %in% target says:
for each value in dat$name, check whether it exists in target.
That is very different. Here is the result:
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE
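One way to see why %in% behaves like this: it is a lookup with match() turned into TRUE/FALSE. The sketch below uses the names from the sample data as a plain vector.

```r
name <- c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
target <- c("Tom", "Lynn")

# Position of each name within target, or 0 if it does not appear there.
match(name, target, nomatch = 0)
# [1] 2 1 0 0 0 1 2 2

# %in% gives the same answer as asking whether that position is non-zero.
identical(name %in% target, match(name, target, nomatch = 0) > 0)
# [1] TRUE
```

Because every element of name is looked up independently, the lengths of the two vectors do not need to be related, and no recycling takes place.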
Note that your problem has nothing to do with dplyr; it is just a misuse of ==.
Using the base package:
df <- data.frame(days = c(88, 11, 2, 5, 22, 1, 222, 2), name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"))
# Three lines
target <- c("Tom", "Lynn")
index <- df$name %in% target
df[index, ]
# One line
df[df$name %in% c("Tom", "Lynn"), ]
Output:
days name
1 88 Lynn
2 11 Tom
6 1 Tom
7 222 Lynn
8 2 Lynn
Using sqldf:
library(sqldf)
# Two alternatives:
# String literals use single quotes, the standard SQL form.
sqldf("SELECT *
       FROM df
       WHERE name = 'Tom' OR name = 'Lynn'")
sqldf("SELECT *
       FROM df
       WHERE name IN ('Tom', 'Lynn')")
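This can be achieved with the dplyr package, which is available on CRAN. A simple way to do it: install the dplyr package, then run the code below.
library(dplyr)
df <- select(filter(dat, name == 'Tom' | name == 'Lynn'), c('days', 'name'))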
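Explanation:
Once dplyr is installed, we create a new data frame using two different functions from the package:
filter: the first argument is the data frame; the second argument is the condition by which we want it subsetted. The result is the entire data frame with only the rows we wanted.
select: the first argument is the data frame; the second argument is the names of the columns we want to select from it. We don't have to use the names() function, and we don't even need quotation marks. We simply list the column names as objects.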
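The same two steps can be written as a pipeline with unquoted column names. This is a sketch on the sample data from the question, rebuilt inline; the name result is only illustrative.

```r
library(dplyr)

# Sample data from the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
)

# filter() keeps rows, select() keeps columns; column names are unquoted.
result <- dat %>%
  filter(name == "Tom" | name == "Lynn") %>%
  select(days, name)

nrow(result)
# [1] 5
```

With more than a couple of names, chaining | comparisons gets long, which is where name %in% c(...) is usually easier to read.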
by_type_year_tag_filtered <- by_type_year_tag %>%
  dplyr::filter(tag_name %in% c("dplyr", "ggplot2"))
Write it like this. Example:
library(dplyr)
target <- YourData %>% filter(YourColum %in% c("variable1", "variable2"))
Example with your data:
target <- df %>% filter(name %in% c("Tom", "Lynn"))
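If your string column holds long strings as values, you can use this powerful method with the stringr package. It matches on part of a string, which filter( %in% ) and the base R indexing from the other answers cannot do.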
library(dplyr)
library(stringr)
sentences_tb = as_tibble(sentences) %>%
mutate(row_number())
sentences_tb
# A tibble: 720 x 2
value `row_number()`
<chr> <int>
1 The birch canoe slid on the smooth planks. 1
2 Glue the sheet to the dark blue background. 2
3 Its easy to tell the depth of a well. 3
4 These days a chicken leg is a rare dish. 4
5 Rice is often served in round bowls. 5
6 The juice of lemons makes fine punch. 6
7 The box was thrown beside the parked truck. 7
8 The hogs were fed chopped corn and garbage. 8
9 Four hours of steady work faced us. 9
10 Large size in stockings is hard to sell. 10
# ... with 710 more rows
matching_letters <- c(
"canoe","dark","often","juice","hogs","hours","size"
)
matching_letters <- str_c(matching_letters, collapse = "|")
matching_letters
[1] "canoe|dark|often|juice|hogs|hours|size"
letters_found <- str_subset(sentences_tb$value,matching_letters)
letters_found_tb = as_tibble(letters_found)
inner_join(sentences_tb,letters_found_tb)
# A tibble: 16 x 2
value `row_number()`
<chr> <int>
1 The birch canoe slid on the smooth planks. 1
2 Glue the sheet to the dark blue background. 2
3 Rice is often served in round bowls. 5
4 The juice of lemons makes fine punch. 6
5 The hogs were fed chopped corn and garbage. 8
6 Four hours of steady work faced us. 9
7 Large size in stockings is hard to sell. 10
8 Note closely the size of the gas tank. 33
9 The bark of the pine tree was shiny and dark. 111
10 Both brothers wear the same size. 253
11 The dark pot hung in the front closet. 261
12 Grape juice and water mix well. 383
13 The wall phone rang loud and often. 454
14 The bright lanterns were gay on the dark lawn. 476
15 The pleasant hours fly by much too soon. 516
16 A six comes up more often than a ten. 609
Comparing with the accepted answers:
> target <- c("canoe","dark","often","juice","hogs","hours","size")
> filter(sentences_tb, value %in% target)
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
> df<- select(filter(sentences_tb,value=='canoe'| value=='dark'), c('value','row_number()'))
> df
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
> target <- c("canoe","dark","often","juice","hogs","hours","size")
> index <- sentences_tb$value %in% target
> sentences_tb[index, ]
# A tibble: 0 x 2
# ... with 2 variables: value <chr>, row_number() <int>
With those approaches you would need to write out every full sentence to get the desired result.
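A shorter route to the same kind of partial match is to put the pattern test directly inside filter(), which avoids the intermediate tibble and the join. This is a sketch, not the answer author's code: it assumes the sentences vector that ships with stringr, and the names words, pattern, hits and hits_base are illustrative.

```r
library(dplyr)
library(stringr)

words <- c("canoe", "dark", "often", "juice", "hogs", "hours", "size")
pattern <- str_c(words, collapse = "|")  # "canoe|dark|often|..."

sentences_tb <- tibble(value = sentences) %>%
  mutate(row = row_number())

# Keep rows whose sentence contains any of the words as a substring.
hits <- sentences_tb %>%
  filter(str_detect(value, pattern))

# Base R equivalent: grepl() returns the same logical index.
hits_base <- sentences_tb[grepl(pattern, sentences_tb$value), ]
```

Note that this is substring matching, so "dark" would also match a word such as "darkness"; wrapping each word in \\b word boundaries is one way to tighten it if that matters.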
Another option is to use slice and which to get the indices of the values to filter. Here is some reproducible code:
library(dplyr)
df %>%
slice(which(name %in% c("Tom", "Lynn")))
#> days name
#> 1 88 Lynn
#> 2 11 Tom
#> 3 1 Tom
#> 4 222 Lynn
#> 5 2 Lynn
Created on 2023-05-05 with reprex v2.0.2
Data used:
df = read.table(text = "days name
88 Lynn
11 Tom
2 Chris
5 Lisa
22 Kyla
1 Tom
222 Lynn
2 Lynn", header = TRUE)