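I want to filter a data frame with 1212 rows so that it contains only the samples listed in a separate list. The list has multiple values, and I don't know how to do this.
The df below is called RNASeq2: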
RNASeq2Norm_samples Substrng_RNASeq2Norm
1 TCGA-3C-AAAU-01A-11R-A41B-07 TCGA.3C.AAAU
2 TCGA-3C-AALI-01A-11R-A41B-07 TCGA.3C.AALI
3 TCGA-3C-AALJ-01A-31R-A41B-07 TCGA.3C.AALJ
4 TCGA-3C-AALK-01A-11R-A41B-07 TCGA.3C.AALK
5 TCGA-4H-AAAK-01A-12R-A41B-07 TCGA.4H.AAAK
6 TCGA-5L-AAT0-01A-12R-A41B-07 TCGA.5L.AAT0
7 TCGA-5L-AAT1-01A-12R-A41B-07 TCGA.5L.AAT1
8 TCGA-5T-A9QA-01A-11R-A41B-07 TCGA.5T.A9QA
.
.
.
1212
The list is intersect_samples:
intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097
I have tried this code, but it returns all of the original 1212 samples:
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]
I also tried
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]
which returns the correct row.
str(RNASeq2)
'data.frame': 1212 obs. of 2 variables:
$ RNASeq2 : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...
str(intersect_samples)
chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...
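AFAIK R does not provide a convenience function for finding a vector of search strings in a vector of strings using partial matching ("substrings").
`%in%` is not the right function if you want to find substrings within strings, because it only compares whole strings.
Instead, use base R's `grepl` or the excellent (and possibly faster) `stri_detect_fixed` function from the `stringi` package.
Please note that, to make this easier to follow, I have abstracted the code and data instead of using your code and data.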
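To make the difference concrete, here is a minimal base R sketch. The vectors `ids` and `wanted` are made-up toy values, not the data from the question. It contrasts whole-string matching with `%in%` against literal substring matching with `grepl(..., fixed = TRUE)`.

```r
# Toy data, invented for illustration only
ids    <- c("TCGA-3C-AAAU-01A", "TCGA-3C-AALI-01A", "TCGA-4H-AAAK-01A")
wanted <- c("TCGA-3C-AAAU", "TCGA-4H-AAAK")

# %in% compares whole strings, so a prefix never matches a longer ID
whole_match <- ids %in% wanted

# grepl() with fixed = TRUE looks for a literal substring; it takes one
# pattern at a time, so loop over the patterns and combine with "|"
sub_match <- Reduce(`|`, lapply(wanted, function(p) grepl(p, ids, fixed = TRUE)))

print(whole_match)
print(sub_match)
```

Here `whole_match` is `FALSE` everywhere because no element of `ids` is exactly equal to an element of `wanted`, while `sub_match` flags the first and third IDs.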
library(stringi)
pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
string = c("123", "234", "345", "xyz"),
stringsAsFactors = FALSE)
# row_num string
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 xyz
string <- data$string # the column that contains the values to be filtered
# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)
selected # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1] TRUE TRUE FALSE FALSE
#
# [[2]]
# [1] FALSE FALSE TRUE FALSE
#
# [[3]]
# [1] FALSE FALSE FALSE FALSE
# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: `NA`s propagate: FALSE | NA is NA (only TRUE | NA is TRUE), so an NA input string yields NA
selected.rows <- Reduce("|", selected)
# [1] TRUE TRUE TRUE FALSE
# Use the logical vector as a selector (TRUE keeps the element, FALSE drops it):
string[selected.rows]
# [1] "123" "234" "345"
# The same vector selects the matching rows of the data frame:
data[selected.rows, ]
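If you would rather avoid the `stringi` dependency, the same idea can be sketched in base R with `grepl`. This is only one possible variant under the same toy data; the extra `NA` row and the `hits` and `filtered` names are my additions for illustration. Since `grepl` returns `FALSE` for missing input, the `NA` row is simply dropped here instead of turning the selector into `NA`.

```r
pattern <- c("23", "45", "999")
data <- data.frame(row_num = 1:5,
                   string = c("123", "234", "345", "xyz", NA),
                   stringsAsFactors = FALSE)

# One logical column per pattern (rows = strings, columns = patterns)
hits <- vapply(pattern,
               function(p) grepl(p, data$string, fixed = TRUE),
               logical(nrow(data)))

# A row is selected if at least one pattern matched
selected.rows <- rowSums(hits) > 0

# Keep only the matching rows of the data frame
filtered <- data[selected.rows, ]
print(filtered)
```

Collecting the per-pattern results in a matrix and using `rowSums` is an alternative to `Reduce("|", ...)`; it also makes it easy to see which pattern matched which row by inspecting `hits`.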