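I have a spreadsheet that looks like this. I want to keep the File column but extract only the sentences that contain the word "India". Is there a way to do this? I would prefer KNIME or R, but I am happy with any solution.
Extract only sentences with "India" but keep the File column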
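This can be done with filter() from the dplyr package and str_detect() from the stringr package. filter() keeps every row whose Text value matches the pattern, so the File column is preserved. Note that the alternation pattern in the code below catches both "India" and the incorrectly lowercased "india" (if present):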
library(dplyr)
library(stringr)
# Some example data
df <- data.frame(File = c(1356, 1548, 1600, 1601),
Text = c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i",
"The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti",
"Some other text",
"This string has india without a capital I."))
df <- df %>%
  # No spaces around "|": inside a regex they would be matched literally
  filter(str_detect(Text, "India|india"))
df
# File Text
# 1 1356 Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i
# 2 1548 The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti
# 3 1601 This string has india without a capital I.
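The filter above keeps whole rows, while the question asks for only the sentences that mention India. A minimal sketch of one way to do that follows. It assumes sentences end in ".", "!" or "?" followed by whitespace, which will split incorrectly on abbreviations such as "e.g.". The helper name keep_india_sentences and the small example data are hypothetical and used only for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical example data with several sentences per row
df <- data.frame(
  File = c(1356, 1600, 1601),
  Text = c("Digital India is an initiative. It covers many services. India benefits.",
           "Some other text.",
           "Nothing here. This string has india in it."),
  stringsAsFactors = FALSE
)

# Whole-word, case-insensitive match, so "Indian" or "Indiana" are not caught
india <- regex("\\bindia\\b", ignore_case = TRUE)

# Hypothetical helper: split one string into sentences and keep the matching ones
keep_india_sentences <- function(text) {
  # Assumption: a sentence ends at ., ! or ? followed by whitespace
  sentences <- str_split(text, "(?<=[.!?])\\s+")[[1]]
  hits <- sentences[str_detect(sentences, india)]
  paste(hits, collapse = " ")
}

result <- df %>%
  filter(str_detect(Text, india)) %>%  # drop rows with no mention at all
  mutate(Text = vapply(Text, keep_india_sentences, character(1), USE.NAMES = FALSE))

result
```

Matching sentences from the same file are pasted back into a single string here. If one row per sentence is preferred, the list returned by str_split() could instead be expanded with tidyr::unnest().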