How do I extract sentences containing specific text from a spreadsheet?


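I have a spreadsheet that looks like this [image]. I want to keep the File column but extract only the sentences that contain the word "India". Is there a way to do this? I would prefer KNIME or R, but I am happy with any solution.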

Extract only the sentences containing "India", but keep the File column.

r text-extraction tagging sentence knime
1 Answer

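This can be done with filter() from the dplyr package together with str_detect() from the stringr package. Note that the pattern in the code below is a regular-expression alternation, so it catches both "India" and the uncapitalized "india" (if present). Also note that it keeps or drops whole rows of the Text column; it does not split the text into individual sentences: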

library(dplyr)
library(stringr)

# Some example data
df <- data.frame(File = c(1356, 1548, 1600, 1601),
                 Text = c("Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i",
                          "The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti",
                          "Some other text",
                          "This string has india without a capital I."))

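# Keep only the rows whose Text mentions India. Spaces inside a regex are
# matched literally, so the alternatives are written without spaces around "|".
df <- df %>%
  filter(str_detect(Text, "India|india"))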

df
#   File   Text
# 1 1356   Digital India is an initiative by the Government of India to ensure that Government services are made available to citizens electronically by improving online infrastructure and by i
# 2 1548   The textile industry in India traditionally, after agriculture, is the only industry that has generated huge employment for both skilled and unskilled labour. The textile industry conti
# 3 1601   This string has india without a capital I.
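
The question asks for sentences rather than whole rows, so here is a minimal sketch of one way to go a step further, assuming a sentence ends at ".", "!" or "?" followed by whitespace. The sample data, the names `sentences`, `long` and `india_sentences`, and the splitting rule are illustrative assumptions, not part of the original answer. This naive rule will mis-split abbreviations such as "e.g."; a dedicated sentence tokenizer would be more robust.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data with several sentences per cell
df <- data.frame(
  File = c(1356, 1600),
  Text = c("Digital India is an initiative. It improves online infrastructure. Services in India are now electronic.",
           "Some other text."),
  stringsAsFactors = FALSE
)

# Split each cell at whitespace that follows ., ! or ? (assumed sentence rule)
sentences <- str_split(df$Text, "(?<=[.!?])\\s+")

# One row per sentence, repeating the File value for each of its sentences
long <- data.frame(
  File = rep(df$File, lengths(sentences)),
  Sentence = unlist(sentences),
  stringsAsFactors = FALSE
)

# Keep sentences containing the whole word "india", ignoring case
india_sentences <- long %>%
  filter(str_detect(Sentence, regex("\\bindia\\b", ignore_case = TRUE)))

india_sentences
```

With this sample data, the result keeps the first and third sentences of file 1356 and drops everything else. The `\\b` word boundaries prevent matches inside longer words such as "Indiana", and `ignore_case = TRUE` replaces the explicit "India|india" alternation.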