Text mining in R: delete the first sentence of each document

Question

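I have several documents and do not need the first sentence of each one. So far I have not been able to find a solution.

Here is an example. The data is structured as follows: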

case_number	text
1	Today is a good day. It is sunny.
2	Today is a bad day. It is rainy.

So the result should look like this:

case_number	text
1	It is sunny.
2	It is rainy.

Here is a sample dataset:

case_number <- c(1, 2)

text <- c("Today is a good day. It is sunny.",
          "Today is a bad day. It is rainy.")

data <- data.frame(case_number, text)

Answer 1

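If the sentences may contain punctuation of their own (e.g. abbreviations or numbers) and you are using a text mining library anyway, it makes good sense to let that library handle the tokenization.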
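To illustrate why plain pattern matching is fragile here, the following is a minimal base R sketch. The helper `drop_first_naive()` is a hypothetical name introduced only for this illustration; it assumes a sentence ends at the first `.`, `!` or `?` that is followed by whitespace.

```r
# Hypothetical helper: remove everything up to and including the first
# ., ! or ? that is followed by whitespace
drop_first_naive <- function(x) sub("^[^.!?]*[.!?]\\s+", "", x)

simple <- "Today is a good day. It is sunny."
tricky <- "Today is a good day, above avg. for sure, by 5.1 points. It is sunny."

drop_first_naive(simple)
#> [1] "It is sunny."

# The abbreviation "avg." is mistaken for the end of the first sentence
drop_first_naive(tricky)
#> [1] "for sure, by 5.1 points. It is sunny."
```

The pattern works for the simple case but cuts the second string in the middle of its first sentence, which is the kind of input a sentence tokenizer is meant to handle.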

With {tidytext}:

library(dplyr)
library(tidytext)

# example with punctuation in the 1st sentence
data <- data.frame(case_number = c(1, 2),
                   text = c("Today is a good day, above avg. for sure, by 5.1 points. It is sunny.",
                            "Today is a bad day. It is rainy."))
# tokenize to sentences, converting tokens to lowercase is optional
data %>% 
  unnest_sentences(s, text)
#>   case_number                                                        s
#> 1           1 today is a good day, above avg. for sure, by 5.1 points.
#> 2           1                                             it is sunny.
#> 3           2                                      today is a bad day.
#> 4           2                                             it is rainy.

# drop 1st record of every case_number group
data %>% 
  unnest_sentences(s, text) %>% 
  filter(row_number() > 1, .by = case_number)
#>   case_number            s
#> 1           1 it is sunny.
#> 2           2 it is rainy.

Created on 2023-08-10 with reprex v2.0.2
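The output above is lowercased and has one row per sentence, whereas the question asks for one text per case. A possible extension, sketched under a few assumptions: the data frame `docs` and its third sentence are made up for illustration, `to_lower = FALSE` is used to keep the original capitalization, and the `.by` argument requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tidytext)

# Illustrative data (not from the question): case 2 has three sentences
docs <- data.frame(
  case_number = c(1, 2),
  text = c("Today is a good day. It is sunny.",
           "Today is a bad day. It is rainy. Bring an umbrella.")
)

result <- docs %>%
  # split into sentences, keeping the original capitalization
  unnest_sentences(s, text, to_lower = FALSE) %>%
  # drop the first sentence of every case
  filter(row_number() > 1, .by = case_number) %>%
  # paste the remaining sentences back into one string per case
  summarise(text = paste(s, collapse = " "), .by = case_number)

result
```

Note that a document consisting of a single sentence has no rows left after the filter step, so it would disappear from `result` entirely; if such cases need to be kept with an empty text, they would have to be joined back in afterwards.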
