R Tidymodels textrecipes - tokenizing with spacyR - how to remove punctuation from the resulting list of tokens

Question

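I want to tokenize my text using step_tokenize with the spacyr engine and then lemmatize it with step_lemma. After that, I want to remove, for example, punctuation from the list of tokens.

With the default tokenizers::tokenize_words, you can pass this option through the options list in step_tokenize().

However, my understanding is that with the spacyr engine step_tokenize calls spacy_parse on the backend, which does not offer such an option.

Is there a way to remove, e.g., punctuation or numeric tokens from the tokens produced after lemmatization with step_lemma()?

Reprex: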

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)

text = "It was a day, Tuesday. It wasn't Thursday!"

df <- tibble(text)

spacyr::spacy_initialize(entity = FALSE)

# Tokenize with the spacyr engine, then replace each token with its lemma
lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df %>% head(1)) %>%
  step_tokenize(text, engine = "spacyr") %>%
  step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL)

# Inspect the resulting tokens
lexicon_features_tokenized_lemmatised %>% pull(text) %>% textrecipes:::get_tokens()

Output: "it", "be", "a", "day", ",", "Tuesday", ".", "it", "be", "not", "Thursday", "!"

Desired output (with "!", "," and "." removed): "it", "be", "a", "day", "Tuesday", "it", "be", "not", "Thursday"
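
To pin down the target, the desired result amounts to dropping punctuation-only tokens (and, if wanted, digit-only tokens) from each document's token vector. A minimal base R sketch of that filter follows. The token vector is typed out by hand and is assumed to match what the reprex returns, and `drop_tokens()` is a hypothetical helper, not part of textrecipes or spacyr.

```r
# Tokens typed out by hand; assumed to match the lemmatized reprex output
# (one character vector per document)
tokens <- list(c("it", "be", "a", "day", ",", "Tuesday", ".",
                 "it", "be", "not", "Thursday", "!"))

# Hypothetical helper: drop tokens that consist only of punctuation
# characters or only of digits
drop_tokens <- function(token_list) {
  lapply(token_list, function(tok) {
    is_punct <- grepl("^[[:punct:]]+$", tok)
    is_digit <- grepl("^[[:digit:]]+$", tok)
    tok[!(is_punct | is_digit)]
  })
}

cleaned <- drop_tokens(tokens)
print(cleaned[[1]])
```

The helper works on plain character vectors, so it would apply to tokens extracted after `bake()` rather than inside the recipe itself.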

Answer 1

If you want to remove punctuation, you can use the strip_punct option in the step_tokenize function, as follows:

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)

text = "It was a day, Tuesday. It wasn't Thursday!"

df <- tibble(text)

spacyr::spacy_initialize(entity = FALSE)
#> successfully initialized (spaCy Version: 3.7.4, language model: en_core_web_sm)

lexicon_features_tokenized_lemmatised <-
  recipe(~ text, data = df%>%head(1)) %>%
  step_tokenize(text, 
                options = list(strip_punct = TRUE)) %>%
  #step_lemma(text) %>%
  prep() %>%
  bake(new_data = NULL) 

lexicon_features_tokenized_lemmatised %>% pull(text) %>% textrecipes:::get_tokens()
#> [[1]]
#> [1] "it"       "was"      "a"        "day"      "tuesday"  "it"       "wasn't"  
#> [8] "thursday"

Created on 2024-04-12 with reprex v2.0.2
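
The snippet above omits `engine = "spacyr"`, so it runs on the default tokenizers engine, and it leaves `step_lemma()` commented out. A version that keeps the spacyr engine may be closer to what the question asks. One possible sketch, assuming a textrecipes version that provides `step_pos_filter()` and a working spaCy installation, filters on the part-of-speech tags that the spacyr engine attaches to each token and then lemmatizes. The `all_pos` vector is an assumption: it is typed out by hand from the Universal POS tag set and is not exported by either package.

```r
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(spacyr)

spacyr::spacy_initialize(entity = FALSE)

df <- tibble(text = "It was a day, Tuesday. It wasn't Thursday!")

# Assumption: hand-typed Universal POS tags as reported by spaCy
all_pos <- c("ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
             "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
             "VERB", "X")

# Keep every listed tag except punctuation and numerals
keep_pos <- setdiff(all_pos, c("PUNCT", "NUM"))

rec <- recipe(~ text, data = df) %>%
  step_tokenize(text, engine = "spacyr") %>%        # tokens carry lemma and pos
  step_pos_filter(text, keep_tags = keep_pos) %>%   # drop PUNCT and NUM tokens
  step_lemma(text)                                  # swap tokens for lemmas

baked <- rec %>% prep() %>% bake(new_data = NULL)

baked %>% pull(text) %>% textrecipes:::get_tokens()
```

Filtering before `step_lemma()` is a deliberate choice here, since the filter relies on the tags recorded at tokenization time. Any token whose tag is missing from `keep_pos` is dropped, so the vector may need adjusting for other texts.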
