How do I remove both Roman and Arabic numerals in TermDocumentMatrix()?

Question

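In TermDocumentMatrix(), the argument removeNumbers=TRUE removes Arabic numerals from an English corpus. How can I remove Roman numerals (for example "iii", "xiv", and "xiii", in any letter case) as well as Arabic numerals? What custom function can I supply to the removeNumbers argument to accomplish this?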

The code I am trying to understand and modify is below. The analysis shows that Roman numerals, such as "iii" and "xii", still remain.

library(gutenbergr)
library(stringr)
library(dplyr)
library(tidyr)

library(tm)
library(topicmodels)
library(tidyverse)
library(tidytext)
library(slam)

titles = c("Wuthering Heights", "A Tale of Two Cities",
  "Alice's Adventures in Wonderland", "The Adventures of Sherlock Holmes")

##read in those books
books = gutenberg_works(title %in% titles) %>%
  gutenberg_download(meta_fields = "title") %>% 
  mutate(document = row_number())

create_chapters = books %>% 
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("\\bchapter\\b", ignore_case = TRUE)))) %>% 
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter) 

by_chapter = create_chapters %>% 
  group_by(document) %>% 
  summarise(text=paste(text,collapse=' '))

import_corpus = Corpus ( VectorSource (by_chapter$text))

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]

import_mat = DocumentTermMatrix (import_corpus,
  control = list (stemming = TRUE, #create root words
  stopwords = TRUE, #remove stop words
  minWordLength = 3, #cut out small words
  removeNumbers = no_romans, #take out the numbers
  removePunctuation = TRUE)) #take out punctuation

Answer 1

Try these options.

> st = import_mat$dimnames$Term
> st[grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(st))]
 [1] "cli"    "iii"    "mix"    "vii"    "viii"   "xii"    "xiii"   "xiv"   
 [9] "xix"    "xvi"    "xvii"   "xviii"  "xxi"    "xxii"   "xxiii"  "xxiv"  
[17] "xxix"   "xxv"    "xxvi"   "xxvii"  "xxviii" "xxx"    "xxxi"   "xxxii" 
[25] "xxxiii" "xxxiv"

One of Gregor's caveats, "I", does not appear to be present, so we need not worry about it for now. Another of Gregor's caveats is the word "mix", which is both a legitimate word and a Roman numeral. A small sample corpus to test against:

library(tm)
dat <- VCorpus(VectorSource(c("iv. Chapter Four", "I really want to discuss the proper mix of 17 ingredients.", "Nothing to remove here.")))
inspect( DocumentTermMatrix(dat) )
# <<DocumentTermMatrix (documents: 3, terms: 13)>>
# Non-/sparse entries: 13/26
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. mix nothing proper really
#    1       1       0    1     0            0   1   0       0      0      0
#    2       0       1    0     0            1   0   1       0      1      1
#    3       0       0    0     1            0   0   0       1      0      0

A basic function to remove simple/whole Roman numerals could be:

no_romans <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans)) )
# <<DocumentTermMatrix (documents: 3, terms: 12)>>
# Non-/sparse entries: 12/24
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. iv. nothing proper really remove
#    1       1       0    1     0            0   1       0      0      0      0
#    2       0       1    0     0            1   0       0      1      1      0
#    3       0       0    0     1            0   0       1      0      0      1

This removes "mix" but leaves "iv.". If you need to remove that as well, then perhaps:

no_romans2 <- function(s) s[!grepl("^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$", toupper(s))]
inspect( DocumentTermMatrix(dat, control = list(removeNumbers = no_romans2)) )
# <<DocumentTermMatrix (documents: 3, terms: 11)>>
# Non-/sparse entries: 11/22
# Sparsity           : 67%
# Maximal term length: 12
# Weighting          : term frequency (tf)
# Sample             :
#     Terms
# Docs chapter discuss four here. ingredients. nothing proper really remove the
#    1       1       0    1     0            0       0      0      0      0   0
#    2       0       1    0     0            1       0      1      1      0   1
#    3       0       0    0     1            0       1      0      0      1   0

(The only difference is the [.]? added at the end of the regex.)

(As an aside: one could use grepl(..., ignore.case=TRUE) to get the same effect as the toupper(s) used here. In small-sample testing it is slightly slower, but the effect is the same.)
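
The question asks for Arabic and Roman numerals to be removed together, and it also has to cope with real words such as "mix" that happen to be valid Roman numerals. The following is a minimal sketch of one way to combine the two, using only base R on a character vector of tokens. The names roman_pattern, remove_all_numbers, and the keep argument are hypothetical and appear in neither the question nor the answer; the regex is the one from no_romans2.

```r
# Hypothetical helper names; only base R is used here.
# Same pattern as no_romans2: a whole-token Roman numeral, optionally followed by ".".
roman_pattern <- "^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})[.]?$"

remove_all_numbers <- function(s, keep = character()) {
  # Flag whole-token Roman numerals in any letter case, unless whitelisted in keep.
  is_roman <- grepl(roman_pattern, toupper(s)) & !(tolower(s) %in% tolower(keep))
  s <- s[!is_roman]
  # Strip Arabic digits, then drop tokens that became empty.
  s <- gsub("[[:digit:]]+", "", s)
  s[nzchar(s)]
}

tokens <- c("iii", "xiv", "XIII", "chapter", "17", "mix", "iv.", "remove")
remove_all_numbers(tokens)                # drops the numerals, "17", and "mix"
remove_all_numbers(tokens, keep = "mix")  # same, but "mix" survives
```

Assuming the function is passed the token vector in the same way as no_romans above, it could be supplied as control = list(removeNumbers = remove_all_numbers); that hookup is not verified here. Note that the pattern also matches an empty string, so empty tokens are dropped as a side effect.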
