Stemming function in R


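The corpus package supports custom stemming functions. Given a single term as input, the stemming function should return the stem of that term as output.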

I took the example below from Stemming Words; it uses the hunspell dictionary to do the stemming.

First, I define the sentences on which to test the function:

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

The custom stemming function is:

stem_hunspell <- function(term) {
  # look up the term in the dictionary
  stems <- hunspell::hunspell_stem(term)[[1]]

  if (length(stems) == 0) { # if there are no stems, use the original term
    stem <- term
  } else { # if there are multiple stems, use the last one
    stem <- stems[[length(stems)]]
  }

  stem
}

This code

library(corpus)
sentences <- text_tokens(sentences, stemmer = stem_hunspell)

produces:

> sentences
[[1]]
[1] "the"        "color"      "blue"       "neutralize" "orange"     "yellow"    
[7] "reflection" "."         

[[2]]
[1] "zod"        "stabbed"    "me"         "with"       "blue"       "kryptonite"
[7] "."         

[[3]]
[1] "because"   "blue"      "i"         "your"      "favourite" "colour"   
[7] "."        

[[4]]
[1] "re"    "i"     "wrong" ","     "blue"  "i"     "right" "."    

[[5]]
[1] "you"         "and"         "i"           "are"         "go"         
[6] "to"          "yellowstone" "."          

[[6]]
[1] "van"    "gogh"   "look"   "for"    "some"   "yellow" "at"     "sunset" "."     

[[7]]
[1] "you"       "ruin"      "my"        "beautiful" "green"     "dress"    
[7] "."        

[[8]]
[1] "you"   "do"    "not"   "agree" "."    

[[9]]
[1] "there"   "nothing" "wrong"   "with"    "green"   "." 

After stemming, I would like to apply other operations to the text, such as removing stop words. However, when I apply the tm function

removeWords(sentences,stopwords)

to my sentences, I get the following error:

Error in UseMethod("removeWords", x) : 
 no applicable method for 'removeWords' applied to an object of class "list"

If I use

unlist(sentences)

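I do not get the desired result, because I end up with a single chr vector of 65 elements and the sentence boundaries are lost. The desired result should be (for example, for the first sentence):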

"the color blue neutralize orange yellow reflection."       

1 Answer

If you want to remove stop words from each sentence, you can apply tm's removeWords to every list element with lapply:

lapply(sentences, removeWords, stopwords())

#[[1]]
#[1] ""           "color"      "blue"       "neutralize" "orange"     "yellow"     "reflection" "."         

#[[2]]
#[1] "zod"        "stabbed"    ""           ""           "blue"       "kryptonite" "."  
#...
#...
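
This works element by element because each list element is a plain character vector, which removeWords has a method for, whereas the list itself has none. Note that the removed words are left behind as empty strings. Below is a minimal base-R sketch of dropping them instead. It assumes a small hand-written stop-word vector (`my_stopwords`, a hypothetical name) in place of `stopwords()`, and a two-sentence stand-in (`tokens`) for the tokenized `sentences`:

```r
# Stand-in for the tokenized output shown in the question (first two sentences only).
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)

# Hypothetical stop-word list; with tm loaded, stopwords() could be used instead.
my_stopwords <- c("the", "me", "with")

# Keep only the tokens that are not stop words, one sentence at a time.
# This drops the stop words entirely rather than leaving "" in their place.
filtered <- lapply(tokens, function(x) x[!x %in% my_stopwords])
filtered
```

The result is still a list with one character vector per sentence, so the sentence boundaries are preserved.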

However, judging from your expected output, it looks like you want to paste the tokens of each sentence back together:

lapply(sentences, paste0, collapse = " ")
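
Two details may matter here: lapply returns a list, and collapsing with " " leaves a space in front of punctuation tokens such as the final period. Below is a small base-R sketch, assuming a character vector with one string per sentence is wanted. The `gsub` pattern for re-attaching punctuation is an assumption added here, not part of the approach above, and `tokens` is a stand-in for the tokenized `sentences`:

```r
# Stand-in for two of the tokenized sentences from the question.
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("you", "do", "not", "agree", ".")
)

# vapply returns a character vector (one string per sentence) instead of a list.
joined <- vapply(tokens, paste, character(1), collapse = " ")

# Remove the space that the collapse step puts in front of punctuation tokens.
joined <- gsub(" ([.,!?])", "\\1", joined)
joined
```

If stop words were removed with removeWords beforehand, dropping the resulting empty strings before pasting avoids runs of double spaces.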