Lemmatizing words is not working properly


I am trying to do some text mining. The main goal is to work with the words in the data.frame below, but grouping together the ones that share a common root:

+-------------+------+
|    word     | freq |
+-------------+------+
| best        |  897 |
| see         |  768 |
| received    |  701 |
| questions   |  686 |
| contact     |  663 |
| use         |  659 |
| seat        |  643 |
| information |  640 |
| shipping    |  617 |
| help        |  589 |
| want        |  577 |
| discount    |  549 |
| purchase    |  545 |
| code        |  528 |
| team        |  524 |
| sale        |  503 |
| unsubscribe |  460 |
| website     |  426 |
| love        |  414 |
| buy         |  399 |
| ’m          |  394 |
| furniture   |  388 |
| return      |  387 |
| privacy     |  385 |
| looking     |  383 |
| customer    |  382 |
| receive     |  380 |
| fabric      |  375 |
| interested  |  370 |
| delivery    |  348 |
| intended    |  322 |
| ship        |  322 |
| financing   |  314 |
| •           |  314 |
+-------------+------+

The clearest example is received and receive. I would like the final result to look like this:

+----------+------+
|   word   | freq |
+----------+------+
| best     |  897 |
| see      |  768 |
| received | 1081 |
+----------+------+

So received and receive are collapsed into one entry and their frequencies are summed. Also, how can I clean up entries like ’m?

r text-mining tm lemmatization
1 Answer

Personally, I would suggest using a different lemmatizer. For example, the one provided by spaCy can be used from R via spacyr:

# install.packages("spacyr")
library("spacyr")
# install spacy if running for first time
# spacy_install()
spacy_initialize()
spacy_parse("received and receive")

  doc_id sentence_id token_id    token   lemma   pos entity
1  text1           1        1 received receive  VERB       
2  text1           1        2      and     and CCONJ       
3  text1           1        3  receive receive  VERB       
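To produce the aggregated table asked for in the question, the lemmas can be joined back onto the frequency table and the frequencies summed. A minimal sketch, assuming the data frame is named df with columns word and freq, and that each entry tokenizes to exactly one token (so the parse output lines up row-for-row with df):

# keep only plain alphabetic entries; this drops artifacts like "’m" and "•"
df <- df[grepl("^[a-z]+$", df$word), ]

# parse each word and attach its lemma
parsed <- spacy_parse(df$word, pos = FALSE, entity = FALSE)
df$lemma <- parsed$lemma

# sum frequencies of words that share a lemma, most frequent first
result <- aggregate(freq ~ lemma, data = df, FUN = sum)
result <- result[order(-result$freq), ]

With the question's data, received (701) and receive (380) would both lemmatize to receive and collapse into one row with freq 1081; whether it is displayed as receive or received is then just a labeling choice.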