I'm trying to do some text mining. The main goal is to work with the words in the data.frame below, but to combine those that share a similar root:
+-------------+------+
| word | freq |
+-------------+------+
| best | 897 |
| see | 768 |
| received | 701 |
| questions | 686 |
| contact | 663 |
| use | 659 |
| seat | 643 |
| information | 640 |
| shipping | 617 |
| help | 589 |
| want | 577 |
| discount | 549 |
| purchase | 545 |
| code | 528 |
| team | 524 |
| sale | 503 |
| unsubscribe | 460 |
| website | 426 |
| love | 414 |
| buy | 399 |
| ’m | 394 |
| furniture | 388 |
| return | 387 |
| privacy | 385 |
| looking | 383 |
| customer | 382 |
| receive | 380 |
| fabric | 375 |
| interested | 370 |
| delivery | 348 |
| intended | 322 |
| ship | 322 |
| financing | 314 |
| • | 314 |
+-------------+------+
The best example is received and receive. I'd like the end result to look like this:
+----------+------+
| word | freq |
+----------+------+
| best | 897 |
| see | 768 |
| received | 1081 |
+----------+------+
So now received and receive are merged into one row, with their frequencies summed. Also, how can I clean up entries like ’m and •?
Personally, I'd suggest you use a different lemmatizer. For example, the one provided by spaCy can be used in R via spacyr:
# install.packages("spacyr")
library("spacyr")
# install spacy if running for first time
# spacy_install()
spacy_initialize()
spacy_parse("received and receive")
doc_id sentence_id token_id token lemma pos entity
1 text1 1 1 received receive VERB
2 text1 1 2 and and CCONJ
3 text1 1 3 receive receive VERB
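Applied to your frequency table, the idea is to lemmatize each word, then group by lemma and sum the frequencies; tokens that aren't plain words (like ’m and •) can be filtered out with a regex afterwards. A sketch, assuming your table is a data frame called `df` with columns `word` and `freq`:

```r
library("spacyr")
library("dplyr")

spacy_initialize()

# Parse each word; spacyr treats each element as its own document
# (doc_id "text1", "text2", ...), usually one token per word here
parsed <- spacy_parse(df$word)

# Take the first token's lemma per document, in original row order
lemmas <- parsed %>%
  group_by(doc_id) %>%
  summarise(lemma = first(lemma)) %>%
  arrange(as.integer(sub("text", "", doc_id)))

df %>%
  mutate(word = lemmas$lemma) %>%
  group_by(word) %>%
  summarise(freq = sum(freq)) %>%
  # drop non-word entries such as ’m and • in one pass
  filter(grepl("^[a-zA-Z]+$", word)) %>%
  arrange(desc(freq))
```

With this, received (701) and receive (380) collapse into a single receive row with freq 1081, and the regex filter takes care of the stray punctuation and clitic tokens.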