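I plan to run a text analysis in R with my own custom dictionary, built around a "trade" versus "law" distinction, much like a sentiment analysis.
I have all the words needed for the dictionary in an Excel file. It looks like this: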
> % 1 Trade 2 Law % business 1 exchange 1 industry 1 rule 2
> settlement 2 umpire 2 court 2 tribunal 2 lawsuit 2 bench 2
> courthouse 2 courtroom 2
What steps do I have to take to convert this into a format R can work with and apply it to my text corpus?
Thanks for your help!
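Create a data.frame with two columns (the word and its sector) and store it as an rds file, a database object, or an Excel file, so you can load it whenever you need it.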
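As a rough illustration of the "store it and reload it" step, here is a minimal base R sketch. The object name `scoring`, the shortened word list, and the file location are assumptions made for the example. If the words currently live in Excel, a package such as readxl can read the sheet into a data.frame first; that step is left out here because the sheet layout is not known.

```r
# Hypothetical dictionary with the same two columns as my_scoring_df
scoring <- data.frame(
  word   = c("business", "exchange", "court", "lawsuit"),
  sector = c(1L, 1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Save the dictionary once (assumed location: the session's temp directory)
dict_path <- file.path(tempdir(), "my_scoring_df.rds")
saveRDS(scoring, dict_path)

# Reload it in any later session
reloaded <- readRDS(dict_path)
str(reloaded)
```

An rds file keeps the column types exactly as they were, which is why it is a convenient choice when the dictionary is only ever used from R.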
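Once the data is in a data.frame, you can use a join or a dictionary lookup to match it against the words in your text corpus. In the scoring data.frame I use 1 and 2 to denote the sectors, but you could use words instead.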
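If readable labels are preferred over the codes, one possible approach is to add a labelled column. This is only a sketch: the column name `sector_name` is hypothetical, and the labels follow the header of the dictionary in the question (1 = Trade, 2 = Law).

```r
# Hypothetical three-word excerpt of the scoring data.frame
scoring <- data.frame(
  word   = c("business", "rule", "court"),
  sector = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Map the numeric codes to the labels from the dictionary header
scoring$sector_name <- factor(scoring$sector,
                              levels = c(1, 2),
                              labels = c("Trade", "Law"))
scoring
```

Keeping the numeric column alongside the label means existing joins on `sector` keep working.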
See the example below using tidytext, but do read up on sentiment analysis and use whichever package you need.
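library(tidytext)
library(dplyr)

# Two example documents, one row per document
text_df <- data.frame(id = 1:2,
                      text = c("The business is in the mining industry and has a settlement.",
                               "The court ordered the business owner to settle the lawsuit."))

# Split the text into one word per row, then keep only the dictionary words
# (my_scoring_df is defined under "Data" at the end; the join uses the shared "word" column)
text_df %>%
  unnest_tokens(word, text) %>%
  inner_join(my_scoring_df)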
Joining, by = "word"
  id       word sector
1  1   business      1
2  1   industry      1
3  1 settlement      2
4  2      court      2
5  2   business      1
6  2    lawsuit      2
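To get from matched words to a per-document score, as in a sentiment analysis, the matches can be counted per document and sector. The following is a sketch under assumptions: `docs` and `dict` are small stand-ins for `text_df` and `my_scoring_df` so the block runs on its own, and the column name `n_words` is made up for the example.

```r
library(dplyr)
library(tidytext)

# Stand-ins for text_df and my_scoring_df (assumed names, shortened dictionary)
docs <- data.frame(
  id   = 1:2,
  text = c("The business is in the mining industry and has a settlement.",
           "The court ordered the business owner to settle the lawsuit."),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  word   = c("business", "industry", "settlement", "court", "lawsuit"),
  sector = c(1L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# One row per (document, sector) pair with the number of matched words
sector_counts <- docs %>%
  unnest_tokens(word, text) %>%
  inner_join(dict, by = "word") %>%
  count(id, sector, name = "n_words")

sector_counts
```

Note that the join matches exact word forms only: "settle" in the second document is not counted because the dictionary contains "settlement" but not "settle". If inflected forms matter, they would have to be added to the dictionary or the tokens stemmed first.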
Data:
my_scoring_df <- structure(list(word = c("business", "exchange", "industry", "rule",
"settlement", "umpire", "court", "tribunal", "lawsuit", "bench",
"courthouse", "courtroom"), sector = c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-12L))