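I am fairly new to entity recognition, but I have been working through this helpful guide, and I have a large text corpus of congressional policy debates from a Latin American country, translated into English.
My goal is to find out which members of congress mention a specific free trade agreement, "NAFTA", most often, and what their political affiliation is. After that I will analyse sentiment or opinion about the agreement by political affiliation, but for now I am working on the first task, and I am not sure whether entity recognition will help. The main parties are abbreviated "WP", "PRD", and "PAN".
Here is my current attempt: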
# Install and load required packages
# install.packages("pdftools")
# install.packages("spacyr")
library(pdftools)
library(spacyr)
library(quanteda)
library(dplyr)

# spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)

pdf_files <- c("df1.pdf", "df2.pdf", "df3.pdf")

# Function to extract text from PDFs (pdf_text() returns one string per page)
extract_text_from_pdf <- function(pdf_files) {
  texts <- lapply(pdf_files, function(file) {
    # Extract text from each PDF file
    pdf_text(file)
  })
  return(unlist(texts))
}

# Extract text from PDFs
pdf_texts <- extract_text_from_pdf(pdf_files)

# Process text using spaCy
parsed_texts <- spacy_parse(pdf_texts)
parsed_texts
Here is a sample of the data:
dput(parsed_texts[1:25,(1:7)])
Output:
structure(list(doc_id = c("text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1",
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), token_id = 1:25,
token = c("12/19/23", ",", "4:41", "PM", " ",
"about", ":", "blank", "\n\n\n\n ", "Parliament", "No", ":",
" ", "14", "\n\n ", "Session", "No", ":", " ",
"1", "\n\n ", "Volume", "No", ":", " "),
lemma = c("12/19/23", ",", "4:41", "pm", " ",
"about", ":", "blank", "\n\n\n\n ", "Parliament", "no", ":",
" ", "14", "\n\n ", "Session", "no", ":", " ",
"1", "\n\n ", "volume", "no", ":", " "),
pos = c("NUM", "PUNCT", "NUM", "NOUN", "SPACE", "ADP", "PUNCT",
"ADJ", "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM",
"SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", "SPACE",
"NOUN", "NOUN", "PUNCT", "SPACE"), entity = c("CARDINAL_B",
"", "TIME_B", "TIME_I", "", "", "", "ORG_B", "ORG_I", "ORG_I",
"", "", "", "", "", "", "", "", "", "CARDINAL_B", "", "",
"", "", "")), row.names = c(NA, 25L), class = c("spacyr_parsed",
"data.frame"))
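Ideally, I would like the result to look like the following, whereas at the moment both the parties and the trade agreement sit under the "entity" column of my df.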
token     entity  political_affiliation  congress_member_share_of_NAFTA_mentions
Rafael    NAFTA   WP                     3%
Martinez  NAFTA   WP                     7%
Martinez  NAFTA   WP                     7%
Alberto   NAFTA   PAN                    36%
Alberto   NAFTA   PAN                    36%
Rafael    NAFTA   PAN                    24%
Rafael    NAFTA   PAN                    24%
Alberto   NAFTA   PAN                    36%
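I don't think entity recognition is the right tool for this problem. "NAFTA" is a single known term, so you can search for it directly. Start by analysing your data to see how NAFTA actually shows up when people talk about it.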
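As a rough illustration of that kind of direct analysis, here is a minimal base R sketch that counts whole-word matches of "NAFTA" per speech and turns them into each member's share of all mentions. It assumes you have already split the debates into one row per speech with the speaker and party attached; the `speeches` data frame, its column names, and the helper `count_mentions` are all hypothetical, since the question does not show how speakers are marked in the PDFs.

```r
# Hypothetical toy data: one row per speech, speaker and party already extracted.
speeches <- data.frame(
  speaker = c("Rafael", "Martinez", "Alberto", "Alberto"),
  party   = c("WP", "WP", "PAN", "PAN"),
  text    = c(
    "NAFTA has hurt small farmers.",
    "We should renegotiate NAFTA. NAFTA needs labor rules.",
    "Trade under NAFTA grew quickly.",
    "The budget debate continues."
  ),
  stringsAsFactors = FALSE
)

# Count whole-word, case-sensitive matches of a term in each text.
# gregexpr() returns -1 when there is no match, so sum(h > 0) gives 0 then.
count_mentions <- function(text, term = "NAFTA") {
  hits <- gregexpr(paste0("\\b", term, "\\b"), text, perl = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

speeches$nafta_mentions <- count_mentions(speeches$text)

# Total mentions per member, then each member's share of all mentions.
by_member <- aggregate(nafta_mentions ~ speaker + party, data = speeches, FUN = sum)
by_member$share <- by_member$nafta_mentions / sum(by_member$nafta_mentions)
by_member[order(-by_member$share), ]
```

The hard part in practice is likely to be building the speaker and party columns from the raw PDF text, which depends on how the transcripts are laid out.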
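Entity recognition is used when there is a class of words that needs to be tagged automatically. For example, once you have analysed the instances of NAFTA, you could go a step further and look at which [trade commodities] are discussed alongside the term "NAFTA", such as "oil", "timber", "paper", and so on. You could then create a custom entity to tag trade commodities, do some manual training to get it working, and then manually stem or merge terms; for example, "oil", "gas", and "petroleum" may be the same thing. You could also use an LLM to tag a sentence for you.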
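The term-merging step can be prototyped without any trained model. The sketch below is a plain dictionary lookup in base R, not a custom spaCy entity: it maps variant words to one canonical label and returns the labels found in a sentence. The lookup table `commodity_map`, its labels, and the helper `tag_commodities` are made up for illustration; you would build the real list from your own corpus.

```r
# Hypothetical lookup table: variant term -> canonical commodity label.
commodity_map <- c(
  oil       = "oil_and_gas",
  gas       = "oil_and_gas",
  petroleum = "oil_and_gas",
  timber    = "timber",
  lumber    = "timber",
  paper     = "paper"
)

# Return the distinct canonical labels whose variants appear in a sentence.
tag_commodities <- function(sentence, map = commodity_map) {
  # Lowercase and split on anything that is not a letter.
  words <- tolower(unlist(strsplit(sentence, "[^A-Za-z]+")))
  unique(unname(map[words[words %in% names(map)]]))
}

tag_commodities("NAFTA changed petroleum and oil exports, and lumber too.")
```

A lookup like this is also a cheap way to produce the first labelled examples if you later decide to train a custom entity.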