Using entity recognition in R (spacy) to match entities to individuals


I am relatively new to entity recognition, but I have been working through this helpful guide, and I have a large text corpus of congressional policy debates from a Latin American country, translated into English.

My goal is to explore which members of congress mention a particular free trade agreement, "NAFTA", most often, and what their political affiliations are. I will then analyze sentiment or opinions about this agreement by political party, but right now I am working on the first task, and I am not sure whether entity recognition will help. The main parties are abbreviated "WP", "PRD", and "PAN".

Here is my current attempt:

# Install and load required packages
# install.packages("pdftools")
# install.packages("spacyr")
library(pdftools)
library(spacyr)
library(quanteda)
library(dplyr)

spacy_initialize(model = "en_core_web_sm")
## successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)

pdf_files <- c("df1.pdf", "df2.pdf", "df3.pdf")

# Function to extract text from PDFs
extract_text_from_pdf <- function(pdf_files) {
  texts <- lapply(pdf_files, function(file) {
    # Extract the text of each PDF file (one element per page)
    pdf_text(file)
  })
  # Flatten so every page becomes one element of the character vector
  return(unlist(texts))
}

# Extract text from PDFs
pdf_texts <- extract_text_from_pdf(pdf_files)

# Process text with spaCy (entity annotation is on by default)
parsed_texts <- spacy_parse(pdf_texts)
parsed_texts
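
For reference, a minimal sketch (assuming the parsed_texts object above) of pulling the recognized entities out of the parse with spacyr's entity_extract(), which collapses multi-token entities such as ORG_B/ORG_I into single rows:

library(spacyr)
library(dplyr)

# One row per recognized entity, with columns doc_id, sentence_id, entity, entity_type
entities <- entity_extract(parsed_texts, type = "all")
head(entities)

# How often each distinct entity string was tagged across the documents
entities %>%
  count(entity, entity_type, sort = TRUE) %>%
  head(20)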

Here is a sample of the data:

dput(parsed_texts[1:25,(1:7)])

Output:

structure(list(doc_id = c("text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1", 
"text1", "text1", "text1", "text1", "text1", "text1", "text1"
), sentence_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), token_id = 1:25, 
    token = c("12/19/23", ",", "4:41", "PM", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "No", ":", 
    "               ", "14", "\n\n ", "Session", "No", ":", "                  ", 
    "1", "\n\n ", "Volume", "No", ":", "                   "), 
    lemma = c("12/19/23", ",", "4:41", "pm", "                                                ", 
    "about", ":", "blank", "\n\n\n\n ", "Parliament", "no", ":", 
    "               ", "14", "\n\n ", "Session", "no", ":", "                  ", 
    "1", "\n\n ", "volume", "no", ":", "                   "), 
    pos = c("NUM", "PUNCT", "NUM", "NOUN", "SPACE", "ADP", "PUNCT", 
    "ADJ", "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", 
    "SPACE", "PROPN", "NOUN", "PUNCT", "SPACE", "NUM", "SPACE", 
    "NOUN", "NOUN", "PUNCT", "SPACE"), entity = c("CARDINAL_B", 
    "", "TIME_B", "TIME_I", "", "", "", "ORG_B", "ORG_I", "ORG_I", 
    "", "", "", "", "", "", "", "", "", "CARDINAL_B", "", "", 
    "", "", "")), row.names = c(NA, 25L), class = c("spacyr_parsed", 
"data.frame"))

Ideally, I would like the results to look like the following, whereas at the moment both the parties and the trade agreement sit under the "entity" column of my df.

token      entity   political affiliation   congress_member_share_of_NAFTA_mentions
Rafael     NAFTA    WP                      3%
Martinez   NAFTA    WP                      7%
Martinez   NAFTA    WP                      7%
Alberto    NAFTA    PAN                     36%
Alberto    NAFTA    PAN                     36%
Rafael     NAFTA    PAN                     24%
Rafael     NAFTA    PAN                     24%
Alberto    NAFTA    PAN                     36%
Tags: r, dplyr, deep-learning, spacy, quanteda
1 Answer

I don't think entity recognition is the right fit for this problem. You should do some exploration of the data to understand how NAFTA actually appears when people talk about it.
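
A minimal sketch of that exploration step (not from the original answer), assuming the pdf_texts character vector from the question and using quanteda's keyword-in-context view:

library(quanteda)

# Build a corpus and tokenize it
toks <- tokens(corpus(pdf_texts))

# Every occurrence of "NAFTA" with 10 tokens of context on each side
kwic(toks, pattern = "NAFTA", window = 10)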

  • If they always refer to it with the literal text "NAFTA" (which is very likely), then simply counting the occurrences is enough. You can do that easily!
  • If it is referenced in only a handful of ways (say it gets abbreviated to "NAF", or some people call it "the North American Free Trade Agreement"), I would just record all of those variants (there are probably only 5-6 ways it gets referenced) and count the occurrences; see the sketch after this list.
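
Not part of the original answer: a minimal sketch of that counting step with quanteda, assuming the pdf_texts vector from the question and treating the listed variant spellings as hypothetical:

library(quanteda)

# Dictionary mapping every way the agreement might be referenced to one key
nafta_dict <- dictionary(list(
  nafta = c("NAFTA", "NAF", "North American Free Trade Agreement")
))

toks <- tokens(corpus(pdf_texts))

# Replace matching tokens/phrases with the dictionary key, then count per document
counts <- dfm(tokens_lookup(toks, nafta_dict))
convert(counts, to = "data.frame")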

Entity recognition is used when there is a class of words you want tagged automatically. For example, in your analysis, after looking at the instances of NAFTA you could go a step further and see which [traded goods] are being discussed alongside the term "NAFTA", e.g. "oil", "timber", "paper", and so on. For that, you could create a custom entity to tag traded goods, do some manual training to get it working, and then manually stem/combine terms, e.g. "petroleum", "gas", and "oil" might be the same thing. You could also use an LLM to tag sentences for you.
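
Not from the original answer: a minimal sketch of that manual stemming/combining step, using a hypothetical lookup table with made-up goods terms:

library(dplyr)

# Hypothetical lookup: each surface form maps to one canonical label
goods_lookup <- c(
  "petroleum" = "oil",
  "crude"     = "oil",
  "gas"       = "oil",
  "lumber"    = "timber",
  "wood"      = "timber"
)

goods <- data.frame(term = c("petroleum", "lumber", "paper", "gas"))

# Terms not in the lookup keep their original spelling
goods %>%
  mutate(canonical = coalesce(goods_lookup[term], term)) %>%
  count(canonical, sort = TRUE)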
