Tagging persNames with fuzzy search in Python

Question

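I have to work with XML files. In the first file ("aix_xml_raw"), personal names are already tagged with persName. I have a second file ("wien_xml_raw") that contains the same text, but with different spelling and some new text passages. Using fuzzy search, I want to find all values of the persName elements from the first document in the second document (for example, "mr.l Conte de Sle" in the first document should also match "mr.le C. de Sli." in the second document) and tag them with persName. I have two solutions, and both work if you run the code with an if condition that does not involve fuzzy matching. Why doesn't it work with fuzzy matching?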


aix_xml_raw = """
    <doc><p>Louërent Dieu, mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> ne voulut pas
        appliquer mon voyage a mon avantage il crut que cela ressembloit fort a l’avanture,
        et que la peur de me confesser au <persName>RP. Br.</persName> m’avoit fait aller a 
        <placeName>Hilzing</placeName> 
        Je ne m’excuse point, laissant au jugement de ceux qui liront ces lignes </p>
        <gap/> 
        <p>Votre Saint Nom en soit béni, loué et glorifié. Amen.</p></doc>
        """
wien_xml_raw = """
    <doc>
    <line>louerent Dieu, mais mr. le C. de Sli. ne voulut  pas appliquer mon voyage a mon avantage 
    il crut que cela rasembloit fort a l’avanture, et que la peur de me confesser au RP. Br.
    m’avoit fait aller a Hitzing Je ne m’excuse pas sur ce point, 
    laissant au jugement de ceux qui liront ces ligne votre st nom en soitloué, et glorifié, amen.</line>
    </doc>
"""

Solution 1:

from bs4 import BeautifulSoup, Tag
from fuzzywuzzy import fuzz

# Parse the first document
soup1 = BeautifulSoup(aix_xml_raw, 'xml')

# Find all persName tags and extract their values
pers_names = [tag.text for tag in soup1.find_all('persName')]

# Parse the second document
soup2 = BeautifulSoup(wien_xml_raw, 'xml')

# Find all text nodes in the second document
text_nodes = soup2.find_all(text=True)

# Loop over each text node and replace fuzzy matches with tagged values
for node in text_nodes:
    for name in pers_names:
        if fuzz.token_sort_ratio(name, node.strip()) > 90:
            # Create a new persName tag and insert it before the found value
            new_tag = Tag(name='persName')
            new_tag.string = name
            node.replace_with(node.replace(name, str(new_tag)))

# Print the modified second document
print(soup2.prettify())
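
A likely reason the fuzzy condition in Solution 1 never fires is that each name is compared against an entire text node. Similarity ratios are normalized by the combined length of both strings, so a short name scored against a long line stays low even when the name occurs in it almost verbatim. In addition, `node.replace(name, ...)` looks for the exact spelling from the first document, which is not present in the second. The sketch below illustrates this with the standard library's `difflib` as a stand-in for `fuzzywuzzy` (an assumption made only to keep the example dependency-free); the `similarity` helper and the `candidate` string are hypothetical names introduced for illustration.

```python
from difflib import SequenceMatcher

# Name as tagged in the first document, and an excerpt of the second document
name = "mr. l Conte de Sle."
line = ("louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer "
        "mon voyage a mon avantage")
candidate = "mr. le C. de Sli."  # the span we actually want to find


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] based on difflib's ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


whole_line_score = similarity(name, line)      # short name vs. long line
substring_score = similarity(name, candidate)  # short name vs. span of similar length

print(f"name vs. whole line: {whole_line_score:.2f}")
print(f"name vs. candidate:  {substring_score:.2f}")
print("exact name present in line:", name in line)
```

The candidate span scores clearly higher than the whole line, but it may still fall below a strict cutoff such as 0.8 or 90, so the threshold probably needs tuning against real data.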

Solution 2:

import difflib
import xml.etree.ElementTree as ET

# define a function to get the person names from the first xml document
def get_person_names(xml_str):
    person_names = []
    root = ET.fromstring(xml_str)
    for pers_name in root.iter('persName'):
        person_names.append(pers_name.text.strip())
    return person_names

# define a function to find and tag person names in the second xml document
def tag_person_names(xml_str, person_names):
    root = ET.fromstring(xml_str)
    for line in root.iter('line'):
        tagged_line = line.text
        for name in person_names:
            # perform fuzzy string matching and tag the person names if found
            if difflib.SequenceMatcher(None, name.lower(), line.text.lower()).ratio() >= 0.8:
                tagged_line = tagged_line.replace(name, '<persName>{}</persName>'.format(name))
        line.text = tagged_line
    return ET.tostring(root, encoding='unicode')

person_names = get_person_names(aix_xml_raw)
tagged_wien_xml_raw = tag_person_names(wien_xml_raw, person_names)
print(tagged_wien_xml_raw)
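
One possible way around the problems in Solution 2 is sketched below, under several assumptions. Instead of scoring a name against the whole line, it slides a window of roughly the same number of words across the line and keeps the best-scoring span. It also collects names with `itertext()`, because `pers_name.text` stops at the first child element (`<ex>`), and it inserts real `persName` child elements, because a string containing `<persName>` assigned to `line.text` is escaped on serialization. The helper names `best_fuzzy_span` and `tag_person_names_fuzzy`, the threshold of 0.7, and the window sizes are all hypothetical choices, not tested values, and the inputs are shortened excerpts of the two documents.

```python
import re
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def get_person_names(xml_str):
    """Collect the full text of each persName, including nested elements like <ex>."""
    root = ET.fromstring(xml_str)
    return ["".join(el.itertext()).strip() for el in root.iter("persName")]


def best_fuzzy_span(name, text, threshold=0.7):
    """Return (start, end, score) of the word window most similar to name, or None."""
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    n_words = len(name.split())
    best = None
    # Try windows one word shorter than, equal to, and one word longer than the name
    for size in range(max(1, n_words - 1), n_words + 2):
        for i in range(len(words) - size + 1):
            start, end = words[i][0], words[i + size - 1][1]
            score = SequenceMatcher(None, name.lower(), text[start:end].lower()).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (start, end, score)
    return best


def tag_person_names_fuzzy(xml_str, names, threshold=0.7):
    """Wrap the best fuzzy match of each name in a <persName> child element.

    Assumes each <line> holds plain text only (no existing child elements).
    """
    root = ET.fromstring(xml_str)
    for line in root.iter("line"):
        text = line.text or ""
        spans = sorted(s for s in (best_fuzzy_span(n, text, threshold) for n in names) if s)
        pos, prev = 0, None
        for start, end, _ in spans:
            if start < pos:
                continue  # skip a span that overlaps one already tagged
            # Text before the match becomes line.text or the previous tag's tail
            if prev is None:
                line.text = text[pos:start]
            else:
                prev.tail = text[pos:start]
            prev = ET.SubElement(line, "persName")
            prev.text = text[start:end]
            pos = end
        if prev is not None:
            prev.tail = text[pos:]
    return ET.tostring(root, encoding="unicode")


# Shortened excerpts of the two documents from the question
aix = ("<doc><p>Louërent Dieu, mais <persName>mr. l C<ex>onte</ex> de Sle.</persName> "
       "ne voulut pas, et que la peur de me confesser au "
       "<persName>RP. Br.</persName> m’avoit fait aller</p></doc>")
wien = ("<doc><line>louerent Dieu, mais mr. le C. de Sli. ne voulut pas appliquer "
        "mon voyage a mon avantage il crut que cela rasembloit fort a l’avanture, "
        "et que la peur de me confesser au RP. Br. m’avoit fait aller a Hitzing"
        "</line></doc>")

names = get_person_names(aix)
tagged = tag_person_names_fuzzy(wien, names)
print(tagged)
```

This sketch tags the spelling found in the second document rather than substituting the first document's spelling, and it tags at most one occurrence per name per line. A lower threshold finds more variants but raises the risk of false positives in lines that contain no name at all.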
