Problem with finding and replacing text in Python

Question (votes: 0, answers: 1)

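I have a very specific function. I have two strings: one is a backup of the input to the code, and the second has been modified by steps such as replacing spaces and extracting information (the details are not important here).

I need to find the match between these strings even though the first one has been modified. Once a match is found, I need to store the matching part from the original string (unmodified) and remove it from "sub_str" / "modified_sub_str".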

import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)

        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

However, I have a specific problem with this code. For example, given inputs like these:

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"

main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]" 

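the code finds the match and returns "original_sub_str", but it fails to remove the match from "modified_sub_str".

The same problem occurs with these inputs (each pair is "sub_str" followed by "main_str"):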

"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"

"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”" 
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"

"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]" 

I could not find a solution even with the help of AI, but I know the problem lies in the replace step, special symbols, and case sensitivity.
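
The suspicion about special symbols can be checked in isolation. The following is a minimal reproduction sketch, assuming Python 3 and the "73. Zjazd chemikov" input pair listed above; the variable names normalized, escaped, and result are illustrative only and do not appear in the function.

```python
import re

sub_str = "73.zjazdchemikov,zborníkabstraktov"
normalized = sub_str.lower().replace(" ", "").replace(",", "").replace("-", "")
escaped = re.escape(normalized)

# re.escape() puts a backslash before ".", so the escaped pattern
# is no longer equal to the plain normalized text.
print(normalized)             # 73.zjazdchemikovzborníkabstraktov
print(escaped == normalized)  # False

# The match in its original form contains spaces that sub_str never had,
# so substituting it out of sub_str leaves sub_str unchanged.
original_sub_str = "73. Zjazd chemikov, zborník abstraktov"
result = re.sub(re.escape(original_sub_str), "", sub_str, flags=re.IGNORECASE)
print(result == sub_str)      # True
```

Both checks point the same way: the equality test compares plain text with an escaped pattern, and the fallback substitution searches for a spaced string inside a string with no spaces.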

python replace extract text-mining
1 Answer
Votes: 0

I modified the find_and_save function to improve matching accuracy. The normalized text is now kept separate from the escaped regex pattern, and skipped characters can no longer shift the start index.

import re

def find_and_save(sub_str, main_str):
    # Normalize both strings: lowercase, drop spaces, commas, and hyphens
    sub_key = sub_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    main_key = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

    # Escape only the search pattern; sub_key itself stays unescaped
    match = re.search(re.escape(sub_key), main_key) if sub_key else None

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        # Map positions in main_key back to positions in main_str
        for i, c in enumerate(main_str):
            if c in [' ', ',', '-']:
                continue  # skipped characters must not move the indices
            count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # The pattern is the whole normalized sub_str, so a match consumes all of it
        modified_sub_str = ""

        return modified_sub_str, original_sub_str
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"

modified_sub_str, original_sub_str = find_and_save(sub_str, main_str)
print("Modified Substring:", modified_sub_str)
print("Original Substring:", original_sub_str)
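
As an alternative to counting characters in a loop, the position mapping can be recorded while normalizing. The following is only a sketch under stated assumptions: normalize_with_map and find_original are hypothetical helper names, not part of the code above, and the set of ignored characters is taken from the question.

```python
SKIPPED = {" ", ",", "-"}  # assumed set of ignorable characters, as in the question

def normalize_with_map(text):
    """Return (normalized_text, positions), where positions[k] is the index
    in `text` of the character that produced normalized_text[k]."""
    chars, positions = [], []
    for i, c in enumerate(text):
        if c in SKIPPED:
            continue
        for lc in c.lower():  # lower() can yield more than one character
            chars.append(lc)
            positions.append(i)
    return "".join(chars), positions

def find_original(sub_str, main_str):
    """Hypothetical helper: return the slice of main_str that matches sub_str
    after normalization, or None when there is no match."""
    sub_key, _ = normalize_with_map(sub_str)
    main_key, positions = normalize_with_map(main_str)
    if not sub_key:
        return None
    start = main_key.find(sub_key)  # plain substring search, no regex escaping needed
    if start == -1:
        return None
    end = start + len(sub_key)
    return main_str[positions[start]:positions[end - 1] + 1]

print(find_original(
    "73.zjazdchemikov,zborníkabstraktov",
    "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]",
))
# 73. Zjazd chemikov, zborník abstraktov
```

Because the search is a plain str.find on normalized text, regex metacharacters such as "." or "/" need no special handling. One caveat of any approach that removes separators is that neighbouring tokens fuse, so "1, 73" becomes "173" in the normalized text.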