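I have a very specific function. I have two strings: one is a backup of the code's input, and the second has been modified by steps such as replacing spaces and extracting information (the details are not important here).
I need to find a match between these strings even though the first string has been modified. Once a match is found, I need to store the match from the original string (unmodified) and remove it from "sub_str"/"modified_sub_str".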
import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)
    if match:
        start = match.start()
        end = match.end()
        count = 0
        original_start = 0
        original_end = 0
        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
                if count == start + 1:
                    original_start = i
                if count == end:
                    original_end = i + 1
                    break
        original_sub_str = main_str[original_start:original_end]
        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)
        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
However, I have a specific problem with this code. For example, given input like
sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
and
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"
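the code finds the match and returns "original_sub_str", but it fails to remove the match from "modified_sub_str".
The same problem occurs with these inputs, listed as "sub_str" - "main_str" pairs: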
"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"
"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”"
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"
"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
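I could not find a solution even with the help of AI, but I know the problem involves the replace calls, special characters, and case sensitivity.
The find_and_save function has been modified to improve matching accuracy.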
import re

def find_and_save(sub_str, main_str):
    # Normalize both strings: lowercase, with spaces, commas, and hyphens removed
    sub_str_norm = sub_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    # Escape only the search pattern; sub_str_norm stays unescaped for the comparison below
    sub_str_mod = re.escape(sub_str_norm)
    match = re.search(sub_str_mod, main_str_mod)
    if match:
        start = match.start()
        end = match.end()
        count = 0
        original_start = 0
        original_end = 0
        # Map the match positions back to indices in the original main_str
        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
                if count == start + 1:
                    original_start = i
                if count == end:
                    original_end = i + 1
                    break
        original_sub_str = main_str[original_start:original_end]
        # Compare with the unescaped normalized string: the escaped pattern differs
        # from it whenever sub_str contains regex metacharacters such as "."
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_norm:
            modified_sub_str = ""
        else:
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)
        return modified_sub_str, original_sub_str
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
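One pitfall worth checking in a function like this is that re.escape returns a regex pattern, not the text itself, so an equality test against the escaped value can fail even when the match is complete. The short sketch below illustrates this with one of the inputs from the question; normalize is a hypothetical helper that only mirrors the replace chain used above, and exactly which characters get escaped depends on the Python version.

```python
import re

def normalize(text):
    # Hypothetical helper: same normalization as in find_and_save
    return text.lower().replace(" ", "").replace(",", "").replace("-", "")

needle = normalize("73.zjazdchemikov,zborníkabstraktov")
pattern = re.escape(needle)  # the "." is escaped to "\."

# The pattern still matches the text it was built from...
print(bool(re.fullmatch(pattern, needle)))  # True
# ...but it is no longer equal to that text as a plain string
print(needle == pattern)  # False
```

When the comparison fails this way, the function falls through to the re.sub branch, which looks for the spaced, mixed-case original_sub_str inside the space-free sub_str and so removes nothing.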
sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"
modified_sub_str, original_sub_str = find_and_save(sub_str, main_str)
print("Modified Substring:", modified_sub_str)
print("Original Substring:", original_sub_str)
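As an alternative to counting characters in a second pass, the mapping back to the original text can be recorded while normalizing. The following is only a sketch under stated assumptions, not the function above: find_original_span and IGNORED are hypothetical names, and it swaps the regex search for a plain str.find, which avoids escaping altogether.

```python
IGNORED = {" ", ",", "-"}  # characters skipped during matching (assumed set)

def find_original_span(sub_str, main_str):
    """Return the slice of main_str that matches sub_str, ignoring case
    and IGNORED characters, or None if there is no match."""
    chars = []      # normalized characters of main_str
    positions = []  # positions[k] = index in main_str that produced chars[k]
    for i, c in enumerate(main_str):
        if c in IGNORED:
            continue
        # c.lower() can yield more than one character, so map each one back to i
        for lowered in c.lower():
            chars.append(lowered)
            positions.append(i)
    haystack = "".join(chars)
    needle = "".join(c.lower() for c in sub_str if c not in IGNORED)
    if not needle:
        return None
    start = haystack.find(needle)
    if start == -1:
        return None
    end = start + len(needle) - 1
    return main_str[positions[start]:positions[end] + 1]

main_str = "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
print(find_original_span("73.zjazdchemikov,zborníkabstraktov", main_str))
# 73. Zjazd chemikov, zborník abstraktov
```

Because a hit in the normalized text always covers the whole of sub_str, the caller can treat a non-None result as "everything matched" and set modified_sub_str to an empty string.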