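The goal is to extract the matching "words", i.e. matches bounded by \b|$|\s, from the output of difflib's SequenceMatcher.get_matching_blocks(), e.g. given:
s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"
The expected matching blocks to extract are: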
['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
The simple case is when a matching block from difflib is already bounded by \b|$|\s, e.g.
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
    if substring:
        # Check if the matching substring is bounded by word boundaries.
        # re.escape keeps regex metacharacters in the substring literal.
        match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
        if match:
            return match[0]

def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue

# matcher() is a generator, so materialize it to see the results.
list(matcher(s1, s2))
[out]:
['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
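To see why only some phrases survive the check, it can help to print the raw character-level blocks before any filtering. The following is a minimal sketch that reuses the same two strings; the names blocks and raw are only illustrative.

```python
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

# Character-level comparison: blocks may start or end in the middle of a word.
blocks = SequenceMatcher(None, s1, s2).get_matching_blocks()

# The last block is always a zero-size sentinel, so skip empty blocks.
raw = [s1[m.a:m.a + m.size] for m in blocks if m.size]

for piece in raw:
    # repr() makes leading and trailing spaces visible.
    print(repr(piece))
```

Blocks that end mid-word (for example one that stops partway into "Schulrucksack") fail the boundary regex, which is why they need extra handling.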
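Then, to capture HYC00 and Causal Travel, whose matching blocks are "HYC00 Sch" and "men, Causal Travel" respectively, we have to do some "chomping" and remove the left-most "word", the right-most "word", or both, i.e.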
def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue

        # Extract the left chomp substring (drop the first word).
        left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
        match = is_substring_a_phrase(left, s1)
        if match:
            yield match
            continue

        # Extract the right chomp substring (drop the last word).
        right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
        match = is_substring_a_phrase(right, s1)
        if match:
            yield match
            continue

        # Extract the left-and-right chomp substring (drop both end words).
        leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
        match = is_substring_a_phrase(leftright, s1)
        if match:
            yield match
            continue

# matcher() is a generator, so materialize it to see the results.
list(matcher(s1, s2))
[out]:
['HYC00',
'Causal Travel',
'14',
'Laptop',
'Bookbag College Boys Men Work Daypack']
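The four near-identical branches can be folded into a single loop over candidate word slices. This is only a sketch of the same chomping idea, not a different algorithm: the helper name is_phrase is made up here, and it uses re.search with re.escape rather than re.findall.

```python
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_phrase(substring, text):
    """Return the substring if it occurs in text bounded by word boundaries."""
    if not substring:
        return None
    found = re.search(rf"\b{re.escape(substring)}(?=\s|$)", text)
    return found.group(0) if found else None

def matcher(s1, s2):
    for m in SequenceMatcher(None, s1, s2).get_matching_blocks():
        words = s1[m.a:m.a + m.size].split()
        # Try the full block first, then the left, right, and both-sides chomps.
        for candidate in (words, words[1:], words[:-1], words[1:-1]):
            match = is_phrase(" ".join(candidate), s1)
            if match:
                yield match
                break

print(list(matcher(s1, s2)))
```

The order of the candidate tuple mirrors the order of the branches above, so the first slice that passes the boundary check wins.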
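While the snippet above works as expected, my question comes in parts:
- Is there a simpler way to get matching blocks bounded by \b|$|\s?
- Is there something that can be specified in .get_matching_blocks() to get only the parts bounded by \b|$|\s?
From @megaing's comment: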
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

# Compare lists of whitespace-separated tokens instead of characters.
x = SequenceMatcher(None, s1.split(), s2.split())
for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = " ".join(s1.split()[m.a:m.a+m.size])
    print(full_substring)
[out]:
HYC00
Causal Travel
14
Laptop
Bookbag College Boys Men Work Daypack
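If the original spacing of s1 matters, the token-level idea can be combined with character offsets so that each phrase is sliced directly out of s1 rather than re-joined with single spaces. This is a rough sketch; token_spans and matching_phrases are hypothetical helper names, and tokens are simply runs of non-whitespace.

```python
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def token_spans(text):
    """Return (token, start, end) for every run of non-whitespace in text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def matching_phrases(s1, s2):
    t1 = token_spans(s1)
    t2 = token_spans(s2)
    sm = SequenceMatcher(None, [t[0] for t in t1], [t[0] for t in t2])
    phrases = []
    for block in sm.get_matching_blocks():
        if block.size == 0:
            continue  # skip the zero-size sentinel block at the end
        start = t1[block.a][1]                  # start offset of first token
        end = t1[block.a + block.size - 1][2]   # end offset of last token
        phrases.append(s1[start:end])
    return phrases

print(matching_phrases(s1, s2))
# ['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
```

Because tokens are compared as whole strings, punctuation stays attached: "Damen," and "Damen" would count as different tokens unless punctuation is stripped first.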