Chomp left and right of a string by spaces to iteratively match a regex

Question (votes: 0, answers: 1)

The goal is to extract the matching "words" (bounded by \b|$|\s) from the output of difflib's SequenceMatcher.get_matching_blocks(), e.g. given:

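s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"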

The expected matching blocks to extract are:

['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']

The simple case is when a matching block from difflib is already bounded by \b|$|\s, e.g.

import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
  if substring:
    # Check if matching substring is bounded by word boundary.
    match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
    if match: 
      return match[0]

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

list(matcher(s1, s2))

[out]:

['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']

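To also capture HYC00 and Causal Travel, whose matching blocks are "HYC00 Sch" and "men, Causal Travel" respectively, we have to do some "chomping" and drop the leftmost, the rightmost, or both the leftmost and rightmost partial "word", i.e.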

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

    # Extract the left chomp substring.
    left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
    match = is_substring_a_phrase(left, s1)
    if match:
      yield match
      continue


    # Extract the right chomp substring.
    right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
    match = is_substring_a_phrase(right, s1)
    if match:
      yield match
      continue


    # Extract the left-and-right chomp substring.
    leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
    match = is_substring_a_phrase(leftright, s1)
    if match:
      yield match
      continue

list(matcher(s1, s2))

[out]:

['HYC00',
 'Causal Travel',
 '14',
 'Laptop',
 'Bookbag College Boys Men Work Daypack']

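While the snippet above works as expected, my question has several parts:

  • Is there some way to avoid the repeated code for the various chomps and the multiple if-else branches when extracting matching blocks bounded by \b|$|\s?
  • Is there a direct way to tell .get_matching_blocks() to return only the parts bounded by \b|$|\s?
  • Is there another way to achieve the same goal without using get_matching_blocks in this messy manner?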
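On the first bullet: the four branches in the question differ only in which words of the block they keep, so one possible way to remove the duplication is to loop over a tuple of slices. The following is a minimal sketch, not a definitive rewrite. The name CHOMPS is hypothetical, the pattern is passed through re.escape, and the candidates are built with split() and " ".join(), which is assumed to be equivalent to the original strip() only when words are separated by single spaces.

```python
import re
from difflib import SequenceMatcher


def is_substring_a_phrase(substring, s1):
    """Return substring if it occurs in s1 bounded by word boundaries, else None."""
    if not substring:
        return None
    match = re.search(rf"\b{re.escape(substring)}(?=\s|$)", s1)
    return match.group(0) if match else None


# Hypothetical helper constant: slices applied to the block's word list, in the
# same order as the original branches (whole block, drop first word, drop last
# word, drop both).
CHOMPS = (slice(None), slice(1, None), slice(None, -1), slice(1, -1))


def matcher(s1, s2):
    for m in SequenceMatcher(None, s1, s2).get_matching_blocks():
        words = s1[m.a:m.a + m.size].split()
        for chomp in CHOMPS:
            match = is_substring_a_phrase(" ".join(words[chomp]), s1)
            if match:
                yield match
                break  # Stop at the first candidate that is a whole phrase.
```

Because matcher is a generator, the results have to be collected with list(matcher(s1, s2)) to see them as a list.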
python string substring difflib
1 answer

1 vote

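From @megaing's comment: split both strings into words and let SequenceMatcher compare the word lists instead of the characters.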

from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"


x = SequenceMatcher(None, s1.split(), s2.split())

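for m in x.get_matching_blocks():
    # Skip the zero-length dummy block that always ends the list.
    if not m.size:
        continue
    # Join the matched run of words back into a phrase.
    full_substring = " ".join(s1.split()[m.a:m.a+m.size])
    print(full_substring)

[out]: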

HYC00
Causal Travel
14
Laptop
Bookbag College Boys Men Work Daypack
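This also covers the second bullet of the question: get_matching_blocks() takes no arguments, so the word-boundary constraint has to come from the sequences handed to SequenceMatcher. As a sketch, the loop above could be wrapped in a function that returns a list. The name matching_phrases is hypothetical and not part of difflib.

```python
from difflib import SequenceMatcher


def matching_phrases(s1, s2):
    """Return the runs of whitespace-separated words shared by s1 and s2, in s1 order."""
    words1, words2 = s1.split(), s2.split()
    blocks = SequenceMatcher(None, words1, words2).get_matching_blocks()
    # The last block is always a zero-length sentinel, so empty blocks are skipped.
    return [" ".join(words1[m.a:m.a + m.size]) for m in blocks if m.size]


s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

print(matching_phrases(s1, s2))
```

Note that split() leaves punctuation attached to words, so a token such as "Women," only matches another "Women," and not a bare "Women".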