Return the string difference when descriptions are similar enough

Question

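I am working on separating variants out of product descriptions for a grouping exercise.

The product descriptions are sorted, each one is compared with its most similar neighbour, and the words that differ are extracted.

What I want to add is a condition: the difference should only be extracted when the two descriptions share at least n words (say 3, as in the example below).

Here is a sample of the input: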

Product Description
Petzl Red HMS Carabiner
Petzl Blue HMS Carabiner
Petzl HMS Carabiner Orange
Petzl Green Carabiner
Petzl Purple Carabiner
Liquid Chalk - 100ml
Liquid Chalk - 100ml (Case of 10)

Desired output:

Product Description	Variance
Petzl HMS Carabiner	Red
Petzl HMS Carabiner	Blue
Petzl HMS Carabiner	Orange
Petzl Green Carabiner	NaN
Petzl Purple Carabiner	NaN
Liquid Chalk - 100ml	NaN
Liquid Chalk - 100ml	(Case of 10)

This is what I am currently using, but it has no filter on the number of matching words:

import pandas as pd

def get_intersection(descr1, descr2):
    if pd.isna(descr1) or pd.isna(descr2):
        return set()
    return set(descr1.split()).intersection(set(descr2.split()))

def get_unique_words(descr, intersection):
    unique_words = " ".join(
        word for word in descr.split() if word not in intersection
    )
    if len(unique_words) > 0:
        return unique_words

def get_unique_description(row):
    if len(row["next_product_intersection"]) == 0 and len(row["prev_product_intersection"]) == 0:
        return row["Product Description"]
    
    if len(row["next_product_intersection"]) >= len(row["prev_product_intersection"]):
        return row["next_product_unique_words"]
    
    return row["prev_product_unique_words"]


df["next_product"] = df["Product Description"].shift(-1)
df["prev_product"] = df["Product Description"].shift(1)

df["next_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["next_product"]),
    axis=1
)
df["prev_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["prev_product"]),
    axis=1
)

df["next_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["next_product_intersection"]),
    axis=1
)
df["prev_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["prev_product_intersection"]),
    axis=1
)

df["Variance"] = df.apply(get_unique_description, axis=1)
df = df[["Product Description", "Variance"]]
print(df)

How can I add this filter to the framework above?

Thank you in advance.

Answer 1

You can try the approach below, but as @mozway showed, it is easy to find counterexamples!

from difflib import SequenceMatcher
from itertools import chain, permutations

import pandas as pd

def fn(x, y, N=3):
    matches = SequenceMatcher(None, x , y).get_matching_blocks()[:-1]
    mchunks = [x[m.a: m.a + m.size] for m in matches]
    if len(list(chain.from_iterable(mchunks))) >= N:
        return mchunks
    
descs = df["Product Description"].str.split(r"\s+(?![^[\(]*\))")

d = {" ".join(s1): fn(s1, s2) for s1, s2 in
     permutations(descs, r=2) if fn(s1, s2)}

_map = (df["Product Description"].map(d).fillna("")
            .apply(lambda x: list(chain(*x))))

df["Variance"] = [set(des).difference(m).pop()
                  if m and set(des).difference(m) else pd.NA
                  for des, m in zip(descs, _map)]

Output:

print(df)

                 Product Description      Variance
0            Petzl Red HMS Carabiner           Red
1           Petzl Blue HMS Carabiner          Blue
2         Petzl HMS Carabiner Orange        Orange
3              Petzl Green Carabiner          <NA>
4             Petzl Purple Carabiner          <NA>
5               Liquid Chalk - 100ml          <NA>
6  Liquid Chalk - 100ml (Case of 10)  (Case of 10)
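
As an alternative that stays closer to the framework in the question, the word-count filter can be applied directly to the neighbour intersections. The following is only a minimal sketch, not a definitive implementation: the names MIN_COMMON, shared_words and variance are made up for illustration, the threshold of 3 is taken from the question, and it assumes the rows are already ordered so that similar descriptions sit next to each other.

```python
import pandas as pd

MIN_COMMON = 3  # assumed threshold: minimum number of shared words


def shared_words(a, b):
    """Return the set of words two descriptions share (empty if either is missing)."""
    if pd.isna(a) or pd.isna(b):
        return set()
    return set(a.split()) & set(b.split())


def variance(desc, prev_desc, next_desc, min_common=MIN_COMMON):
    """Words of desc not shared with its closest neighbour, or NA if neither
    neighbour shares at least min_common words with it."""
    candidates = [shared_words(desc, next_desc), shared_words(desc, prev_desc)]
    best = max(candidates, key=len)  # on a tie the next row wins, as in the question
    if len(best) < min_common:
        return pd.NA
    unique = " ".join(w for w in desc.split() if w not in best)
    return unique or pd.NA  # nothing left over also counts as "no variance"


df = pd.DataFrame({"Product Description": [
    "Petzl Red HMS Carabiner",
    "Petzl Blue HMS Carabiner",
    "Petzl HMS Carabiner Orange",
    "Petzl Green Carabiner",
    "Petzl Purple Carabiner",
    "Liquid Chalk - 100ml",
    "Liquid Chalk - 100ml (Case of 10)",
]})

col = df["Product Description"]
df["Variance"] = [
    variance(d, p, n) for d, p, n in zip(col, col.shift(1), col.shift(-1))
]
print(df)
```

Note that a plain whitespace split counts the "-" in "Liquid Chalk - 100ml" as a word, so punctuation tokens contribute to the threshold. Like the answer above, this only looks at word overlap, so counterexamples are still easy to construct.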
