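I am trying to separate the variant from each product description for a grouping exercise.
The product descriptions are sorted, each one is compared with its most similar neighbor, and the words that differ are stripped out.
What I would like to add is that the difference is only stripped out when the two descriptions share at least n words (probably 3, as used in the example below).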
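The threshold rule described above can be sketched with plain Python sets. This is only an illustration under assumptions: the helper name `shared_if_enough` and its `min_shared` parameter are hypothetical and are not part of the code further down.

```python
def shared_if_enough(descr1, descr2, min_shared=3):
    """Return the words common to both descriptions, or an empty set
    when fewer than `min_shared` words are shared (hypothetical helper)."""
    common = set(descr1.split()) & set(descr2.split())
    return common if len(common) >= min_shared else set()

# Three shared words (Petzl, HMS, Carabiner): the intersection is kept.
print(shared_if_enough("Petzl Red HMS Carabiner", "Petzl Blue HMS Carabiner"))
# Only two shared words (Petzl, Carabiner): treated as no match.
print(shared_if_enough("Petzl Green Carabiner", "Petzl Purple Carabiner"))
```

Returning an empty set for a weak match means any later step that checks `len(intersection) == 0` would treat the pair as unrelated.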
Here is a sample input:
Product Description |
---|
Petzl Red HMS Carabiner |
Petzl Blue HMS Carabiner |
Petzl HMS Carabiner Orange |
Petzl Green Carabiner |
Petzl Purple Carabiner |
Liquid Chalk - 100ml |
Liquid Chalk - 100ml (Case of 10) |
Desired output:
Product Description | Variance |
---|---|
Petzl HMS Carabiner | Red |
Petzl HMS Carabiner | Blue |
Petzl HMS Carabiner | Orange |
Petzl Green Carabiner | NaN |
Petzl Purple Carabiner | NaN |
Liquid Chalk - 100ml | NaN |
Liquid Chalk - 100ml | (Case of 10) |
This is what I am currently using, but it has no filter on the number of matching words:
import pandas as pd

def get_intersection(descr1, descr2):
    # Words shared by two descriptions; empty set if either one is missing.
    if pd.isna(descr1) or pd.isna(descr2):
        return set()
    return set(descr1.split()).intersection(set(descr2.split()))

def get_unique_words(descr, intersection):
    # Words of descr that are not shared; returns None when nothing is left.
    unique_words = " ".join(
        word for word in descr.split() if word not in intersection
    )
    if len(unique_words) > 0:
        return unique_words

def get_unique_description(row):
    # No overlap with either neighbor: keep the full description.
    if len(row["next_product_intersection"]) == 0 and len(row["prev_product_intersection"]) == 0:
        return row["Product Description"]
    # Otherwise use the neighbor that shares the most words.
    if len(row["next_product_intersection"]) >= len(row["prev_product_intersection"]):
        return row["next_product_unique_words"]
    return row["prev_product_unique_words"]

# Line each description up with its next and previous neighbor.
df["next_product"] = df["Product Description"].shift(-1)
df["prev_product"] = df["Product Description"].shift(1)

df["next_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["next_product"]),
    axis=1
)
df["prev_product_intersection"] = df.apply(
    lambda row: get_intersection(row["Product Description"], row["prev_product"]),
    axis=1
)
df["next_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["next_product_intersection"]),
    axis=1
)
df["prev_product_unique_words"] = df.apply(
    lambda row: get_unique_words(row["Product Description"], row["prev_product_intersection"]),
    axis=1
)

df["Variance"] = df.apply(get_unique_description, axis=1)
df = df[["Product Description", "Variance"]]
print(df)
How can I add this filter to this framework?
Thank you in advance.
You can try the approach below, but as @mozway showed, it is easy to find counterexamples!
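from difflib import SequenceMatcher
from itertools import chain, permutations

import pandas as pd

def fn(x, y, N=3):
    # x and y are lists of words; the last block is a zero-length sentinel.
    matches = SequenceMatcher(None, x, y).get_matching_blocks()[:-1]
    mchunks = [x[m.a: m.a + m.size] for m in matches]
    # Report a match only when at least N words are shared (otherwise None).
    if len(list(chain.from_iterable(mchunks))) >= N:
        return mchunks

# Split on whitespace, except whitespace inside parentheses.
descs = df["Product Description"].str.split(r"\s+(?![^[\(]*\))")

# Map each description to the chunks it shares with another description.
d = {" ".join(s1): fn(s1, s2) for s1, s2 in
     permutations(descs, r=2) if fn(s1, s2)}

# Flatten the chunks into one list of shared words per row.
_map = (df["Product Description"].map(d).fillna("")
        .apply(lambda x: list(chain(*x))))

# The variance is a word that is not among the shared words, if any.
df["Variance"] = [set(des).difference(m).pop()
                  if m and set(des).difference(m) else pd.NA
                  for des, m in zip(descs, _map)]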
Output:
print(df)
Product Description Variance
0 Petzl Red HMS Carabiner Red
1 Petzl Blue HMS Carabiner Blue
2 Petzl HMS Carabiner Orange Orange
3 Petzl Green Carabiner <NA>
4 Petzl Purple Carabiner <NA>
5 Liquid Chalk - 100ml <NA>
6 Liquid Chalk - 100ml (Case of 10) (Case of 10)
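To see why the Green and Purple rows come out as `<NA>`, it may help to look at what `SequenceMatcher` reports for a single pair of word lists. The following is a small sketch using only the standard library; the helper name `matched_words` is made up for illustration and does not appear in the answer's code.

```python
from difflib import SequenceMatcher

def matched_words(x, y):
    """Flatten the matching blocks between two word lists into one list."""
    # The last block returned is always a zero-length sentinel, so drop it.
    blocks = SequenceMatcher(None, x, y).get_matching_blocks()[:-1]
    return [w for b in blocks for w in x[b.a:b.a + b.size]]

red = "Petzl Red HMS Carabiner".split()
blue = "Petzl Blue HMS Carabiner".split()
green = "Petzl Green Carabiner".split()
purple = "Petzl Purple Carabiner".split()

print(matched_words(red, blue))      # three words match, so N=3 is met
print(matched_words(green, purple))  # only two words match, below N=3
```

Because the blocks are computed on word lists rather than on raw strings, the threshold `N` counts whole words, not characters.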