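I have a range of products and need a specific product name shorter than 40 characters for each one. The product names I receive are in a string column, and every entry is longer than 40 characters, so I need to shorten them. I could use some string methods, but then some product names may turn into something meaningless. For example, an input name might be "Cut Resistant Gloves, Size 8, Grey/Black - 12 per DZ" (52). How could I get something like "Resistant Size 8 Grey/Black Gloves" (34)? Thanks in advance.
I want to add a new column to my DataFrame that contains the new product names, each shorter than 40 characters.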
You can adapt the logic implemented below to your requirements:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

MAX_LEN = 39                # the question asks for fewer than 40 characters
Max_Adj_count = 2           # max number of adjectives permissible (example value, tune as needed)
Max_noun_count = 2          # max number of nouns permissible (example value, tune as needed)
SIZE_WORDS = ["size", "vol", "volume"]

def shorten_product_name(product_name):
    doc = nlp(product_name)
    noun_tokens = []
    adjective_tokens = []
    size_tokens = []
    # Iterate over tokens and identify nouns, adjectives, and size/volume information.
    # The size checks come first: "Size" may itself be tagged as a noun and would
    # otherwise never reach them.
    for token in doc:
        if token.lower_ in SIZE_WORDS:
            size_tokens.append(token.text)
        elif token.pos_ == "NUM" and token.head.text.lower() in SIZE_WORDS:
            size_tokens.append(token.text)
        elif token.pos_ == "NOUN":
            noun_tokens.append(token.text)
        elif token.pos_ == "ADJ":
            adjective_tokens.append(token.text)
    # Determine the number of adjectives and nouns to include
    num_adjectives = min(len(adjective_tokens), Max_Adj_count)
    num_nouns = min(len(noun_tokens), Max_noun_count)
    # Construct the shortened name: adjectives, then size info, then nouns.
    # Two size tokens are kept so that both the keyword and its number survive.
    shortened_tokens = []
    shortened_tokens.extend(adjective_tokens[:num_adjectives])
    shortened_tokens.extend(size_tokens[:2])
    shortened_tokens.extend(noun_tokens[:num_nouns])
    shortened_name = " ".join(shortened_tokens)
    # If the shortened name is still too long, drop trailing words until it fits
    while len(shortened_name) > MAX_LEN and " " in shortened_name:
        shortened_name = shortened_name.rsplit(" ", 1)[0]
    return shortened_name

# To add the new column (df and both column names are placeholders for your own data):
# df["short_name"] = df["product_name"].apply(shorten_product_name)
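
The final length check can also be pulled out into a small helper and tested on its own, without loading the spaCy model. The following is only a minimal sketch: the function name `truncate_at_word_boundary` and the default limit of 39 (that is, fewer than 40 characters) are assumptions made for illustration, not part of the code above.

```python
def truncate_at_word_boundary(name: str, limit: int = 39) -> str:
    """Return `name` cut to at most `limit` characters without splitting a word."""
    words = name.split()
    kept = []
    length = 0
    for word in words:
        # Count one extra character for the joining space once a word is already kept
        extra = len(word) + (1 if kept else 0)
        if length + extra > limit:
            break
        kept.append(word)
        length += extra
    if not kept and words:
        # A single word longer than the limit: fall back to a hard cut
        return words[0][:limit]
    return " ".join(kept)


if __name__ == "__main__":
    # Hypothetical input, used only to show the call
    print(truncate_at_word_boundary("Resistant Grey Black Size 8 Gloves Extra Long Suffix"))
```

Unlike keeping a fixed number of words, this counts characters, so the result never exceeds the limit whatever the word lengths are. On a DataFrame it could be applied with something like `df["short_name"] = df["short_name"].apply(truncate_at_word_boundary)`, where `df` and the column name are again placeholders.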