Matching hundreds of non-adjacent keywords in a large text corpus in Python

Question

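I need to match non-adjacent keywords in a large number of texts (several thousand). If there is a match, a label is assigned; otherwise the label "unknown" is assigned.

As an example, I want to find the keywords sales representative and dealt in the text snippet below and assign it to the category keyword pattern A:

Text: "The sales representative dealt with everything. Knowing that he compiled the best options for me was really helpful."

The keyword pattern is therefore sales representative and dealt.

Since a sales representative might also be called a sales rep or a customer rep, I need to match several keywords. The same holds for the word dealt. So you can see where things get complicated.

There are many solutions for finding and matching unigrams or adjacent words (n-grams). I have already implemented that myself. Now I need to find different keywords that are not adjacent and assign a label. I also don't know what is written between the keywords; it can be anything.

I am using a lexical approach: keywords are looked up in a dictionary with separate columns, so that a match can involve one, two, or three keywords. Note that each keyword is always a unigram or a bigram. Below is some code I wrote.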

import pandas as pd 

# create mock dictionary
Dict = pd.DataFrame({'word1': ['dealt', 'dealt', 'dealt', ''],
                     'word2': ['sales representative', 'sales rep', 'customer rep', 'options']
                     })

#create sample text 
texts = ["The sales representative dealt with everything.",
"The sales rep dealt with everything.",
"The agent answered all questions" ,
"The customer rep answered all questions.",
"The agent dealt with everything."]

motive = []
# only checks for the keyword in the first column
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')

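The label should only be assigned if both dealt and sales rep occur in the text. The assignment for sentences 3 and 5 is therefore wrong. So I updated the code. It runs, but no labels are assigned.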

for item in texts:
    #convert into string
    item = str(item)
    #check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    #check if keyword was found
    if tempM1 != None:
        #if yes, locate all of their positions in the dictionary 
        for i in tempM1:
            i = -1
            #get row index 
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]] 
            #gives pandas.core.indexes.base.Index
            #check if column next to given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                #match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                #if second keyword was found
                if tempM2 != None: 
                    motive.append('keyword pattern A')
                else: 
            #check again first keyword column
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
                    else: 
                        motive.append('unknown')

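How can I adjust the code above?

I am aware of regular expressions (RegEx). It seems to me that, given the number of keywords (roughly 700 to 1,000) and the combinations between them, a regex approach would need more lines of code and be less efficient. Happy to be proven wrong, though!

I also know that this could be treated as a classification problem. The project requires interpretability and transparency, so deep learning and the like are not an option. For the same reason, I am not considering embeddings.

Thanks!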

Answer 1

Could you make use of all() and any() to check whether a text contains "any" of the matches from "all" of the match lists?

phrases_to_find = [
    [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"]
    ],
    [
        ["option"]
    ]
]

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions" ,
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option."
]

motive =[]
for text in texts:
    for index, test_phrases in enumerate(phrases_to_find):
        if all(any(p in text for p in phrase) for phrase in test_phrases):
            motive.append(f'keyword pattern {index}')
            break
    else:
        motive.append('unknown')

print(motive)

That should give you:

[
    'keyword pattern 0',
    'keyword pattern 0',
    'unknown',
    'unknown',
    'unknown',
    'keyword pattern 1'
]
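
One thing to watch with plain `in` checks is that they match substrings, so "option" would also hit "optional", and matching is case-sensitive. If that matters for your data, here is a rough sketch of the same all/any idea using precompiled regexes with word boundaries. The `PATTERNS` mapping, the label "keyword pattern B", and the helper names `compile_group` and `label_text` are made up for illustration and are not from your code; only the standard `re` module is used.

```python
import re

# Hypothetical mapping: label -> list of keyword groups.
# A label applies only if EVERY group has at least one keyword in the text.
PATTERNS = {
    "keyword pattern A": [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"],
    ],
    "keyword pattern B": [
        ["option", "options"],
    ],
}


def compile_group(keywords):
    """Build one case-insensitive regex matching any keyword as a whole word."""
    # Try longer alternatives first so a bigram is preferred over its prefix.
    alternatives = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in alternatives)
    return re.compile(rf"\b(?:{body})\b", re.IGNORECASE)


# Compile once up front so each text only pays for the searches.
COMPILED = {
    label: [compile_group(group) for group in groups]
    for label, groups in PATTERNS.items()
}


def label_text(text, default="unknown"):
    """Return the first label whose groups all match somewhere in the text."""
    for label, regexes in COMPILED.items():
        if all(rx.search(text) for rx in regexes):
            return label
    return default


texts = [
    "The sales representative dealt with everything.",
    "The agent dealt with everything.",
    "This step is optional.",
    "Here are some options.",
]
print([label_text(t) for t in texts])
# ['keyword pattern A', 'unknown', 'unknown', 'keyword pattern B']
```

Because each group is a single alternation, the number of regex searches per text grows with the number of groups rather than the number of individual keywords, and the mapping stays readable enough to explain why a given label was assigned. Whether this is faster than the plain `in` version for 700 to 1,000 keywords is something you would have to measure on your own corpus.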
