Matching hundreds of non-adjacent keywords in a large text corpus in Python


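I need to match non-adjacent keywords in a large number of texts (several thousand). If a text matches, it gets a label; otherwise it gets the label "unknown".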

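For example, I want to find the keywords "sales representative" and "dealt" in the text snippet below and assign it to the category "keyword pattern A".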

Text: "The sales representative dealt with everything. It was very helpful to know that he had put together the best options for me."

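  • The keyword pattern is therefore "sales representative" + "dealt".
  • Since a sales representative may also be called a "sales rep" or a "customer rep", I need to match several alternative keywords. The same applies to the word "dealt". So you can see where things get complicated.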

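There are many solutions for finding and matching unigrams or adjacent words (n-grams), and I have already implemented that myself. Now I need to find different keywords that are not adjacent and assign a label. I also do not know what is written between the keywords; it could be anything.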


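I am using a lexicon-based approach: I look up the keywords in a dictionary with separate columns, so that it can hold patterns of one, two, or three keywords. Note that each keyword is always a unigram or a bigram. Below is some code I wrote.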
import pandas as pd

# create mock dictionary
Dict = pd.DataFrame({'word1': ['dealt', 'dealt', 'dealt', ''],
                     'word2': ['sales representative', 'sales rep', 'customer rep', 'options']
                     })

# create sample text
texts = ["The sales representative dealt with everything.",
         "The sales rep dealt with everything.",
         "The agent answered all questions",
         "The customer rep answered all questions.",
         "The agent dealt with everything."]

motive = []
# only checks for the keyword in the first column
# (note: the empty string '' in word1 is a substring of every text)
for item in texts:
    item = str(item)
    if any(x in item for x in Dict['word1']):
        motive.append('keyword pattern A')
    else:
        motive.append('unknown')

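The label should only be assigned when both "dealt" and "sales rep" appear in the text, so the assignments for sentences 3 and 5 are wrong. I therefore updated the code. It runs, but no labels are assigned at all.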

for item in texts:
    #convert into string
    item = str(item)
    #check if keyword can be found in first column
    tempM1 = {x for x in Dict['word1'] if x in item}
    #check if keyword was found
    if tempM1 != None:
        #if yes, locate all of their positions in the dictionary 
        for i in tempM1:
            i = -1
            #get row index 
            ind = Dict.index[Dict['word1'] == list(tempM1)[i+1]] 
            # ind is a pandas.core.indexes.base.Index
            # check if the column next to the given row index is not empty
            if pd.isnull(Dict['word2'].iloc[ind]) is False:
                #match keyword in second column
                tempM2 = {x for x in Dict['word2'] if x in item}
                #if second keyword was found
                if tempM2 != None: 
                    motive.append('keyword pattern A')
                else: 
                    # check the first keyword column again
                    tempM3 = {x for x in Dict['word1'] if x in item}
                    if tempM3 != None:
                        motive.append('keyword pattern A')
                    else: 
                        motive.append('unknown')

How can I adjust the code above?

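I am aware of regular expressions (RegEx). Given the number of keywords (roughly 700 to 1000) and the combinations between them, my impression is that regex would need more lines of code and be less efficient. Happy to be proven wrong, though!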

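I also know this could be treated as a classification problem. The project requires interpretability and transparency, so deep learning and similar methods are not an option. For the same reason, I am not considering embeddings.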

Thanks!

python loops match nested-loops
1 Answer

Can you leverage all() and any() to check whether a text contains "any" of the keywords from "all" of the match lists?

phrases_to_find = [
    [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"]
    ],
    [
        ["option"]
    ]
]

texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions" ,
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option."
]

motive =[]
for text in texts:
    for index, test_phrases in enumerate(phrases_to_find):
        if all(any(p in text for p in phrase) for phrase in test_phrases):
            motive.append(f'keyword pattern {index}')
            break
    else:
        motive.append('unknown')

print(motive)

This should give you:

[
    'keyword pattern 0',
    'keyword pattern 0',
    'unknown',
    'unknown',
    'unknown',
    'keyword pattern 1'
]
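
One caveat with the plain `in` test above is that it matches substrings and is case-sensitive, so "option" would also match inside "optional". If that matters, a possible variation is to compile one regular expression per keyword list and keep the same all()/any() idea. The following is only a sketch under assumptions: the names `PATTERNS`, `compile_group` and `label_text` and the label "keyword pattern B" are made up for illustration, and whole-word, case-insensitive matching may not be what every project wants.

```python
import re

# Hypothetical mapping: label -> list of keyword groups (illustrative names).
# A text gets the label when every group has at least one keyword present.
PATTERNS = {
    "keyword pattern A": [
        ["dealt"],
        ["sales representative", "sales rep", "customer rep"],
    ],
    "keyword pattern B": [
        ["option", "options"],
    ],
}


def compile_group(keywords):
    """Build one case-insensitive regex matching any keyword as a whole word."""
    # Longest keywords first, so "sales representative" is tried before "sales rep".
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


# Compile once and reuse the compiled patterns for every text.
COMPILED = {
    label: [compile_group(group) for group in groups]
    for label, groups in PATTERNS.items()
}


def label_text(text, default="unknown"):
    """Return the first label whose groups all match, in any order and at any distance."""
    for label, regexes in COMPILED.items():
        if all(rx.search(text) for rx in regexes):
            return label
    return default


texts = [
    "The sales representative dealt with everything.",
    "The sales rep dealt with everything.",
    "The agent answered all questions",
    "The customer rep answered all questions.",
    "The agent dealt with everything.",
    "Here is some option.",
]

motive = [label_text(t) for t in texts]
print(motive)
```

Because each keyword list becomes a single compiled pattern, the amount of code does not grow with the number of keywords; only the `PATTERNS` data does. The labels stay fully traceable, since each one follows directly from which groups matched.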