How can I identify text template patterns in a dataset of strings?

Question

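I am trying to find an efficient way to process a list of text records and identify the text templates commonly used in them, keeping only the fixed parts and abstracting the variable parts. I also want to count the number of records that match each identified template.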

——

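My most successful attempt so far splits each text record into an array of words, compares arrays of the same length word by word, and writes the resulting templates to a list of templates.

As you might expect, it is not perfect, and it struggles to run on datasets of more than 50,000 records.

I am wondering whether there is a text classification library that would be more efficient, or faster logic that would improve performance. My current code is very naive...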

——

Here is my first attempt in Python, using very simple logic.

samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

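# Split every record into a list of words.
samples_split = [x.split() for x in samples]
identified_templates = []

# Compare every record with every other record of the same word count.
for words_list in samples_split:
    for words_list_ref in samples_split:
        # Skip records of a different length and identical records.
        if len(words_list) != len(words_list_ref) or words_list == words_list_ref:
            continue
        template = ''
        for word, word_ref in zip(words_list, words_list_ref):
            # Keep words that match; replace differing words with '%'.
            template += ' ' + (word if word == word_ref else '%')
        identified_templates.append(template)

# Deduplicate the templates (the value is a placeholder, not a count).
templates = dict()
for template in identified_templates:
    if template not in templates:
        templates[template] = 1

# Discard templates that contain three consecutive placeholders.
templates_2 = dict()
for key in templates:
    if '% % %' not in key:
        templates_2[key] = 1

print(templates_2)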

Ideally, the code should take input like the following:

- “Your order tracking number is 123” 
- “Thank you for creating an account with us” 
- “Your order tracking number is 888”
- “Thank you for creating an account with us” 
- “Hello Jim, what is your issue?”
- “Hello Jack, what is your issue?”

and output the list of templates together with the number of records each one matches:

- “Your order tracking number is {}”,2
- “Thank you for creating an account with us”,2
- “Hello {}, what is your issue?”,2

Answer 1

You can try the code below. I hope the output matches your expectations.

import re
templates_2 = {}
samples = ['Your order 12345 has been confirmed. Thank you',
'Your order 12346 has been confirmed. Thank you',
'Your order 12347 has been confirmed. Thank you',
'Your order 12348 has been confirmed. Thank you',
'Your order 12349 has been confirmed. Thank you',
'The code for your bakery purchase is 1234',
'The code for your bakery purchase is 1237',
'The code for your butcher purchase is 1232',
'The code for your butcher purchase is 1231',
'The code for your gardening purchase is 1235']

# Replace every run of digits with a '{}' placeholder.
identified_templates = [re.sub(r'[0-9]+', '{}', asample) for asample in samples]
# Count how many records collapse to each distinct template.
unique_identified_templates = list(set(identified_templates))
for atemplate in unique_identified_templates:
    templates_2[atemplate] = identified_templates.count(atemplate)
for k, v in templates_2.items():
    print(k, ':', v)

Output:

The code for your gardening purchase is {} : 1
Your order {} has been confirmed. Thank you : 5
The code for your bakery purchase is {} : 2
The code for your butcher purchase is {} : 2
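
Note that the regular expression above only abstracts runs of digits, so a variable word such as Jim or Jack in the question's ideal input would stay part of the template. The following is a minimal sketch of one way to cover that case, not a definitive implementation. It assumes each template has at most one variable slot, and it masks one word at a time instead of comparing every pair of records. The helper names tokenize and find_templates are hypothetical and do not come from any library.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Alternate runs of word and non-word characters, so that
    # "".join(tokens) reproduces the original text exactly.
    return re.findall(r"\w+|\W+", text)


def find_templates(records):
    # Collapse exact duplicates first: line -> number of occurrences.
    line_counts = Counter(records)

    # candidate template -> {distinct line that fits it: occurrences}
    slots = defaultdict(dict)
    for line, n in line_counts.items():
        tokens = tokenize(line)
        for i, tok in enumerate(tokens):
            if re.fullmatch(r"\w+", tok) is None:
                continue  # punctuation and whitespace stay fixed
            template = "".join(tokens[:i] + ["{}"] + tokens[i + 1:])
            slots[template][line] = n

    result = {}
    covered = set()
    for template, lines in slots.items():
        # Assumption: a slot is a real variable only if at least
        # two distinct lines differ in exactly that word.
        if len(lines) >= 2:
            result[template] = sum(lines.values())
            covered.update(lines)

    # Lines that fit no template are reported as literal templates.
    for line, n in line_counts.items():
        if line not in covered:
            result[line] = n
    return result


records = [
    "Your order tracking number is 123",
    "Thank you for creating an account with us",
    "Your order tracking number is 888",
    "Thank you for creating an account with us",
    "Hello Jim, what is your issue?",
    "Hello Jack, what is your issue?",
]

for template, count in find_templates(records).items():
    print(template, ":", count)
```

For the six records from the question this reports the three expected templates, each with a count of 2. The work grows with the total number of words rather than with the square of the number of records. The trade-offs are that templates with two or more variable words are not detected, and a record that fits several one-slot templates is counted once in each of them.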
