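I am writing code to find all the words shared between a keyword list and its corresponding contract type file. The keyword list words are stored in a set, and the contract words are stored in a dictionary. I am struggling to write code that increments a count by one each time a keyword from the list is found in the corresponding contract type file. Currently, it only finds one instance of each keyword list word. I have included the entire code in case a different part is causing the problem.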
common_words_dict = Counter()
for contract_type_identifier in contract_type_sets:
    if contract_type_identifier == keyword_list_identifier:
        print(f"Processing contract type: {contract_type_identifier}")
        words_for_contract_type = contract_type_sets.get(contract_type_identifier, set())
        # Check if any word in keyword list is present in cleaned_split_words
        for word in keyword_list_words:
            if word in words_for_contract_type:
                # Check if the word is already in the dictionary, and if not, initialize it with count 0
                if word not in common_words_dict:
                    print(f"Common word found: {word}")
                    common_words_dict[word] = 0
                # Increment the count for the word
                common_words_dict[word] += 1
# Print the resulting dictionary
print(common_words_dict)
Let's take a simple example, using a couple of pages from Wikipedia.
import requests
import re
from collections import Counter
from bs4 import BeautifulSoup
url_a = 'https://en.wikipedia.org/wiki/Leonhard_Euler'
url_b = 'https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss'
txt_a = BeautifulSoup(requests.get(url_a).content, 'html.parser').get_text()
txt_b = BeautifulSoup(requests.get(url_b).content, 'html.parser').get_text()
We can count the number of occurrences of each word in each text:
cnt_a = Counter(re.split(r'\W+', txt_a.lower()))
cnt_b = Counter(re.split(r'\W+', txt_b.lower()))
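One detail worth knowing: `re.split(r'\W+', ...)` can produce an empty string when the text starts or ends with a non-word character, and that empty string is then counted like any other word. The sketch below uses a made-up sample string (not taken from the pages above) to compare it with `re.findall(r'\w+', ...)`, which returns only the matched words.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
sample = "Euler's formula: the formula, the identity."

# Splitting on runs of non-word characters leaves an empty token
# after the trailing '.'.
split_counts = Counter(re.split(r'\W+', sample.lower()))

# Matching runs of word characters never yields an empty token.
findall_counts = Counter(re.findall(r'\w+', sample.lower()))

print(split_counts[''])           # count of the empty token from re.split
print(findall_counts[''])         # count of the empty token from re.findall
print(findall_counts['formula'])  # count of a real word
```

For counting specific keywords, the stray empty token is harmless, but it does show up if you list the most common tokens.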
Then, for fun, let's find all the words common to both texts:
common = dict(sorted({
k: (cnt_a[k], cnt_b[k]) for k in cnt_a & cnt_b
}.items(), key=lambda kv: sum(kv[1]), reverse=True))
>>> common
{'the': (596, 991),
'of': (502, 874),
'in': (226, 576),
'and': (215, 365),
'a': (191, 349),
'to': (171, 278),
'gauss': (4, 428),
'euler': (336, 6),
'his': (82, 238),
's': (161, 123),
'he': (81, 200),
'with': (91, 163),
...}
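The `&` operator on two `Counter` objects keeps only the keys present in both and takes the minimum of the two counts, which is why iterating over `cnt_a & cnt_b` visits exactly the shared words. Here is a small sketch with made-up counters (`left` and `right` are illustrative names, not part of the example above):

```python
from collections import Counter

# Hypothetical counts, used only to illustrate Counter intersection.
left = Counter({'theorem': 3, 'euler': 5, 'integral': 1})
right = Counter({'theorem': 2, 'gauss': 4, 'integral': 6})

# Intersection: keys found in both counters, each with the smaller count.
shared = left & right
print(shared)

# Look up the original per-text counts for every shared word.
pairs = {k: (left[k], right[k]) for k in shared}
print(pairs)
```

Words that occur in only one counter (`'euler'`, `'gauss'`) drop out of the intersection entirely.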
But what if we only want to look up certain words?
wordset = {'mathematical', 'equation', 'theorem', 'university', 'integral'}
>>> {k: cnt_a.get(k, 0) for k in wordset}
{'equation': 18,
'integral': 17,
'theorem': 23,
'university': 24,
'mathematical': 49}
>>> {k: cnt_b.get(k, 0) for k in wordset}
{'equation': 3,
'integral': 6,
'theorem': 34,
'university': 38,
'mathematical': 32}
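Applied to the original question, the same idea might look like the sketch below. A likely reason the loop in the question finds each keyword only once is that the contract words are stored in a set, which discards repeats, so a membership test can succeed at most once per keyword. The sketch assumes the contract words are instead kept as a list per contract type. The names `contract_words`, `keyword_sets` and `count_keywords`, and the sample data, are hypothetical.

```python
from collections import Counter

# Hypothetical data: a list of words (not a set) per contract type,
# so repeated words are preserved.
contract_words = {
    'lease': ['the', 'tenant', 'shall', 'pay', 'rent',
              'and', 'the', 'tenant', 'agrees'],
}

# Hypothetical keyword sets, keyed by the same contract type identifiers.
keyword_sets = {
    'lease': {'tenant', 'rent', 'landlord'},
}


def count_keywords(words, keywords):
    """Return {keyword: occurrences} for keywords that appear in words."""
    counts = Counter(words)
    return {k: counts[k] for k in keywords if counts[k] > 0}


# Count the keywords of each contract type in that type's word list.
common_words = {
    contract_type: count_keywords(contract_words.get(contract_type, []), keywords)
    for contract_type, keywords in keyword_sets.items()
}
print(common_words)
```

Building the `Counter` once per contract type and then looking up each keyword avoids rescanning the word list for every keyword.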