I have a saved HTML file, and I want to count how many times a given string occurs in it. For example:
string= 'Beautiful days'
text = "those beautiful days were unforgettable. I wish every day was a beautiful day"
output expected = 2 (beautiful days, beautiful day)
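What I tried: I tried using spaCy but couldn't get it to work. Can anyone explain the logic for doing this?
You can use a stemmer. It may be overkill, but it also matches closely related word forms (for example, "days" and "day").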
import nltk
nltk.download('punkt')  # tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "those beautiful days were unforgettable. I wish every day was a beautiful day"
words = word_tokenize(sentence)
sentence = ""
for w in words:
    sentence += (ps.stem(w.lower()) + " ")
query = 'Beautiful days'
words = word_tokenize(query)
query = ""
for w in words:
    query += (ps.stem(w.lower()) + " ")
print(sentence)
print(query)
print(sentence.count(query))
those beauti day were unforgett . i wish everi day wa a beauti day
beauti day
2
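One caveat with counting on the joined string: the trailing space guards the right edge of the match, but nothing guards the left edge, so a token that merely ends in "beauti" could in principle be counted as well. A minimal sketch of a token-level comparison that avoids this is below. The helper name count_token_sequence is hypothetical, and it assumes the tokens have already been lowercased and stemmed as in the code above.

```python
def count_token_sequence(tokens, query_tokens):
    """Count positions where query_tokens appears as a contiguous run in tokens."""
    n = len(query_tokens)
    if n == 0:
        return 0
    # Slide a window of length n over the tokens and compare whole tokens,
    # so a match can never start or end in the middle of a word.
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == query_tokens
    )


# Stemmed output taken from the run above, split back into tokens.
stemmed_sentence = "those beauti day were unforgett . i wish everi day wa a beauti day".split()
stemmed_query = "beauti day".split()

print(count_token_sequence(stemmed_sentence, stemmed_query))  # 2
```

Unlike str.count, this counts overlapping occurrences too, which only matters for queries that can overlap with themselves.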
You can also use a regular expression with re.findall:
import re
with open("count_string_in_file.txt") as f:
    html = f.read()
to_match = "beautiful day"
matches = re.findall(to_match, html, re.IGNORECASE)
print(len(matches), matches)
# 2 ['beautiful day', 'beautiful day']
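Since the input is an HTML file, a plain regex over the raw markup can miss a phrase that is split by inline tags, and any regex metacharacters in the search string would be interpreted as pattern syntax. The following is a rough sketch using only the standard library. TextExtractor and count_phrase are hypothetical names, and the optional trailing "s" is a crude stand-in for stemming, so it assumes you pass the singular form of the phrase.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def count_phrase(html, phrase):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # re.escape neutralises regex metacharacters in the phrase, \s+ tolerates
    # any whitespace between words, \b keeps the match from starting or ending
    # mid-word, and s? accepts a simple plural.
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"s?\b"
    return len(re.findall(pattern, text, re.IGNORECASE))


sample = (
    "<p>Those <b>beautiful</b> days were unforgettable.</p>"
    "<p>I wish every day was a beautiful day</p>"
)
print(count_phrase(sample, "beautiful day"))  # 2
```

Note that this simple extractor also collects the contents of script and style elements, and joining text nodes with a space will split a word that is broken up by an inline tag, so treat it as a starting point only.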