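I wrote a parser for a site that requires authorization, and I end up with one big block that contains the information I need mixed with HTML junk. How can I keep only the text, such as INEEDTHISTEXT?
What I have now: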
[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
<div class="dnevnik-lesson__attach">
<a class="button button--outline button--purple" href="example.link"><i class="fal fa-fw fa-file-powerpoint"></i><span class="button__title">example.pptx</span></a>
</div>
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
<div class="dnevnik-lesson__attach">
<a class="button button--outline button--purple" href="линкf"><i class="fal fa-fw fa-file-pdf"></i><span class="button__title">тесты Чеботарева-113-114.pdf</span></a>
</div>
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>]
What I need:
INEEDTHISTEXT INEEDTHISTEXT
and so on.
I tried replace() and [cut:] slicing, but neither worked.
You should use an HTML parser such as Beautiful Soup, but here is a simple regular expression that should work for this case. It may not work for every web page:
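import re
# The scraped block as a string (shortened here).
html = """[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
...
"""
# Capture a run of word characters that follows ">" and is followed by whitespace and then "<".
print(re.findall(r">(\w+)\s+<", html))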
This prints
['INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT']
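For comparison, here is a minimal sketch of the parser-based route. It uses the standard library's html.parser rather than Beautiful Soup, so it runs without installing anything. The names TaskTextParser and extract_task_texts are made up for this illustration, and the second task text in the sample ("Read pages 10-12") is invented to show that multi-word text survives. It assumes every tag inside a task div is explicitly closed, as in the block from the question; an unclosed void tag such as a bare <br> would throw off the depth counter.

```python
from html.parser import HTMLParser


class TaskTextParser(HTMLParser):
    """Collect text that sits directly inside <div class="dnevnik-lesson__task">."""

    TASK_CLASS = "dnevnik-lesson__task"

    def __init__(self):
        super().__init__()
        self.tasks = []    # finished task texts, in document order
        self._depth = 0    # 0 = outside a task div, 1 = directly inside it
        self._parts = []   # text pieces of the task div currently being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a task div: any nested tag goes one level deeper.
            self._depth += 1
        elif tag == "div" and self.TASK_CLASS in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # The task div itself just closed: store its own text.
            self.tasks.append("".join(self._parts).strip())

    def handle_data(self, data):
        # Keep only text that is a direct child of the task div,
        # which skips the attachment file names nested deeper.
        if self._depth == 1:
            self._parts.append(data)


def extract_task_texts(html):
    parser = TaskTextParser()
    parser.feed(html)
    parser.close()
    return parser.tasks


sample = """[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>Read pages 10-12
<div class="dnevnik-lesson__attach">
<a class="button button--outline button--purple" href="example.link"><i class="fal fa-fw fa-file-powerpoint"></i><span class="button__title">example.pptx</span></a>
</div>
</div>]"""

print(" ".join(extract_task_texts(sample)))
```

Unlike the \w+ pattern, this keeps spaces and punctuation inside the task text and ignores the attachment file names, because it only collects text that is a direct child of the task div.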