How do I remove extra text from a variable? [Python]


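I wrote a parser for a site that requires authorization, and what I end up with is one big block that mixes the information I need with HTML junk. How can I keep only the text, such as

INEEDTHISTEXT
and drop everything else?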

What I have now:

[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
                <div class="dnevnik-lesson__attach">
<a class="button button--outline button--purple" href="example.link"><i class="fal fa-fw fa-file-powerpoint"></i><span class="button__title">example.pptx</span></a>
</div>
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT 
 
 
                <div class="dnevnik-lesson__attach">
<a class="button button--outline button--purple" href="линкf"><i class="fal fa-fw fa-file-pdf"></i><span class="button__title">тесты Чеботарева-113-114.pdf</span></a>
</div>
</div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
 
 
 
            </div>]

What I need:

INEEDTHISTEXT INEEDTHISTEXT

and so on.

I tried replace() and slicing ([cut:]), but neither worked.

python parsing
1 answer

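You should use an HTML parser such as Beautiful Soup, but here is a simple regular expression that should work for this case. It only captures a single run of word characters that follows a ">" and is followed by whitespace and a "<", so it may not work for other pages: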

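import re

# The scraped markup from the question (shortened here).
html = """[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT
...
"""

# Capture each run of word characters that follows ">" and is
# followed by whitespace and then "<".
print(re.findall(r">(\w+)\s+<", html))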

This prints:

['INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT',
 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT', 'INEEDTHISTEXT']
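
The answer recommends a real HTML parser over a regular expression. As a rough sketch of that idea, the following uses Python's standard-library html.parser instead of Beautiful Soup (a substitution, chosen so the example runs without installing anything). The names TaskTextParser and extract_task_texts, the sample string, and its second task text are made up for illustration; only the class name dnevnik-lesson-icon comes from the markup in the question. It assumes the wanted text always sits directly after the icon's closing </i>.

```python
from html.parser import HTMLParser


class TaskTextParser(HTMLParser):
    """Collect the text that directly follows each task icon <i> element."""

    ICON_CLASS = "dnevnik-lesson-icon"  # class name taken from the question's markup

    def __init__(self):
        super().__init__()
        self.texts = []        # cleaned text, one entry per task block
        self._in_icon = False  # True while inside the icon <i> element
        self._capture = False  # True right after the icon's closing </i>

    def handle_starttag(self, tag, attrs):
        # Any opening tag (for example the nested attachment <div>) ends the wanted text.
        self._capture = False
        classes = (dict(attrs).get("class") or "").split()
        self._in_icon = tag == "i" and self.ICON_CLASS in classes

    def handle_endtag(self, tag):
        # Start capturing only after the icon's </i>, not after other <i> tags.
        self._capture = tag == "i" and self._in_icon
        self._in_icon = False

    def handle_data(self, data):
        if self._capture:
            text = data.strip()  # drops the trailing newlines and padding
            if text:
                self.texts.append(text)
            self._capture = False


def extract_task_texts(html):
    """Hypothetical helper: return the task texts found in an HTML string."""
    parser = TaskTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Made-up sample shaped like the dump in the question, including an attachment block.
sample = """[<div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>INEEDTHISTEXT


            </div>, <div class="dnevnik-lesson__task">
<i class="dnevnik-lesson-icon"></i>Another task text
                <div class="dnevnik-lesson__attach">
<a class="button" href="example.link"><i class="fal fa-fw fa-file-pdf"></i><span class="button__title">example.pdf</span></a>
</div>
</div>]"""

print(extract_task_texts(sample))
```

Unlike the regular expression, this keeps task text that contains spaces and skips the attachment file names, because it only reads what follows the icon element. If the variable holds a list of parsed tags rather than a string, passing str(variable) should give markup like the dump in the question, though that is an assumption about how the data was obtained.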