I decided to go with the following approach. I don't know whether it is the most efficient, but it works.
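# `string` holds the input: div elements separated by ", "
# Mark only the separators between elements with "|", then split on it
test5 = string.replace(">, <", ">|<")
options = test5.split("|")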
This approach does not require fixing up the HTML string in place first.
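For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same replace-and-split idea. The `html` string is a shortened, made-up stand-in for the real input, and the approach assumes that neither `|` nor the sequence `>, <` ever appears inside the option text.

```python
# Shortened stand-in for the input: div elements joined by ", ".
html = (
    '<div class="options mceEditable">First option, with a comma</div>, '
    '<div class="options mceEditable">Second option</div>, '
    '<div class="options mceEditable">Third option</div>'
)

# Turn only the separators between elements into "|", then split on it.
# Commas inside the text are untouched because they are not wrapped in "> <".
options = html.replace(">, <", ">|<").split("|")

for option in options:
    print(option)
```

If the text could contain `|`, splitting directly on `">, <"` and re-adding the stripped angle brackets would be one way around it.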
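Normally I would not recommend regular expressions for anything related to XML/HTML, but your input has already been processed into a form that is no longer valid markup. If you cannot fix it at the data source, I'd say a regex is acceptable in this case: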
import re
s = '<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>, <div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>, <div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>, <div class="options mceEditable">The proteins may either be carriers or receptors only</div>, <div class="options mceEditable">It is a 3-layered lipid structure</div>'
# Non-greedy .*? stops at the first closing tag, so each div is matched separately
pattern = r'<div class="options mceEditable">.*?</div>'
matches = re.findall(pattern, s, re.U)
for m in matches:
    print(m)
Output:
<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>
<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>
<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>
<div class="options mceEditable">The proteins may either be carriers or receptors only</div>
<div class="options mceEditable">It is a 3-layered lipid structure</div>
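If only the text inside each element is needed, a capturing group can be added to the same pattern. This is a sketch rather than part of the original answer: the `html` sample is a shortened, made-up stand-in, and it assumes every element carries exactly the `class="options mceEditable"` attribute and contains no nested `div`.

```python
import re

# Shortened stand-in for the input string.
html = (
    '<div class="options mceEditable">First option, with a comma</div>, '
    '<div class="options mceEditable">Second option</div>'
)

# With a single capturing group, re.findall returns the group contents
# instead of the whole match. re.S lets "." span line breaks in the text.
pattern = r'<div class="options mceEditable">(.*?)</div>'
texts = re.findall(pattern, html, re.S)

for text in texts:
    print(text)
```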
You can use BeautifulSoup:
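# pip install beautifulsoup4
import bs4
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = bs4.BeautifulSoup(s, 'html.parser')
divs = soup.find_all('div')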
Output:
>>> divs
[<div class="options mceEditable">The membrane is a dynamic structure, and its constituents are in constant movement.</div>,
<div class="options mceEditable">The lipids component of the membrane constitutes a bilayer of hydrophilic ends</div>,
<div class="options mceEditable">The lipid content of the membrane is more than that of the protein</div>,
<div class="options mceEditable">The proteins may either be carriers or receptors only</div>,
<div class="options mceEditable">It is a 3-layered lipid structure</div>]
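If installing a third-party package is not an option, roughly the same result can be sketched with the standard library's `html.parser`. The class name `DivTextCollector` and the sample string are hypothetical, chosen only for illustration; the sketch collects the text of each top-level `div` and ignores the `, ` separators between them.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collects the text content of each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <div> elements are currently open
        self.texts = []   # one entry per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside any <div> (the ", " separators) is skipped.
        if self.depth > 0:
            self.texts[-1] += data


# Shortened, made-up stand-in for the input string.
html = (
    '<div class="options mceEditable">First option, with a comma</div>, '
    '<div class="options mceEditable">Second option</div>'
)

collector = DivTextCollector()
collector.feed(html)
collector.close()
print(collector.texts)
```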