I have the following string:
html = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li></ol>'
Is there a Python library that can convert it into the following string?
'a. hello'
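I don't know of a library for this, but you can still do it with BeautifulSoup and the built-in string module. It is a manageable edge case, because there are only so many different list style types.
The string you are parsing is structured: if one of these special list style types is present, the style tag tells you which type applies.
Using BeautifulSoup and the built-in string module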
from bs4 import BeautifulSoup
import string

html = """
<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li><li>hi</li><li>Hey</li></ol>
"""

soup = BeautifulSoup(html, "lxml")

# The CSS property that declares the marker style
hidden_tag = "list-style-type:"

# The style that was listed, if there is one
style = None

# First parse the style tags
for s in soup.find_all("style"):
    # If the style tag is for a list, and the hidden tag type is there
    if "li" in s.text and hidden_tag in s.text:
        text = s.text
        # Grab whatever lies between the end of the hidden tag and the next ";"
        # that follows it, and strip white space
        start = text.find(hidden_tag) + len(hidden_tag)
        style = text[start:text.find(";", start)].strip()

# Create a list of the different style types you could encounter.
# For this, I just used lower/upper-alpha but you could use ascii character codes to represent many other styles
style_sets = {
    "lower-alpha": list(string.ascii_lowercase),
    "upper-alpha": list(string.ascii_uppercase),
    # Etc. for whatever possible styles you might encounter
}

# Iterate through the soup as you normally would using bs4.
for ol in soup.find_all("ol"):
    # The counter restarts for every list
    ind = 0
    for li in ol.find_all("li"):
        # If the style type is one we know, print the item with its marker
        # (these marker lists only cover the first 26 items)
        if style in style_sets:
            print("{}. {}".format(style_sets[style][ind], li.text))
            ind += 1
        else:
            print(li.text)
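This gives you the following output. The counter restarts for every <ol>, so the second group comes from a second list with the items hello2, hi2 and Hey2, which the HTML snippet above does not include:
a. hello
b. hi
c. Hey
a. hello2
b. hi2
c. Hey2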
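The slicing between "list-style-type:" and ";" assumes that the declaration is terminated by a semicolon and that it belongs to the li rule. A slightly more tolerant sketch, using only the standard library, is shown here. The helper name li_list_style and both regular expressions are my own assumptions, not part of the answer above, and this is not a real CSS parser (it ignores comments, @media blocks, specificity and so on).

```python
import re

# Hypothetical helper, not from the original answer: find the
# list-style-type declared in a rule whose selector ends in `li`.
LI_RULE = re.compile(r"(?:^|[\s,}])li\s*\{([^}]*)\}")
STYLE_DECL = re.compile(r"list-style-type\s*:\s*([\w-]+)")


def li_list_style(css):
    """Return the list-style-type set on `li`, or None if there is none."""
    for body in LI_RULE.findall(css):
        match = STYLE_DECL.search(body)
        if match:
            return match.group(1)
    return None


print(li_list_style("li { list-style-type: lower-alpha; }"))               # lower-alpha
print(li_list_style("p { color: red; } li{list-style-type:upper-alpha}"))  # upper-alpha
print(li_list_style("ul { margin: 0; }"))                                  # None
```

The returned value can be used as the key into style_sets in the same way as the style variable.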
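You can use a headless browser driven by Selenium WebDriver to do this, because we need Window.getComputedStyle() to see which ol li items have a list-style-type value of lower-alpha. There is no way to read the text of a list item's numeric or alphabetic index directly.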
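We can generate these markers ourselves from the CSS and the HTML attributes. HTML lists can get quite complex: they may contain more than 26 items, in which case the letters have to continue as aa., ab., etc. There are also the start and reversed attributes of ol. start defines where the numbering begins; for example, with <ol start="3"> the count starts at the letter c. The reversed attribute displays the list count in reverse order: c., b., a., etc. We need to handle both cases.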
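The numbering rules above can be worked through in plain Python before any browser is involved. This is a minimal sketch under my own assumptions: the names alpha_marker and ol_markers are hypothetical, a reversed list without a start attribute is assumed to count down from the number of items, and ordinals below 1 simply produce an empty string, which is not necessarily what a browser renders.

```python
import string


def alpha_marker(n):
    """Bijective base-26 numbering: 1 -> a, 26 -> z, 27 -> aa, 28 -> ab."""
    letters = string.ascii_lowercase
    out = ""
    while n > 0:
        # Shift to 0-based before dividing so that 26 maps to "z", not "a0"
        n, rem = divmod(n - 1, len(letters))
        out = letters[rem] + out
    return out


def ol_markers(count, start=None, reversed_=False):
    """Hypothetical helper: markers for `count` items of a single <ol>."""
    if start is None:
        # Assumption: without `start`, a reversed list begins at its item count
        start = count if reversed_ else 1
    step = -1 if reversed_ else 1
    return [alpha_marker(start + i * step) for i in range(count)]


print(ol_markers(3))                  # ['a', 'b', 'c']
print(ol_markers(3, start=3))         # ['c', 'd', 'e']
print(ol_markers(3, reversed_=True))  # ['c', 'b', 'a']
print(ol_markers(2, start=27))        # ['aa', 'ab']
```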
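Install Selenium using pip:
pip3 install selenium
Download the Chrome WebDriver and add it to the system PATH. Take care to pick the version that matches your local Chrome installation. At the time of writing, the latest Chrome is 80.0.3987.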
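lower-alpha counter
In this script I used a live URL, but check the bottom of the script, where I explain how to use a 'data:text/html;' URL to pass WebDriver some custom HTML like yours. The script also handles pages that contain multiple lists: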
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import string


def get_content(link):
    # Uses the module-level `driver` created in the __main__ block
    driver.get(link)
    # Get all page ordered lists
    for ol in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ol"))):
        # Get all items from the current ordered list
        # (find_elements_by_css_selector was removed in Selenium 4)
        list_items = ol.find_elements(By.CSS_SELECTOR, "li")
        list_items_count = len(list_items)
        # Get the list start attribute, falling back to 1 if not present
        ol_start = int(ol.get_attribute("start") or 1)
        # Get the list reversed attribute, will return None if not present
        ol_reversed = ol.get_attribute("reversed")
        # Print information about the ordered list
        print("OL with %s items starting at %s, reversed: %s" % (
            list_items_count,
            ol_start,
            "yes" if ol_reversed else "no"))
        # Counter for the letters.
        # If the list is reversed begin count from the last item to the first,
        # else count from first (start) to last
        li_letter = list_items_count if ol_reversed else ol_start
        # Keep count how many list items found with lower-alpha list-style-type
        list_items_found = 0
        for li in list_items:
            # Execute javascript getComputedStyle to get the list item computed style
            list_style_type = driver.execute_script("return window.getComputedStyle(arguments[0])['list-style-type']", li)
            # If the list item computed style 'list-style-type' has 'lower-alpha' value
            if list_style_type == "lower-alpha":
                # Print generated alpha counter and the item text
                print("%s. %s" % (get_alpha_num(li_letter), li.text))
                # If the list is reversed, decrease letter by 1, else increase it
                li_letter += -1 if ol_reversed else 1
                # Keep counting how many items found with 'lower-alpha'
                list_items_found += 1
        # If no items with 'lower-alpha' found do something
        if list_items_found == 0:
            print("No list items found with 'lower-alpha' list style type")
        print()


# Function to convert numbers to letters: 1 => a, 26 => z, 27 => aa
def get_alpha_num(num):
    letters = string.ascii_lowercase
    letters_count = len(letters)
    result = ''
    # Bijective base-26: peel off the last letter, then carry on with the rest.
    # This also stays correct past "zz" (703 => aaa).
    while num > 0:
        num, remainder = divmod(num - 1, letters_count)
        result = letters[remainder] + result
    return result


if __name__ == '__main__':
    URL = 'https://zikro.gr/dbg/html/lists.html'

    # If you want to parse HTML code from a string
    # then you can use a 'data:text/html;' URL with the HTML contents like this:
    #
    # html_content = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li><li>there</li></ol>'
    # URL = "data:text/html;charset=utf-8,{html_content}".format(html_content=html_content)
    #
    # Will result to this:
    # OL with 2 items starting at 1, reversed: no
    # a. hello
    # b. there

    chrome_options = Options()
    # Make headless
    # chrome_options.add_argument("--headless")

    with webdriver.Chrome(options=chrome_options) as driver:
        get_content(URL)
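One caveat about the 'data:text/html;' trick in the script comments: the HTML is placed into the URL unescaped, so markup containing characters such as # or % may be misread as URL syntax (a # normally starts the fragment part of a URL). A minimal sketch of percent-encoding the markup first, using only the standard library. The variable names are illustrative, and I have not checked how every browser treats unescaped data URLs.

```python
from urllib.parse import quote, unquote

html_content = ('<style>li { list-style-type: lower-alpha; }</style> '
                '<ol><li>hello</li><li>there</li></ol>')

# Percent-encode every reserved character (safe="" also encodes "/")
# so that nothing in the markup can be mistaken for URL syntax.
url = "data:text/html;charset=utf-8," + quote(html_content, safe="")

# Sanity check: decoding the payload gives back the original markup
payload = url.split(",", 1)[1]
assert unquote(payload) == html_content
print(url)
```

The resulting url can be passed to get_content() in place of the live URL.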
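If you parse a page with multiple lists, like the live URL used in the script, which has this CSS:
#ol-a-css li {
    list-style-type: lower-alpha;
}
you will get a result like this: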