Get the final rendered text of HTML

Question · Votes: 1 · Answers: 2

I have the following string:

html = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li></ol>'

Is there any Python library that can convert it into the following string?

'a. hello'
python html html-parsing
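2 Answers
0 votes

I don't know of a library that does this, but you can still do it with BeautifulSoup and the built-in string module. This is an edge case that can be handled by hand, because there are only so many different list-style types.

The string you are parsing is structured: if one of these special list-style types is used, the style tag tells you which type applies.

Using BeautifulSoup and the built-in string module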

from bs4 import BeautifulSoup
import string


html = """
<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li><li>hi</li><li>Hey</li></ol>
"""

soup = BeautifulSoup(html, "lxml")

# The CSS property that carries the list marker style
hidden_tag = "list-style-type:"

# The style that was listed if there are any
style = None
# First parse the style tags
for s in soup.find_all("style"):
    # .string returns the raw CSS inside the tag (None if the tag is empty)
    text = s.string or ""
    # If the style tag mentions li, and the property is there
    if "li" in text and hidden_tag in text:
        # Grab whatever lies between the property name and the next ";"
        # (or the end of the text if there is no ";"), then strip
        # white space and a trailing "}"
        start = text.find(hidden_tag) + len(hidden_tag)
        end = text.find(";", start)
        value = text[start:] if end == -1 else text[start:end]
        style = value.strip(" \t\n}")

# Create a list of the different style types you could encounter.
# For this, I just used lower/upper-alpha but you could use ascii character codes to represent many other styles
style_sets = {
    "lower-alpha": list(string.ascii_lowercase),
    "upper-alpha": list(string.ascii_uppercase),
    # Etc. for whatever possible styles you might encounter
}

# Iterate through the soup as you normally would using bs4.
for ol in soup.find_all("ol"):
    # None if no style was found or the style is not in style_sets
    markers = style_sets.get(style)
    for ind, li in enumerate(ol.find_all("li")):
        # If the style type is known we print the item using that style type
        # (these marker sets only cover the first 26 items)
        if markers is not None and ind < len(markers):
            print("{}. {}".format(markers[ind], li.text))
        else:
            print(li.text)

For the HTML above, this gives you the following output:

a. hello
b. hi
c. Hey

0 votes

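You can do this with a headless browser driven by something like Selenium WebDriver, because we need Window.getComputedStyle() to see which ol li items have the lower-alpha list-style-type value. There is no way to read the text of a list item's numeric/alphabetic marker directly.

We can, however, generate these markers from the CSS and the HTML attributes. HTML lists can get quite complicated, because they may contain more than 26 items, in which case the letters must continue as aa. ab. and so on. There are also the start and reversed ol attributes. start defines where the numbering begins; for example, with <ol start="3"> the count begins at the letter c. The reversed attribute displays the list count in reverse order, c. b. a. and so on. We need to handle both cases.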
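The numbering rules just described can be sketched in plain Python, independent of any browser. This is a minimal illustration under stated assumptions: the names `to_alpha` and `alpha_markers` are hypothetical, a reversed list without a start value is assumed to begin at its item count, and values below 1 are rejected rather than rendered the way a browser would render them.

```python
import string


def to_alpha(n):
    """Convert 1 -> 'a', 26 -> 'z', 27 -> 'aa' (bijective base 26)."""
    if n < 1:
        raise ValueError("alphabetic markers are only defined for n >= 1")
    letters = string.ascii_lowercase
    out = ""
    while n > 0:
        # Peel off the last letter, then continue with the remaining prefix
        n, rem = divmod(n - 1, len(letters))
        out = letters[rem] + out
    return out


def alpha_markers(count, start=None, reversed_=False):
    """Markers for a list of `count` items, honouring start and reversed."""
    if start is None:
        # Assumption: a reversed list counts down from its item count
        start = count if reversed_ else 1
    step = -1 if reversed_ else 1
    return [to_alpha(start + i * step) for i in range(count)]


print(alpha_markers(3, start=3))        # like <ol start="3">
print(alpha_markers(3, reversed_=True)) # like <ol reversed>
print(to_alpha(27), to_alpha(28))       # items past z
```

Building each marker by prepending the remainder letter avoids the carry mistakes that are easy to make when treating the letters as ordinary base-26 digits, since there is no zero digit in this scheme.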

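Instructions to install Selenium with Chrome WebDriver

  1. Install Selenium using pip:

    pip3 install selenium
  2. Download the Chrome WebDriver and add it to the system PATH. Be careful to choose the version that matches your local Chrome installation. The latest Chrome at the time of writing was 80.0.3987.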

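Python script that scrapes ordered list items and generates lower-alpha counters

In this script I used a live URL, but see the bottom of the script, where I explain how to use 'data:text/html;' to pass the WebDriver some custom HTML like the one you have.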

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import string


def get_content(driver, link):
    driver.get(link)

    # Get all page ordered lists
    for ol in WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ol"))):
        # Get all items from the current ordered list
        # (find_elements_by_css_selector was removed in recent Selenium 4 releases)
        list_items = ol.find_elements(By.CSS_SELECTOR, "li")
        list_items_count = len(list_items)

        # Get the list start value, fall back to 1 if nothing is returned
        ol_start = int(ol.get_attribute("start") or 1)

        # Get the list reversed attribute, will return None if not present
        ol_reversed = ol.get_attribute("reversed")

        # Print information about the ordered list
        print("OL with %s items starting at %s, reversed: %s" % (
            list_items_count,
            ol_start,
            "yes" if ol_reversed else "no"))

        # Counter for the letters.
        # If the list is reversed begin count from the last item to the first,
        # else count from first (start) to last
        li_letter = list_items_count if ol_reversed else ol_start

        # Keep count how many list items found with lower-alpha list-style-type
        list_items_found = 0

        for li in list_items:
            # Execute javascript getComputedStyle to get the list item computed style
            list_style_type = driver.execute_script("return window.getComputedStyle(arguments[0])['list-style-type']", li)

            # If the list item computed style 'list-style-type' has 'lower-alpha' value
            if list_style_type == "lower-alpha":
                # Print generated alpha counter and the item text
                print("%s. %s" % (get_alpha_num(li_letter), li.text))

                # If the list is reversed, decrease letter by 1, else increase it
                li_letter += -1 if ol_reversed else 1

                # Keep counting how many items found with 'lower-alpha'
                list_items_found += 1

        # If no items with 'lower-alpha' found do something
        if list_items_found == 0:
            print("No list items found with 'lower-alpha' list style type")
        print()

# Function to convert numbers to letters 1 => a, 26 => z, 27 => aa
def get_alpha_num(num):
    letters = string.ascii_lowercase
    letters_count = len(letters)
    result = ''

    # Bijective base 26: take the last letter, then continue with the prefix.
    # Returns an empty string for num < 1.
    while num > 0:
        num, remainder = divmod(num - 1, letters_count)
        result = letters[remainder] + result
    return result

if __name__ == '__main__':
    URL = 'https://zikro.gr/dbg/html/lists.html'

    # If you want to parse HTML code from a string
    # then you can use a 'data:text/html;' URL with the HTML contents like this:
    # 
    # html_content = '<style>li { list-style-type: lower-alpha; }</style> <ol><li>hello</li><li>there</li></ol>'
    # URL = "data:text/html;charset=utf-8,{html_content}".format(html_content=html_content)
    # 
    # Will result to this:
    #  OL with 2 items starting at 1, reversed: no
    #  a. hello
    #  b. there

    chrome_options = Options()

    # Make headless
    # chrome_options.add_argument("--headless")

    with webdriver.Chrome(options=chrome_options) as driver:
        get_content(driver, URL)

Result

If you analyze a page that contains multiple lists, such as the page at the URL used in the script, where lower-alpha is applied by a CSS rule like this one:

#ol-a-css li {
  list-style-type: lower-alpha;
}

you will get output in the format shown in the script's comments: one summary line per ordered list, followed by its lettered items.
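The script's comments build the 'data:text/html;' URL by plain string formatting. A small sketch of a safer variant follows, assuming the HTML may contain characters such as # (as in an id selector), which as far as I know would otherwise be read as the start of a URL fragment. The helper name `html_to_data_url` and the sample HTML are hypothetical; `urllib.parse.quote` is from the standard library.

```python
from urllib.parse import quote


def html_to_data_url(html):
    """Percent-encode an HTML string and wrap it in a data: URL."""
    return "data:text/html;charset=utf-8," + quote(html)


# Hypothetical sample HTML that uses an id selector containing '#'
html_content = (
    '<style>#ol-a-css li { list-style-type: lower-alpha; }</style>'
    '<ol id="ol-a-css"><li>hello</li><li>there</li></ol>'
)
URL = html_to_data_url(html_content)
print(URL)
```

The resulting URL can be passed to driver.get() in place of the live URL.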